Converse with Data via PandasAI and LLM

Enhance Your Data Analysis with PandasAI and Large Language Models


Generative AI is highly useful for data analysts and data scientists: it can enhance data processing, generate insights, and automate repetitive tasks.

Typical applications include automated data preparation, feature engineering, enhanced data understanding, natural language processing (NLP), and predictive analysis.

With generative AI, we can ideally just talk to our data in natural language to obtain more sophisticated and comprehensive insights, ultimately leading to more efficient, accurate, and innovative data analysis.

In this article, we will use PandasAI, a Python library that makes it easy to converse with our data in natural language, helping us explore, clean, and analyze it. Beyond querying, PandasAI can visualize data through graphs, cleanse datasets by addressing missing values, and improve data quality through feature generation, making it a comprehensive tool for data scientists and analysts.


Prerequisites

  • Postgres database (we will use Docker to host it in this article; this is optional and only needed if you want to try the database connector)

  • Python v3.11 or above

Local LLM Setup via Ollama

You need a minimum of 8 GB of RAM for this to run; with less, you are likely to face performance issues. Even at 8 GB, responses will still be slow, but that is the minimum.

PandasAI has support for its in-house Bamboo LLM; however, as it is in beta, we may or may not get access to it.

One can go with OpenAI too for a quick setup, but in this article we want to set up the Mistral LLM locally via Ollama and use it for our data analysis. (We will also show how to use an OpenAI-based LLM quickly along the way.)

With an open-source LLM, you don't have to worry about tokens and usage limits in the long term.

Mistral can be set up via Ollama quickly as below:

  • Download Ollama from here as per your OS

  • Just run the installer and let the process complete. Verify that the installation is complete and successful by executing ollama --version

    If this prints a version number, the installation is successful.

  • We will be using the mistral LLM model in this article. All other open-source LLM models that can be run via Ollama can be found here, including llama3, Gemma, etc.

  • Execute the below command to run the mistral LLM model locally:

      ollama run mistral

    This might take some time depending on your internet speed, as the model being pulled from the server is over 4 GB.

Once mistral is downloaded, the same command will also serve it.

We will quickly test that our model works over HTTP by running the following cURL command:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Here is a story about llamas eating grass"
}'

Or chat with the model:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'
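The same smoke test can be scripted with the Python standard library. The payload mirrors the cURL call above, with "stream": False added so Ollama returns a single consolidated JSON response instead of a stream; the actual request is left commented out so the snippet runs even without a local Ollama server.

```python
import json
from urllib import request

# Build the same request body the cURL example sends.
payload = json.dumps({
    "model": "mistral",
    "prompt": "Here is a story about llamas eating grass",
    "stream": False,
}).encode("utf-8")

req = request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

# Uncomment once Ollama is running locally:
# with request.urlopen(req) as resp:
#     print(json.loads("utf-8"))["response"])
```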

Now that our local LLM, Mistral via Ollama, is running, we can move on to performing data analysis with PandasAI using this model.

PandasAI Setup

Now that our local LLM is up and running, we will complete the required tool setup to chat conversationally with our data via PandasAI.

First, we will set up a virtual environment so we can manage our project in a cleaner, more maintainable way. We will use Python's built-in venv directly, but tools such as Anaconda or Poetry can also manage virtual environments.

Create a virtual environment as below: python -m venv pandasai-demo

Once created, activate it using .\pandasai-demo\Scripts\activate (Windows) or source pandasai-demo/bin/activate (Linux/macOS)

We will now install the required modules related to PandasAI via pip as below

pip install pandasai
pip install pandasai[connectors] # optional; only needed for the Postgres connector

Once installed, we will also install Jupyter, so we can code our way via Notebooks

pip install jupyter

Once done, we will run the below to start a notebook

jupyter notebook

PandasAI Development

Sample Data Frame

We will now start the coding side of things, starting with the Import of required modules and functions as below

from pandasai import SmartDataframe
from pandasai.llm.local_llm import LocalLLM
import pandas as pd
from pandasai.connectors import PostgreSQLConnector

Create the LLM object; we will use the mistral model served by Ollama through its OpenAI-compatible endpoint

ollama_llm = LocalLLM(api_base="http://localhost:11434/v1", model="mistral")

We will now create a SmartDataframe, the main class of pandasai, which is used to interact with a single data frame; we pass it the required LLM object.

sales_by_country = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "sales": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000]
})

df = SmartDataframe(sales_by_country, config={"llm": ollama_llm})

Now we just ask a question to our smart data frame, which is powered by the LLM, as below.

response ='Which are the top 5 countries by sales?')
print(response)

Voila! It's working fine.
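Under the hood, PandasAI asks the LLM to generate pandas code for the question and then executes it. The logic produced for this query is roughly equivalent to the following plain-Python sketch:

```python
# Same data as the data frame above, as a plain dict.
sales_by_country = {
    "United States": 5000, "United Kingdom": 3200, "France": 2900,
    "Germany": 4100, "Italy": 2300, "Spain": 2100, "Canada": 2500,
    "Australia": 2600, "Japan": 4500, "China": 7000,
}

# Sort countries by sales, descending, and keep the first five.
top5 = sorted(sales_by_country.items(), key=lambda kv: kv[1], reverse=True)[:5]
for country, sales in top5:
    print(country, sales)
# China (7000) comes first, then United States, Japan, Germany, United Kingdom.
```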

External Dataset CSV

In the earlier section we used hardcoded dummy data. Now we will load an external CSV file, auto-generated via ChatGPT, containing 100 random football players with columns such as country, name, and goals, into a SmartDataframe as below.
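If you don't have such a ChatGPT-generated file handy, a minimal stand-in can be created with the standard library. The file path and column names here are assumptions; adjust them (and the prompts below) to match your actual data:

```python
import csv
import os
import random

os.makedirs("data", exist_ok=True)

countries = ["Brazil", "France", "Germany", "Spain", "England", "Argentina"]
with open("data/football_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Country", "Goals"])  # assumed column names
    for i in range(100):
        writer.writerow([f"Player {i + 1}", random.choice(countries),
                         random.randint(0, 50)])
```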

data = pd.read_csv('data/football_data.csv',  encoding = 'ISO-8859-1')
sdf = SmartDataframe(data, config={"llm": ollama_llm})

Now we will chat with our Data Frame as below

response ='Count number of unique country')
print(response)

response ='Sort players with most goals but give me only top 5')
print(response)

response ='Which country has most players')
print(response)

Voila! We are now able to chat with our dataset loaded via CSV file.

Postgres Database Connector

We have seen how to talk to data loaded via a hardcoded internal data frame as well as an external dataset in the form of a CSV file. We will now look at how to integrate with a database, in this case Postgres.

We will now set up an employee table with 100 randomly generated employees within the Postgres database.

We will run the following Docker command to start the Postgres Database

docker run --name vo_db -p 5455:5432 -e POSTGRES_USER=vo -e POSTGRES_PASSWORD=vo123 -e POSTGRES_DB=vo_db -d postgres


CREATE TABLE employee (
    employee_id SERIAL PRIMARY KEY,
    first_name VARCHAR(255) NOT NULL,
    last_name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL,
    phone_number VARCHAR(20),
    hire_date DATE NOT NULL,
    job_title VARCHAR(255) NOT NULL,
    salary NUMERIC(10, 2) NOT NULL,
    gender VARCHAR(10),
    department VARCHAR(32)
);

The script below randomly generates the 100 employees:

DO $$
DECLARE
    i INT;
    first_names TEXT[] := ARRAY['John', 'Jane', 'Alex', 'Chris', 'Pat', 'Taylor', 'Jordan', 'Morgan', 'Casey', 'Drew'];
    last_names TEXT[] := ARRAY['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor'];
    job_titles TEXT[] := ARRAY['Manager', 'Developer', 'Analyst', 'Engineer', 'Specialist', 'Coordinator', 'Consultant', 'Administrator', 'Director', 'Executive'];
    departments TEXT[] := ARRAY['HR', 'Finance', 'Admin', 'Sales', 'IT', 'Marketing'];
    gender_options TEXT[] := ARRAY['Male', 'Female'];
    start_date DATE := '2023-01-01';
    end_date DATE := '2024-05-17';
BEGIN
    FOR i IN 1..100 LOOP
        INSERT INTO employee (first_name, last_name, email, phone_number, hire_date, job_title, salary, gender, department)
        VALUES (
            first_names[(i % array_length(first_names, 1)) + 1],
            last_names[(i % array_length(last_names, 1)) + 1],
            lower(first_names[(i % array_length(first_names, 1)) + 1] || '.' || last_names[(i % array_length(last_names, 1)) + 1] || i || ''),
            '555-010' || lpad((i % 10)::text, 2, '0'),
            start_date + (random() * (end_date - start_date))::int,
            job_titles[(i % array_length(job_titles, 1)) + 1],
            round((random() * 50000 + 50000)::numeric, 2),
            gender_options[(i % array_length(gender_options, 1)) + 1],
            departments[(i % array_length(departments, 1)) + 1]
        );
    END LOOP;
END $$;
Now that the database, table, and sample data are in place, we can use PandasAI to talk to our database.

First, we set up the Connector below

emp_postgres_connector = PostgreSQLConnector(
    config={
        "host": "localhost",
        "port": 5455,
        "database": "vo_db",
        "username": "vo",
        "password": "vo123",
        "table": "employee"
    }
)
Set up the Smart Data Frame backed by the Postgres connector and the LLM

df = SmartDataframe(emp_postgres_connector, config={"llm": ollama_llm})

Now we will chat with our database, and gain insights about it'What is the total number of employees?')'Show pie chart of employee group by department?')

For the same chat above, we can confirm by executing a database query too.'Show me employee with highest 5 salaries?')

For the same chat above again, we can confirm by executing a database query too.

The above shows we can talk to our database in natural language to get the data we need.


Thank you for reading! If you have read this far, please like the article; it will encourage me to write more such articles. Do share your valuable suggestions, I appreciate your honest feedback!

I would love to connect with you on Twitter | LinkedIn
