Converse with Data via PandasAI and LLM
Enhance Your Data Analysis with PandasAI and Large Language Models
Introduction
Generative AI is highly useful for data analysis and data scientists due to its ability to enhance data processing, generate insights, and automate various tasks.
Generative AI is beneficial for Automated Data Preparation, Feature Engineering, Enhanced Data Understanding, NLP, Predictive Analysis, etc.
With generative AI, ideally, we can just talk with data in Natural language to enable more sophisticated and comprehensive insights. This ultimately leads to more efficient, accurate, and innovative data analysis.
In this article, we will use PandasAI which is a Python library that makes it easy to converse with our data in natural language. It helps us to explore, clean, and analyze. Beyond querying, PandasAI offers functionalities to visualize data through graphs, cleanse datasets by addressing missing values, and enhance data quality through feature generation, making it a comprehensive tool for data scientists and analysts.
Pre-requisites
Postgres Database (We will be using Docker to host it in this article, however, this is optional if you want to try database connector.)
Python v3.11 or above
Local LLM Setup via Ollama
You need minimum of 8GB RAM for this to run, else you might face performance issues. That is still will return in slower response but that is minimum.
Though PandasAI has support for their in-house Bamboo LLM, however as its beta we may or may not get access to same.
One can go to OpenAI too for quick setup, but we want to set up Mistral LLM via Ollama locally and use that for our Data Analysis purpose in this article. (Though we will let you know how to use OpenAI-based LLM quickly as part of the article.)
With open-source LLM you don't have to worry about Tokens and limits in the long term.
Setup Mistral via Ollama, can be done quickly as below
Download Ollama from here as per your OS
Just execute the executable file and let the process complete. Verify if the installation is complete and successful by executing
ollama --version
If this returns something like below, installation is successful.
We will be using the mistral llm model for blog purposes. All other Open Source LLM models to be run via Ollama can be found here which includes llama3, Gemma, etc as well.
Execute the below command to run mistral LLM model locally
ollama run mistral
This might take some time depending on your internet provider speed to pull the model from the server to local as it is over 4 GB.
Once mistral is downloaded, it will be served as well as part of the same command.
We will quickly test if our model is fine over HTTP by running the following cURL command
curl -X POST http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt":"Here is a story about llamas eating grass"
}'
Or Chat with model
curl http://localhost:11434/api/chat -d '{
"model": "mistral",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
}'
Now that our local LLM in the mold of Mistral via Ollama is running, we can move to step to perform data analysis via PandasAI using this LLM model.
PandasAI Setup
Now that our local LLM is up and running, we will do all required tool setups to perform conversational chat with PandasAI to analyze data.
First, we will setup a virtual environment, so that we can manage our project in a cleaner and maintainable way. We are going to use Python directly, but we can also use something like anaconda, or poetry to manage virtual environments.
Create a virtual environment as below python -m venv pandasai-demo
Once created, activate using .\pandasai-demo\Scripts\activate
We will now install the required modules related to PandasAI via pip as below
pip install pandasai
pip install pandasai[connectors] #optional if you want to skip postgres connector
Once installed, we will also install Jupyter, so we can code our way via Notebooks
pip install jupyter
Once done, we will run below to start a Notebook
jupyter notebook
PandasAI Development
Sample Data Frame
We will now start the coding side of things, starting with the Import of required modules and functions as below
from pandasai import SmartDataframe
from pandasai.llm.local_llm import LocalLLM
import pandas as pd
from pandasai.connectors import PostgreSQLConnector
Create LLM object, we are going to use Ollama LLM via mistral model
ollama_llm = LocalLLM(api_base="http://localhost:11434/v1", model="mistral")
We will now create a SmartDataframe class which is the main class of pandasai. It is used to interact with a single data frame and pass the required LLM object.
sales_by_country = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"sales": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000]
})
df = SmartDataframe(sales_by_country, config={"llm": ollama_llm})
Now we just to ask question to our Smart Data Frame which is powered by LLM as below.
response = df.chat('Which are the top 5 countries by sales?')
print(response)
Voila! It's working fine.
External Dataset CSV
In the earlier section, we saw we used hardcoded dummy data internally, Now we will input the external CSV file we auto-generated via ChatGPT for random 100 football player's data with Country, name, Goals, etc, and load it in SmartDataFrame as below
data = pd.read_csv('data/football_data.csv', encoding = 'ISO-8859-1')
sdf = SmartDataframe(data, config={"llm": ollama_llm})
Now we will chat with our Data Frame as below
response = sdf.chat('Count number of unique country')
print(response)
response = sdf.chat('Sort players with most goals but give me only top 5')
print(response)
response = sdf.chat('Which country has most players')
print(response)
Voila! We are now able to chat with our dataset loaded via CSV file.
Postgres Database Connector
We have seen how we are able to speak to data loaded internally via hardcoded data frame as well as External datasets in the form of CSV files, We will now look at how we can integrate with Database in this case Postgres.
We will now set up a employee
table with 100 Randomly generated employees within the Postgres Database.
We will run the following Docker command to start the Postgres Database
docker run --name vo_db -p 5455:5432 -e POSTGRES_USER=vo -e POSTGRES_PASSWORD=vo123 -e POSTGRES_DB=vo_db -d postgres
employee.sql
CREATE TABLE employee (
employee_id SERIAL PRIMARY KEY,
first_name VARCHAR(255) NOT NULL,
last_name VARCHAR(255) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
phone_number VARCHAR(20),
hire_date DATE NOT NULL,
job_title VARCHAR(255) NOT NULL,
salary NUMERIC(10, 2) NOT NULL,
gender VARCHAR(10),
department varchar(32)
);
Random generation of 100 employees script can be executed as below
DO
$$
DECLARE
i INT;
first_names TEXT[] := ARRAY['John', 'Jane', 'Alex', 'Chris', 'Pat', 'Taylor', 'Jordan', 'Morgan', 'Casey', 'Drew'];
last_names TEXT[] := ARRAY['Smith', 'Johnson', 'Williams', 'Jones', 'Brown', 'Davis', 'Miller', 'Wilson', 'Moore', 'Taylor'];
job_titles TEXT[] := ARRAY['Manager', 'Developer', 'Analyst', 'Engineer', 'Specialist', 'Coordinator', 'Consultant', 'Administrator', 'Director', 'Executive'];
departments TEXT[] := ARRAY['HR', 'Finance', 'Admin', 'Sales', 'IT', 'Marketing'];
gender_options TEXT[] := ARRAY['Male', 'Female'];
start_date DATE := '2023-01-01';
end_date DATE := '2024-05-17';
BEGIN
FOR i IN 1..100 LOOP
INSERT INTO employee (first_name, last_name, email, phone_number, hire_date, job_title, salary, gender, department)
VALUES (
first_names[(i % array_length(first_names, 1)) + 1],
last_names[(i % array_length(last_names, 1)) + 1],
lower(first_names[(i % array_length(first_names, 1)) + 1] || '.' || last_names[(i % array_length(last_names, 1)) + 1] || i || '@company.com'),
'555-010' || lpad((i % 10)::text, 2, '0'),
start_date + (random() * (end_date - start_date))::int,
job_titles[(i % array_length(job_titles, 1)) + 1],
round((random() * 50000 + 50000)::numeric, 2),
gender_options[(i % array_length(gender_options, 1)) + 1],
departments[(i % array_length(departments, 1)) + 1]
);
END LOOP;
END
$$;
Now that the database, table, and associated table are in place we can use PandasAI to talk with our database
First, we set up the Connector below
emp_postgres_connector = PostgreSQLConnector(
config={
"host": "localhost",
"port": 5455,
"database": "vo_db",
"username": "vo",
"password": "vo123",
"table": "employee"
}
)
Setup Smart Data Frame backed by Postgres Connector and LLM
df = SmartDataframe(emp_postgres_connector, config={"llm": ollama_llm})
Now we will chat with our database, and gain insights about it
df.chat('What is the total number of employees?')
df.chat('Show pie chart of employee group by department?')
For the same chat above, we can confirm by executing a database query too
df.chat('Show me employee with highest 5 salaries?')
For the same chat above again, we can confirm by executing a database query too
The above shows we can talk to our database in natural language to get the data we need.
Resources
Thank you for reading, If you have reached it so far, please like the article, It will encourage me to write more such articles. Do share your valuable suggestions, I appreciate your honest feedback and suggestions!