Release v3 is currently in beta. This documentation reflects the features and functionality in progress and may change before the final release.

You can train PandaAI to understand your data better and to improve its performance. Training is as easy as calling the train method on the Agent.

Prerequisites

Before you start training PandaAI, you need to set your PandaAI API key. You can generate your API key by signing up at https://app.pandabi.ai.

import pandasai as pai

pai.api_key.set("your-pai-api-key")

It is important that you set the API key; otherwise, training will fail with the following error: No vector store provided. Please provide a vector store to train the agent.
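
Alternatively, you can expose the key through the PANDABI_API_KEY environment variable, which is the same variable referenced in the Troubleshooting section below. A minimal sketch:

import os

# Assumption: PandaAI picks up the key from this environment variable
# when it has not been set explicitly with pai.api_key.set
os.environ["PANDABI_API_KEY"] = "your-pai-api-key"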

Instructions training

Instructions training is used to teach PandaAI how you expect it to respond to certain queries. You can provide generic instructions about how you expect the model to approach certain types of queries, and PandaAI will use these instructions to generate responses to similar queries.

For example, you might want the LLM to be aware that your company’s fiscal year starts in April, or of specific ways you want it to handle missing data. Or you might want to teach it about business rules or data analysis best practices that are specific to your organization.

To train PandaAI with instructions, you can use the train method on the Agent, as follows:

By default, training uses the BambooVectorStore to store the training data, which is accessible with your API key.

Alternatively, if you want to use a local vector store (enterprise only for production use cases), you can use the ChromaDB, Qdrant, Pinecone, or LanceDB vector stores (see the examples below).

import pandasai as pai
from pandasai import Agent

pai.api_key.set("your-pai-api-key")

agent = Agent("data.csv")
agent.train(docs="The fiscal year starts in April")

response = agent.chat("What is the total sales for the fiscal year?")
print(response)
# The model will use the information provided in the training to generate a response

Your training data is persisted, so you only need to train the model once.
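
For example, in a later session you can create a new agent and chat with it directly; the persisted instructions are retrieved from the vector store, so there is no need to call train again. A minimal sketch, assuming the same API key and therefore the same BambooVectorStore:

import pandasai as pai
from pandasai import Agent

pai.api_key.set("your-pai-api-key")

# A fresh agent in a new session reuses the previously stored instructions
agent = Agent("data.csv")
response = agent.chat("What is the total sales for the fiscal year?")
print(response)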

Q/A training

Q/A training is used to teach PandaAI the desired process to answer specific questions, enhancing the model’s performance and determinism. One of the biggest challenges with LLMs is that they are not deterministic, meaning that the same question can produce different answers at different times. Q/A training can help to mitigate this issue.

To train PandaAI with Q/A, you can use the train method on the Agent, as follows:

from pandasai import Agent

agent = Agent("data.csv")

# Train the model
query = "What is the total sales for the current fiscal year?"
# The following code is passed as a string to the response variable
response = """
import pandas as pd

df = dfs[0]

# Calculate the total sales for the current fiscal year
total_sales = df[df['date'] >= pd.to_datetime('today').replace(month=4, day=1)]['sales'].sum()
result = { "type": "number", "value": total_sales }
"""

agent.train(queries=[query], codes=[response])

response = agent.chat("What is the total sales for the last fiscal year?")
print(response)

# The model will use the information provided in the training to generate a response

As with instructions training, your training data is persisted, so you only need to train the model once.

Training with local Vector stores

If you want to train the model with a local vector store, you can use the local ChromaDB, Qdrant, Pinecone, or LanceDB vector stores. An enterprise license is required to use the vector stores locally (check it out). If you plan to use them in production, contact us. Here’s how to do it:

from pandasai import Agent
from pandasai.ee.vectorstores import ChromaDB
from pandasai.ee.vectorstores import Qdrant
from pandasai.ee.vectorstores import Pinecone
from pandasai.ee.vectorstores import LanceDB

# Instantiate the vector store
vector_store = ChromaDB()
# or with Qdrant
# vector_store = Qdrant()
# or with LanceDB
# vector_store = LanceDB()
# or with Pinecone
# vector_store = Pinecone(
#     api_key="*****",
#     embedding_function=embedding_function,
#     dimensions=384, # dimension of your embedding model
# )

# Instantiate the agent with the custom vector store
agent = Agent("data.csv", vectorstore=vector_store)

# Train the model
query = "What is the total sales for the current fiscal year?"
# The following code is passed as a string to the response variable
response = """
import pandas as pd

df = dfs[0]

# Calculate the total sales for the current fiscal year
total_sales = df[df['date'] >= pd.to_datetime('today').replace(month=4, day=1)]['sales'].sum()
result = { "type": "number", "value": total_sales }
"""

agent.train(queries=[query], codes=[response])

response = agent.chat("What is the total sales for the last fiscal year?")
print(response)
# The model will use the information provided in the training to generate a response

Using the Sandbox Environment

To enhance security and protect against malicious code through prompt injection, PandaAI provides a sandbox environment for code execution. The sandbox runs your code in an isolated Docker container, ensuring that potentially harmful operations are contained.

Installation

Before using the sandbox, you need to install Docker on your machine and ensure it is running.

First, install the sandbox package:

pip install pandasai-docker

Basic Usage

Here’s how to use the sandbox with your PandaAI agent:

import pandasai as pai
from pandasai import Agent
from pandasai_docker import DockerSandbox

# Initialize the sandbox
sandbox = DockerSandbox()
sandbox.start()

# Create an agent with the sandbox
df = pai.read_csv("data.csv")
agent = Agent([df], sandbox=sandbox)

# Chat with the agent - code will run in the sandbox
response = agent.chat("Calculate the average sales")

# Don't forget to stop the sandbox when done
sandbox.stop()

Customizing the Sandbox

You can customize the sandbox environment by specifying a custom name and Dockerfile:

sandbox = DockerSandbox(
    "custom-sandbox-name",
    "/path/to/custom/Dockerfile"
)
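
Once created, a custom sandbox is used exactly like the default one: start it, pass it to the agent, and stop it when you are done. A minimal sketch, reusing the names from the basic usage example above:

# Start the customized sandbox and attach it to an agent
sandbox.start()
agent = Agent([df], sandbox=sandbox)

response = agent.chat("Calculate the average sales")

# Stop the sandbox when done
sandbox.stop()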

The sandbox works offline and provides an additional layer of security for code execution. It’s particularly useful when working with untrusted data or when you need to ensure that code execution is isolated from your main system.

Troubleshooting

In some cases, you might get an error like this: No vector store provided. Please provide a vector store to train the agent. It means that no API key has been set, so the BambooVectorStore cannot be used.

Here’s how to fix it:

First of all, you’ll need to generate an API key (see the Prerequisites section above). Once you have generated the API key, you have two options:

  1. Override the env variable (os.environ["PANDABI_API_KEY"] = "YOUR_PANDABI_API_KEY")
  2. Instantiate the vector store and pass the API key:
# Instantiate the vector store with the API key
vector_store = BambooVectorStore(api_key="YOUR_PANDABI_API_KEY")

# Instantiate the agent with the custom vector store
agent = Agent(connector, config={...}, vectorstore=vector_store)

Custom Head

In some cases, you might want to provide custom data samples to the conversational agent to improve its understanding and responses. For example, you might want to:

  • Provide better examples that represent your data patterns
  • Avoid sharing sensitive information
  • Guide the agent with specific data scenarios

You can do this by passing a custom head to the agent:

import pandas as pd
import pandasai as pai

# Your original dataframe
df = pd.DataFrame({
    'sensitive_id': [1001, 1002, 1003, 1004, 1005],
    'amount': [150, 200, 300, 400, 500],
    'category': ['A', 'B', 'A', 'C', 'B']
})

# Create a custom head with anonymized data
head_df = pd.DataFrame({
    'sensitive_id': [1, 2, 3, 4, 5],
    'amount': [100, 200, 300, 400, 500],
    'category': ['A', 'B', 'C', 'A', 'B']
})

# Use the custom head
smart_df = pai.SmartDataframe(df, config={
    "custom_head": head_df
})

The agent will use your custom head instead of the default first 5 rows of the dataframe when analyzing and responding to queries.
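
For example, you can then chat with the dataframe as usual. A minimal sketch, assuming the dataframes defined above:

# The LLM only sees head_df as a sample, not the first rows of df
response = smart_df.chat("What is the total amount per category?")
print(response)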