Create a new schema
Create a new semantic layer schema using the create method
PandaAI 3.0 is currently in beta. This documentation reflects the latest features and functionality, which may evolve before the final release.
Using the pai.create() method with CSV and parquet files
The simplest way to define a semantic layer schema is using the create method:
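A minimal sketch is shown below; the file path, dataset path, and description are placeholders, and the exact call shape may evolve during the beta:

```python
import pandasai as pai

# Read the raw data; the file path is a placeholder
file = pai.read_csv("./data/companies.csv")

# Register the dataset in the semantic layer under "organization/dataset"
df = pai.create(
    path="my-org/companies",
    df=file,
    description="Customer companies with revenue and region information",
)
```

The parameters accepted by the create method are described below.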
- path
The path uniquely identifies your dataset in the PandaAI ecosystem using the format “organization/dataset”.
Type: str
- Must follow the format: “organization-identifier/dataset-identifier”
- Organization identifier should be unique to your organization
- Dataset identifier should be unique within your organization
- Can be used both locally and with the PandaAI Data Platform
- Examples: “acme-corp/sales-data”, “my-org/customer-profiles”
- df
The input dataframe that contains your data, typically created using pai.read_csv().
Type: DataFrame
- Must be a pandas DataFrame created with pai.read_csv()
- Contains the raw data you want to enhance with semantic information
- Required parameter for creating a semantic layer
- description
A clear text description that helps others understand the dataset’s contents and purpose.
Type: str
- Should cover the purpose of the dataset
- Should cover the type of data contained
- Should include any relevant context about data collection or usage
- Optional but recommended for better data understanding
- columns
Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.
Note: If the columns parameter is not provided, all columns from the input dataframe will be included in the semantic layer. When specified, only the declared columns will be included, allowing you to select specific columns for your semantic layer.
Type: dict[str, dict]
- Keys: column names as they appear in your DataFrame
- Values: dictionary containing:
  - type (str): Data type of the column
    - “string”: IDs, names, categories
    - “integer”: counts, whole numbers
    - “float”: prices, percentages
    - “datetime”: timestamps, dates
    - “boolean”: flags, true/false values
  - description (str): Clear explanation of what the column represents
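As an illustration, the same create call with an explicit columns parameter might look like the following sketch (the column names, types, and descriptions are placeholders following the structure described above):

```python
# Declare only the columns we want to expose in the semantic layer.
# Column names and descriptions are illustrative.
df = pai.create(
    path="my-org/companies",
    df=file,
    description="Customer companies with revenue and region information",
    columns={
        "company_name": {
            "type": "string",
            "description": "The name of the company",
        },
        "revenue": {
            "type": "float",
            "description": "Annual revenue in USD",
        },
        "signup_date": {
            "type": "datetime",
            "description": "Date the company signed up",
        },
    },
)
```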
Using the pai.create() method for SQL databases
You need to install the pandasai-sql extra dependency for this feature.
See SQL installation instructions.
For SQL databases, you can use the create method to define your data source and schema. Here’s an example using a MySQL database:
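The sketch below follows the fields called out in the bullets that follow; the dataset path, table name, and column definitions are placeholders, and the exact keys of the source object may vary between versions:

```python
import os

import pandasai as pai

# Connection details are read from environment variables rather than hardcoded.
# All names below (host, table, columns) are illustrative.
sql_table = pai.create(
    path="my-org/heart-data",
    description="Heart disease dataset stored in a MySQL database",
    source={
        "type": "mysql",
        "connection": {
            "host": os.environ["DB_HOST"],
            "port": 3306,
            "user": os.environ["DB_USER"],
            "password": os.environ["DB_PASSWORD"],
            "database": os.environ["DB_NAME"],
        },
        "table": "heart_data",
        "columns": [
            {"name": "age", "type": "integer", "description": "Age of the patient in years"},
            {"name": "sex", "type": "string", "description": "Sex of the patient"},
            {"name": "cholesterol", "type": "float", "description": "Serum cholesterol in mg/dl"},
        ],
    },
)
```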
In this example:
- The path defines where the dataset will be stored in your project
- The description provides context about the dataset
- The source object contains:
  - Database connection details (using environment variables for security)
  - Table name to query
  - Column definitions with types and descriptions
For security best practices, always use environment variables for sensitive connection details. Never hardcode credentials in your code.
You can then use this dataset like any other:
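For example, a minimal usage sketch (the dataset path and question are placeholders):

```python
import pandasai as pai

# Load the previously registered dataset by its "organization/dataset" path
dataset = pai.load("my-org/heart-data")

# Ask questions in natural language
dataset.chat("What is the average cholesterol level by sex?")
```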
YAML Semantic Layer Configuration
Whenever you create a semantic layer schema using the create method, a YAML configuration file is automatically generated for you in the datasets/ directory of your project.
As an alternative, you can write a schema.yaml file directly in the datasets/organization_name/dataset_name directory.
The following sections detail all available configuration options for your schema.yaml file:
- description
A clear text description that helps others understand the dataset’s contents and purpose.
Type: str
- Should state the purpose of the dataset so that everyone in the organization, as well as the LLMs, can understand it
- source (mandatory for SQL datasets)
Specify the data source for your dataset.
The available data sources depend on the installed data extensions (SQL databases, data lakehouses, yahoo_finance).
Type: dict
- type (str): Type of data source
  - “postgresql” for PostgreSQL databases
  - “mysql” for MySQL databases
  - “bigquery” for Google BigQuery data
  - “snowflake” for Snowflake data
  - “databricks” for Databricks data
  - “oracle” for Oracle databases
  - “yahoo_finance” for Yahoo Finance data
- connection_string (str): Connection string for the data source
- query (str): Query to retrieve data from the data source
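As an illustration, a source block for a MySQL table might look like the following sketch (the connection string and query are placeholders; in practice, keep credentials in environment variables rather than in the file):

```yaml
source:
  type: mysql
  connection_string: mysql://user:password@localhost:3306/sales   # placeholder, never commit real credentials
  query: SELECT * FROM transactions
```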
- columns
Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.
Type: list[dict]
- Each dictionary represents a column.
- Fields:
  - name (str): Name of the column.
    - For tables: Use simple column names (e.g., transaction_id).
  - type (str): Data type of the column.
    - Supported types:
      - "string": IDs, names, categories.
      - "integer": Counts, whole numbers.
      - "float": Prices, percentages.
      - "datetime": Timestamps, dates.
      - "boolean": Flags, true/false values.
  - description (str): Clear explanation of what the column represents.
Constraints:
- Column names must be unique.
- For views, all column names must be in the format [table].[column].
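A hedged example of a columns block for a table-backed dataset (column names and descriptions are illustrative):

```yaml
columns:
  - name: transaction_id
    type: string
    description: Unique identifier of the transaction
  - name: amount
    type: float
    description: Transaction amount in USD
  - name: created_at
    type: datetime
    description: Timestamp at which the transaction was recorded
```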
- transformations
Apply transformations to your data to clean, convert, or anonymize it.
Type: list[dict]
- Each dictionary represents a transformation, containing:
  - type (str): Type of transformation
    - “anonymize” for anonymizing data
    - “convert_timezone” for converting timezones
  - params (dict): Parameters for the transformation
If you want to learn more about transformations, check out the transformations documentation.
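For orientation, a transformations block using the two types listed above might look like this sketch; the parameter names inside params are illustrative, so consult the transformations documentation for the exact options:

```yaml
transformations:
  - type: anonymize
    params:
      column: email          # column to anonymize (illustrative)
  - type: convert_timezone
    params:
      column: created_at     # column to convert (illustrative)
      to: UTC
```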
Group By Configuration
The group_by field allows you to specify which columns can be used for grouping operations. This is particularly useful for aggregation queries and data analysis.
Configuration Options:
- group_by (list[str]):
  - List of column references in the format table.column
  - Specifies which columns can be used for grouping operations
  - Can reference any column from any table in your schema
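For example (the table and column names are placeholders):

```yaml
group_by:
  - transactions.region
  - transactions.product_category
```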
Column expressions and aliases
The expression field allows you to specify a SQL expression for a column. This expression will be used in the query instead of the column name.
Configuration Options:
- alias (str):
  - Alternative name that can be used to reference the column
  - Useful for supporting different naming conventions or more intuitive names
  - Must be unique across all columns and their aliases
- expression (str):
  - Formula for calculating derived columns
  - Uses other column names as variables
  - Supports basic arithmetic operations (+, -, *, /)
  - Can reference other columns in the same schema
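A hedged sketch combining both options; the column names, formula, and aliases are illustrative:

```yaml
columns:
  - name: amount
    type: float
    alias: revenue                      # alternative name for the column
    description: Gross transaction amount
  - name: net_amount
    type: float
    expression: amount - refund_amount  # derived from other columns in the schema
    alias: net_revenue
    description: Net revenue derived from amount and refund_amount
```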
Best Practices:
- Keep aliases concise and descriptive
- Avoid using special characters or spaces in aliases
- Use consistent naming conventions
- Document the purpose of derived columns in their description