Release v3 is currently in beta. This documentation reflects the features and functionality in progress and may change before the final release.

What’s the Semantic Layer?

The semantic layer turns raw data into dataframes you can ask questions of and share with your team as conversational AI dashboards. It serves several important purposes:

  1. Data Configuration: Define how your data should be loaded and processed
  2. Semantic Information: Add context and meaning to your data columns
  3. Data Transformation: Specify how data should be cleaned and transformed
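
In practice, this amounts to reading your raw data, attaching semantic information with pai.create() (described below), and then querying the result in natural language. The following is a minimal sketch of that flow; the df.chat() call and the separate LLM configuration it relies on are assumptions based on the wider PandaAI API rather than part of the create method itself:

import pandasai as pai

# Read the raw CSV into a PandaAI dataframe
file = pai.read_csv("data.csv")

# Attach semantic information (see the create method below)
df = pai.create(
    path="my-org/sales-data",   # hypothetical "organization/dataset" path
    name="sales-data",
    df=file,
    description="Sales data from our retail stores",
)

# Ask a question in natural language (an LLM must be configured separately)
response = df.chat("Which product had the highest total revenue?")
print(response)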

How to use the Semantic Layer?

There are two ways to use the semantic layer:

For CSV files: using the create method

The simplest way to create a semantic layer for CSV files is using the create method:

import pandasai as pai

file = pai.read_csv("data.csv")

df = pai.create(
    path="company/sales-data",         # Format: "organization/dataset"
    name="sales-data",                 # Human-readable name
    df=file,                           # Input dataframe
    description="Sales data from our retail stores",  # Optional description
    columns=[
        {
            "name": "transaction_id",
            "type": "string",
            "description": "Unique identifier for each sale"
        },
        {
            "name": "sale_date"
            "type": "datetime",
            "description": "Date and time of the sale"
        }
    ]
)

name

The name field identifies your dataset in the create method.

file = pai.read_csv("data.csv")

pai.create(
    path="company/sales-data",
    name="sales-data",  # Unique, descriptive name
    ...
)

Type: str

  • Must not contain special characters or spaces
  • Should use the kebab-case naming convention
  • Must be unique within your project
  • Examples: “sales-data”, “customer-profiles”

path

The path uniquely identifies your dataset in the PandaAI ecosystem using the format “organization/dataset”.

file = pai.read_csv("data.csv")

pai.create(
    path="acme-corp/sales-data",  # Format: "organization/dataset"
    ...
)

Type: str

  • Must follow the format: “organization-identifier/dataset-identifier”
  • Organization identifier should be unique to your organization
  • Dataset identifier should be unique within your organization
  • Can be used both locally and with the PandaAI Data Platform
  • Examples: “acme-corp/sales-data”, “my-org/customer-profiles”

df

The input dataframe that contains your data, typically created using pai.read_csv().

file = pai.read_csv("data.csv")  # Create the input dataframe

pai.create(
    path="acme-corp/sales-data",
    df=file,  # Pass your dataframe here
    ...
)

Type: DataFrame

  • Must be a pandas DataFrame created with pai.read_csv()
  • Contains the raw data you want to enhance with semantic information
  • Required parameter for creating a semantic layer

connector

The connector field allows you to connect data sources such as PostgreSQL, MySQL, and SQLite to the semantic layer. For example, if you’re working with a SQL database, you can specify the connection details using the connector field.


pai.create(
    path="acme-corp/sales-data",
    connector={
         "type": "postgres",
         "connection": {
             "host": "postgres-host",
             "port": 5432,
             "user": "postgres",
             "password": "*****",
             "database": "postgres",
         },
         "table": "orders",
     },
    ...
)

Type: dict

  • Must be a dictionary describing a SQL connector source (type, connection details, and table)
  • Required when creating a semantic layer from a SQL data source instead of a dataframe

description

A clear text description that helps others understand the dataset’s contents and purpose.

file = pai.read_csv("data.csv")

pai.create(
    path="company/sales-data",
    name="sales-data",
    df=file,
    description="Daily sales transactions from all retail stores, including transaction IDs, dates, and amounts",
    ...
)

Type: str

  • The purpose of the dataset
  • The type of data contained
  • Any relevant context about data collection or usage
  • Optional but recommended for better data understanding

columns

Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.

Note: If the columns parameter is not provided, all columns from the input dataframe will be included in the semantic layer. When specified, only the declared columns will be included, allowing you to select specific columns for your semantic layer.

file = pai.read_csv("data.csv")

pai.create(
    path="company/sales-data",
    name="sales-data",
    df=file,
    description="Daily sales transactions from all retail stores",
    columns=[
        {
            "name": "transaction_id",
            "type": "string",
            "description": "Unique identifier for each sale"
        },
        {
            "name": "sale_date"
            "type": "datetime",
            "description": "Date and time of the sale"
        },
        {
            "name": "quantity",
            "type": "integer",
            "description": "Number of units sold"
        },
        {
            "name": "price",
            "type": "float",
            "description": "Price per unit in USD"
        },
        {
            "name": "is_online",
            "type": "boolean",
            "description": "Whether the sale was made online"
        }
    ]
)

Type: list[dict]

  • Each dictionary describes one column and contains:
    • name (str): Column name as it appears in your DataFrame
    • type (str): Data type of the column
      • “string”: IDs, names, categories
      • “integer”: counts, whole numbers
      • “float”: prices, percentages
      • “datetime”: timestamps, dates
      • “boolean”: flags, true/false values
    • description (str): Clear explanation of what the column represents

For other data sources: YAML configuration

For other data sources (SQL databases, data warehouses, etc.), create a YAML file in your datasets folder:

Keep in mind that you have to install the sql, cloud data (ee), or yahoo_finance data extension to use this feature.

Example PostgreSQL YAML file:

name: SalesData  # Dataset name
description: "Sales data from our SQL database"

source:
  type: postgres
  connection:
    host: postgres-host
    port: 5432
    database: postgres
    user: postgres
    password: ******
  table: orders
  view: false

columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale
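
Once a schema like this is saved in your datasets folder (for example as datasets/my-org/sales-data/schema.yaml; the exact folder layout here is an assumption for illustration), the dataset can be loaded by its path and queried. The sketch below assumes pai.load() accepts the same "organization/dataset" path format used elsewhere on this page:

import pandasai as pai

# Hypothetical path; it should match the folder that contains the schema.yaml
df = pai.load("my-org/sales-data")

# Query the SQL-backed dataset in natural language
response = df.chat("What was the total order value per month?")
print(response)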

Example SQLite YAML file:

name: Companies  # Dataset name
description: "Companies table from our SQLite database"

source:
  type: sqlite
  connection:
    file_path: /Users/arslan/Documents/SinapTik/pandas-ai/companies.db
  table: companies
  view: false

columns:
  - name: id
    type: integer
  - name: name
    type: string
  - name: domain
    type: string
  - name: year_founded
    type: float

YAML Semantic Layer Configuration

The following sections detail all available configuration options for your schema.yaml file:

name (mandatory)

The name field identifies your dataset in the schema.yaml file.

name: sales-data

Type: str

  • Must not contain special characters or spaces
  • Should use the kebab-case naming convention
  • Must be unique within your project
  • Examples: “sales-data”, “customer-profiles”

columns

Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.

columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale

Type: list[dict]

  • Each dictionary represents a column.
  • Fields:
    • name (str): Name of the column.
      • For tables: Use simple column names (e.g., transaction_id).
    • type (str): Data type of the column.
      • Supported types:
        • "string": IDs, names, categories.
        • "integer": Counts, whole numbers.
        • "float": Prices, percentages.
        • "datetime": Timestamps, dates.
        • "boolean": Flags, true/false values.
    • description (str): Clear explanation of what the column represents.

Constraints:

  1. Column names must be unique.
  2. For views, all column names must be in the format [table].[column].

transformations

Apply transformations to your data to clean, convert, or anonymize it.

transformations:
  - type: anonymize
    params:
      columns:
        - transaction_id
      method: hash
  - type: convert_timezone
    params:
      columns:
        - sale_date
      from_timezone: UTC
      to_timezone: America/New_York

Type: list[dict]

  • Each dictionary represents a transformation
  • type (str): Type of transformation
    • “anonymize” for anonymizing data
    • “convert_timezone” for converting timezones
  • params (dict): Parameters for the transformation

source (mandatory)

Specify the data source for your dataset.

source:
  type: postgres
  connection:
    host: postgres-host
    port: 5432
    database: postgres
    user: postgres
    password: ******
  table: orders
  view: false

The available data sources depend on the installed data extensions (sql, cloud data (ee), yahoo_finance).

Type: dict

  • type (str): Type of data source
    • “postgres” for PostgreSQL databases
    • “mysql” for MySQL databases
    • “sqlite” for SQLite databases
    • “yahoo_finance” for Yahoo Finance data
    • “bigquery” for Google BigQuery data
    • “snowflake” for Snowflake data
    • “databricks” for Databricks data
    • “oracle” for Oracle databases
  • connection (dict): Connection details for the data source (host, port, database, user, password), as shown in the example above
  • table (str): Table to load data from
  • view (bool): Whether the source should be treated as a view
  • connection_string (str): Connection string for the data source
  • query (str): Query to retrieve data from the data source

Note: The destination and update_frequency options below apply only when the dataset is materialized.

destination (mandatory)

Specify the destination for your dataset.

Type: dict

  • type (str): Type of destination
    • “local” for local storage
  • format (str): Format of the data
    • “parquet” for Parquet format
  • path (str): Path to store the data

destination:
  type: local
  format: parquet
  path: /path/to/data

update_frequency

Specify the frequency of updates for your dataset.

Type: str

  • “daily” for daily updates
  • “weekly” for weekly updates
  • “monthly” for monthly updates
  • “yearly” for yearly updates

update_frequency: daily


order_by

Specify the columns to order by.

Type: list[str]

  • Each string should be in the format “column_name DESC” or “column_name ASC”

order_by:
  - transaction_id DESC
  - sale_date ASC

limit

Specify the maximum number of records to load.

Type: int

limit: 1000

View Configuration

The following sections detail all available configurations for view options in your schema.yaml file. Similar to views in SQL, you can define multiple tables and the relationships between them.

Example Configuration

name: table_heart
source:
  type: postgres
  connection:
    host: postgres-host
    port: 5432
    database: postgres
    user: postgres
    password: ******
  table: heart
  view: true
columns:
  - name: parents.id
  - name: parents.name
  - name: parents.age
  - name: children.name
  - name: children.age
relations:
  - name: parent_to_children
    description: Relation linking the parent to its children
    from: parents.id
    to: children.id

Constraints

  1. Mutual Exclusivity:
    • A schema cannot define both table and view simultaneously.
    • If source.view is true, then the schema represents a view.
  2. Column Format:
    • For views:
      • All columns must follow the format [table].[column].
      • from and to fields in relations must follow the [table].[column] format.
      • Example: parents.id, children.name.
  3. Relationships for Views:
    • Each table referenced in columns must have at least one relationship defined in relations.
    • Relationships must specify from and to attributes in the [table].[column] format.
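
As a closing sketch, a view defined this way can be loaded and queried like any other dataset, with the declared relations letting questions span the related tables. The pai.load() call and the dataset path below are assumptions for illustration:

import pandasai as pai

# Hypothetical dataset path for the view defined above
df = pai.load("my-org/table-heart")

# The parent/children relation lets a question span both tables
response = df.chat("What is the average age of children for each parent?")
print(response)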