Semantic Layer
Turn raw data into clean, semantically enhanced dataframes
Release v3 is currently in beta. This documentation reflects the features and functionality in progress and may change before the final release.
What’s the Semantic Layer?
The semantic layer allows you to turn raw data into dataframes you can ask questions of and share with your team as conversational AI dashboards. It serves several important purposes:
- Data Configuration: Define how your data should be loaded and processed
- Semantic Information: Add context and meaning to your data columns
- Data Transformation: Specify how data should be cleaned and transformed
How to use the Semantic Layer?
There are two ways to use the semantic layer:
For CSV files: using the create method
The simplest way to create a semantic layer for CSV files is using the create method:
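A minimal sketch of such a call, assembled from the parameters documented below; the file path, dataset names, and columns are hypothetical:

```python
import pandasai as pai

# Load the raw CSV data (the file path is a placeholder)
df = pai.read_csv("data/sales.csv")

# Create the semantic layer from the dataframe
dataset = pai.create(
    name="sales-data",
    path="my-org/sales-data",
    df=df,
    description="Daily sales transactions per store",
    columns={
        "transaction_id": {"type": "string", "description": "Unique transaction identifier"},
        "amount": {"type": "float", "description": "Sale amount in USD"},
        "sold_at": {"type": "datetime", "description": "Timestamp of the sale"},
    },
)
```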
name
The name field identifies your dataset in the create method.
Type: str
- A string without special characters or spaces
- Using kebab-case naming convention
- Unique within your project
- Examples: “sales-data”, “customer-profiles”
path
The path uniquely identifies your dataset in the PandaAI ecosystem using the format “organization/dataset”.
Type: str
- Must follow the format: “organization-identifier/dataset-identifier”
- Organization identifier should be unique to your organization
- Dataset identifier should be unique within your organization
- Can be used both locally and with the PandaAI Data Platform
- Examples: “acme-corp/sales-data”, “my-org/customer-profiles”
df
The input dataframe that contains your data, typically created using pai.read_csv().
Type: DataFrame
- Must be a pandas DataFrame created with pai.read_csv()
- Contains the raw data you want to enhance with semantic information
- Required parameter for creating a semantic layer
connector
The connector field allows you to connect data sources such as PostgreSQL, MySQL, and SQLite to the semantic layer. For example, if you’re working with a SQL database, you can specify the connection details using the connector field.
Type: Dict
- Must be a SQL connector source dict
- Must contain the connection details required to create the semantic layer
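A hypothetical connector dict, mirroring the source fields documented later on this page; the connection string and query are placeholders:

```python
# Hypothetical connector dict; the keys mirror the source fields documented
# below, and the connection details are placeholders
connector = {
    "type": "postgresql",
    "connection_string": "postgresql://user:password@localhost:5432/mydb",
    "query": "SELECT * FROM sales",
}
```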
description
A clear text description that helps others understand the dataset’s contents and purpose.
Type: str
- The purpose of the dataset
- The type of data contained
- Any relevant context about data collection or usage
- Optional but recommended for better data understanding
columns
Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.
Note: If the columns parameter is not provided, all columns from the input dataframe will be included in the semantic layer. When specified, only the declared columns will be included, allowing you to select specific columns for your semantic layer.
Type: dict[str, dict]
- Keys: column names as they appear in your DataFrame
- Values: dictionary containing:
  - type (str): Data type of the column
    - “string”: IDs, names, categories
    - “integer”: counts, whole numbers
    - “float”: prices, percentages
    - “datetime”: timestamps, dates
    - “boolean”: flags, true/false values
  - description (str): Clear explanation of what the column represents
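For instance, a columns dict covering each supported type might look like this; the column names are hypothetical:

```python
columns = {
    "customer_id": {"type": "string", "description": "Unique customer identifier"},
    "order_count": {"type": "integer", "description": "Number of orders placed"},
    "total_spent": {"type": "float", "description": "Lifetime spend in USD"},
    "signed_up_at": {"type": "datetime", "description": "Account creation timestamp"},
    "is_active": {"type": "boolean", "description": "Whether the account is active"},
}
```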
For other data sources: YAML configuration
For other data sources (SQL databases, data warehouses, etc.), create a YAML file in your datasets folder:
Keep in mind that you have to install the sql, cloud data (ee), or yahoo_finance data extension to use this feature.
Example PostgreSQL YAML file (a sketch built from the fields documented below; the connection string, query, and column names are placeholders):
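```yaml
name: sales-data
source:
  type: postgresql
  connection_string: "postgresql://user:password@localhost:5432/mydb"
  query: "SELECT * FROM sales"
columns:
  - name: transaction_id
    type: string
    description: Unique transaction identifier
  - name: amount
    type: float
    description: Sale amount in USD
```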
Example SQLite YAML file (likewise a sketch with placeholder connection details):
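```yaml
name: customer-profiles
source:
  type: sqlite
  connection_string: "sqlite:///data/customers.db"
  query: "SELECT * FROM customers"
columns:
  - name: customer_id
    type: string
    description: Unique customer identifier
```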
YAML Semantic Layer Configuration
The following sections detail all available configuration options for your schema.yaml file:
name (mandatory)
The name field identifies your dataset in the schema.yaml file.
Type: str
- A string without special characters or spaces
- Using kebab-case naming convention
- Unique within your project
- Examples: “sales-data”, “customer-profiles”
columns
Define the structure and metadata of your dataset’s columns to help PandaAI understand your data better.
Type: list[dict]
- Each dictionary represents a column.
- Fields:
  - name (str): Name of the column.
    - For tables: use simple column names (e.g., transaction_id).
  - type (str): Data type of the column.
    - Supported types:
      - “string”: IDs, names, categories.
      - “integer”: counts, whole numbers.
      - “float”: prices, percentages.
      - “datetime”: timestamps, dates.
      - “boolean”: flags, true/false values.
  - description (str): Clear explanation of what the column represents.
Constraints:
- Column names must be unique.
- For views, all column names must be in the format [table].[column].
transformations
Apply transformations to your data to clean, convert, or anonymize it.
Type: list[dict]
- Each dictionary represents a transformation.
- Fields:
  - type (str): Type of transformation
    - “anonymize” for anonymizing data
    - “convert_timezone” for converting timezones
  - params (dict): Parameters for the transformation
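A sketch of a transformations block; the params keys shown here (column, from, to) are assumptions for illustration and depend on the transformation type:

```yaml
transformations:
  - type: anonymize
    params:
      column: email
  - type: convert_timezone
    params:
      column: created_at
      from: "UTC"
      to: "America/New_York"
```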
source (mandatory)
Specify the data source for your dataset.
The available data sources depend on the installed data extensions (sql, cloud data (ee), yahoo_finance).
Type: dict
- Fields:
  - type (str): Type of data source
    - “postgresql” for PostgreSQL databases
    - “mysql” for MySQL databases
    - “sqlite” for SQLite databases
    - “yahoo_finance” for Yahoo Finance data
    - “bigquery” for Google BigQuery data
    - “snowflake” for Snowflake data
    - “databricks” for Databricks data
    - “oracle” for Oracle databases
  - connection_string (str): Connection string for the data source
  - query (str): Query to retrieve data from the data source
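For example, a MySQL source might look like this; the connection details are placeholders:

```yaml
source:
  type: mysql
  connection_string: "mysql://user:password@localhost:3306/mydb"
  query: "SELECT * FROM orders"
```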
Note: The destination and update_frequency fields apply only to materialized datasets.
destination (mandatory)
Specify the destination for your dataset.
Type: dict
- Fields:
  - type (str): Type of destination
    - “local” for local storage
  - format (str): Format of the data
    - “parquet” for Parquet format
  - path (str): Path to store the data
update_frequency
Specify the frequency of updates for your dataset.
Type: str
- “daily” for daily updates
- “weekly” for weekly updates
- “monthly” for monthly updates
- “yearly” for yearly updates
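Putting the two materialized-case fields together, a sketch might look like this; the storage path is a placeholder:

```yaml
destination:
  type: local
  format: parquet
  path: datasets/sales
update_frequency: daily
```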
order_by
Specify the columns to order by.
Type: list[str]
- Each string should be in the format “column_name DESC” or “column_name ASC”
limit
Specify the maximum number of records to load.
Type: int
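For example (the column name is hypothetical):

```yaml
order_by:
  - "created_at DESC"
limit: 10000
```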
View Configuration
The following sections detail all available configurations for view options in your schema.yaml file. Similar to views in SQL, you can define multiple tables and the relationships between them.
Example Configuration
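A sketch of a view schema, assuming two hypothetical tables, parents and children, joined on a parent_id column; only the from and to relation fields documented in the constraints below are used:

```yaml
name: family-view
source:
  view: true
columns:
  - name: parents.id
    type: integer
    description: Unique parent identifier
  - name: parents.name
    type: string
    description: Parent name
  - name: children.name
    type: string
    description: Child name
relations:
  - from: parents.id
    to: children.parent_id
```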
Constraints
- Mutual Exclusivity:
  - A schema cannot define both table and view simultaneously.
  - If source.view is true, then the schema represents a view.
- Column Format:
  - For views:
    - All columns must follow the format [table].[column].
    - from and to fields in relations must follow the [table].[column] format.
    - Example: parents.id, children.name.
- Relationships for Views:
  - Each table referenced in columns must have at least one relationship defined in relations.
  - Relationships must specify from and to attributes in the [table].[column] format.