# Defining Sources
Sources are the foundation of your Olytix Core project. They define the raw data tables that feed into your models and semantic layer, enabling Olytix Core to track lineage from warehouse tables through to business metrics.
## What is a Source?

A source is a YAML declaration that describes an existing table in your data warehouse. Sources do not create or modify tables; they document what already exists and enable the `source()` function in your models.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Data Source   │     │     Sources     │     │     Models      │
│   (Warehouse)   │────►│   (YAML Defs)   │────►│  (SQL + Jinja)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Sources provide:
- Documentation for raw data tables and columns
- Lineage tracking from source systems to metrics
- Freshness monitoring to detect stale data
- Validation of table and column existence
- Centralized configuration for database and schema settings
## Source File Structure

Source definitions live in the `sources/` directory as YAML files. Each file can contain multiple sources, and each source can contain multiple tables.

### Basic Source Definition

Create `sources/raw.yml`:
```yaml
version: 2

sources:
  - name: raw
    description: "Raw e-commerce data from the production database"
    database: analytics_db
    schema: raw_data
    tables:
      - name: orders
        description: "Customer order transactions"
      - name: customers
        description: "Customer master data"
      - name: products
        description: "Product catalog"
```
### Complete Source Definition

Here is a comprehensive example with all available options:
```yaml
version: 2

sources:
  - name: salesforce
    description: "Salesforce CRM data synced via Fivetran"
    database: raw_salesforce
    schema: salesforce

    # Default freshness for all tables in this source
    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour

    # Default loaded_at field for freshness checks
    loaded_at_field: _fivetran_synced

    # Source-level metadata
    meta:
      owner: sales-ops
      sync_tool: fivetran
      sync_schedule: "every 30 minutes"

    tables:
      - name: opportunity
        description: "Sales opportunities and deal pipeline"

        # Override source-level freshness
        freshness:
          warn_after:
            count: 6
            period: hour

        # Override loaded_at field
        loaded_at_field: last_modified_date

        # Table-level metadata
        meta:
          priority: high
          pii: false

        columns:
          - name: id
            description: "Salesforce opportunity ID"
            data_type: varchar
            tests:
              - unique
              - not_null
          - name: account_id
            description: "Foreign key to account"
            data_type: varchar
          - name: amount
            description: "Opportunity value in USD"
            data_type: decimal
          - name: stage
            description: "Sales stage (Prospecting, Qualification, etc.)"
            data_type: varchar
            tests:
              - not_null
          - name: close_date
            description: "Expected close date"
            data_type: date
```
## Using the source() Function

Once sources are defined, reference them in your models using the `source()` function:
```sql
-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    id AS order_id,
    customer_id,
    total_amount,
    status,
    created_at AS order_date
FROM {{ source('raw', 'orders') }}
```
### source() Syntax

```
{{ source('source_name', 'table_name') }}
```
| Parameter | Description |
|---|---|
| `source_name` | The `name` property of the source in your YAML |
| `table_name` | The `name` property of the table within that source |
### How source() Resolves

Olytix Core resolves `source('raw', 'orders')` to the fully qualified table name:

```sql
-- If database: analytics_db, schema: raw_data
SELECT * FROM analytics_db.raw_data.orders
```

This resolution is automatic and consistent across all models, so changes to database or schema names only require updating the source YAML.
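Conceptually, resolution is a lookup over the source definitions plus a fallback to project defaults when `database` or `schema` is omitted. A hypothetical sketch of those rules (`resolve_source` and the defaults dict are illustrative names, not Olytix Core's API):

```python
def resolve_source(sources: dict, project_defaults: dict,
                   source_name: str, table_name: str) -> str:
    """Resolve source('raw', 'orders') into a fully qualified relation name."""
    src = sources[source_name]
    # Fall back to project-level defaults when the source omits database/schema
    database = src.get("database", project_defaults["database"])
    schema = src.get("schema", project_defaults["schema"])
    if table_name not in src["tables"]:
        raise KeyError(f"Table {table_name!r} not declared in source {source_name!r}")
    return f"{database}.{schema}.{table_name}"

# The 'raw' source from the basic example above
sources = {
    "raw": {
        "database": "analytics_db",
        "schema": "raw_data",
        "tables": {"orders", "customers", "products"},
    }
}
defaults = {"database": "dev_db", "schema": "public"}

print(resolve_source(sources, defaults, "raw", "orders"))
# analytics_db.raw_data.orders
```

Raising on an undeclared table mirrors the "validation of table and column existence" benefit listed earlier: a typo in a model fails at compile time rather than at query time.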
## Warehouse-Specific Configurations

### PostgreSQL
```yaml
sources:
  - name: production
    database: production_db
    schema: public
    tables:
      - name: users
        description: "Application users table"
        columns:
          - name: id
            data_type: bigint
          - name: email
            data_type: varchar
          - name: created_at
            data_type: timestamp with time zone
```
### Snowflake
```yaml
sources:
  - name: raw
    database: RAW_DATABASE
    schema: PUBLIC

    # Snowflake-specific metadata
    meta:
      warehouse: TRANSFORM_WH
      role: TRANSFORMER_ROLE

    tables:
      - name: EVENTS
        description: "Raw event stream from application"
        columns:
          - name: EVENT_ID
            data_type: VARCHAR
          - name: EVENT_TYPE
            data_type: VARCHAR
          - name: PAYLOAD
            data_type: VARIANT
          - name: CREATED_AT
            data_type: TIMESTAMP_NTZ
```
### BigQuery
```yaml
sources:
  - name: analytics
    # BigQuery uses the project as the database
    database: my-gcp-project
    schema: analytics_dataset
    tables:
      - name: page_views
        description: "Website page view events"
        columns:
          - name: event_id
            data_type: STRING
          - name: user_id
            data_type: STRING
          - name: page_url
            data_type: STRING
          - name: event_timestamp
            data_type: TIMESTAMP
          - name: event_params
            data_type: RECORD
            meta:
              mode: REPEATED
```
## Multiple Sources

You can define multiple sources in a single file or across multiple files.

### Single File with Multiple Sources
```yaml
# sources/external.yml
version: 2

sources:
  - name: stripe
    database: raw_stripe
    schema: stripe
    description: "Payment data from Stripe"
    tables:
      - name: charges
      - name: customers
      - name: subscriptions

  - name: segment
    database: raw_segment
    schema: segment
    description: "Event data from Segment"
    tables:
      - name: tracks
      - name: identifies
      - name: pages
```
### Multiple Source Files

```
sources/
├── raw.yml         # Internal database sources
├── salesforce.yml  # Salesforce CRM data
├── stripe.yml      # Payment platform data
└── segment.yml     # Analytics event data
```
## Source Inheritance

Sources support inheritance through standard YAML anchors, which reduces duplication across definitions:
```yaml
version: 2

# Define common configuration
x-defaults: &source_defaults
  meta:
    team: data-engineering
    sync_frequency: hourly
  freshness:
    warn_after:
      count: 6
      period: hour

sources:
  - name: crm
    <<: *source_defaults
    database: raw_crm
    schema: salesforce
    tables:
      - name: accounts
      - name: contacts

  - name: marketing
    <<: *source_defaults
    database: raw_marketing
    schema: hubspot
    tables:
      - name: campaigns
      - name: emails
```
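The `<<: *source_defaults` line is YAML's standard merge-key syntax: the anchored mapping is merged in, and keys set directly on the source win. In Python terms the effect resembles a shallow dict merge; this is a sketch of the semantics, not how Olytix Core loads YAML:

```python
# The anchored defaults (x-defaults: &source_defaults)
source_defaults = {
    "meta": {"team": "data-engineering", "sync_frequency": "hourly"},
    "freshness": {"warn_after": {"count": 6, "period": "hour"}},
}

# <<: *source_defaults merges the anchor in; explicit keys override it
crm = {
    **source_defaults,
    "name": "crm",
    "database": "raw_crm",
    "schema": "salesforce",
    "tables": [{"name": "accounts"}, {"name": "contacts"}],
}

print(crm["meta"]["team"])                      # data-engineering
print(crm["freshness"]["warn_after"]["count"])  # 6
```

One caveat worth knowing: the merge is shallow. If a source sets its own `meta`, that entire mapping replaces the anchored one; nested keys from the defaults are not deep-merged.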
## Validating Sources

After defining sources, validate them using the CLI:

```shell
# Compile the project and validate sources
olytix-core compile

# List all sources
olytix-core list --resource-type source
```
Expected output:

```
✓ Loaded 3 sources with 12 tables
  - raw (4 tables)
  - salesforce (5 tables)
  - stripe (3 tables)
```
## Source Properties Reference

### Source-Level Properties

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Unique identifier for the source |
| `description` | string | No | Human-readable description |
| `database` | string | No | Database name (uses project default if omitted) |
| `schema` | string | No | Schema name (uses project default if omitted) |
| `freshness` | object | No | Default freshness configuration for all tables |
| `loaded_at_field` | string | No | Default timestamp column for freshness checks |
| `meta` | object | No | Custom metadata key-value pairs |
| `tables` | list | Yes | List of table definitions |
### Table-Level Properties

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Table name as it exists in the warehouse |
| `description` | string | No | Human-readable description |
| `freshness` | object | No | Override source-level freshness |
| `loaded_at_field` | string | No | Override timestamp column |
| `columns` | list | No | Column definitions |
| `meta` | object | No | Custom metadata |
### Column-Level Properties

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Column name |
| `description` | string | No | Human-readable description |
| `data_type` | string | No | Column data type |
| `tests` | list | No | Data tests to apply |
| `meta` | object | No | Custom metadata |
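The required/optional flags in the tables above lend themselves to a small structural check. A hypothetical validator sketch (not part of the Olytix Core CLI) that flags missing required properties at each level:

```python
# Required properties per level, mirroring the reference tables above
REQUIRED = {"source": {"name", "tables"}, "table": {"name"}, "column": {"name"}}

def validate_source(src: dict) -> list[str]:
    """Return a list of problems found in one source definition."""
    problems = [f"source missing required property: {p}"
                for p in REQUIRED["source"] - src.keys()]
    for table in src.get("tables", []):
        problems += [f"table missing required property: {p}"
                     for p in REQUIRED["table"] - table.keys()]
        for column in table.get("columns", []):
            problems += [f"column missing required property: {p}"
                         for p in REQUIRED["column"] - column.keys()]
    return problems

ok = {"name": "raw", "tables": [{"name": "orders", "columns": [{"name": "id"}]}]}
bad = {"tables": [{"description": "a table with no name"}]}

print(validate_source(ok))   # []
print(validate_source(bad))  # one problem for the source, one for the table
```

A check like this catches structural mistakes before compilation; it deliberately ignores optional properties such as `description` and `meta`.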
## Best Practices

### Naming Conventions

- Use lowercase, snake_case source names
- Match source names to the system they represent (`stripe`, `salesforce`, `segment`)
- Use descriptive names that indicate data origin
### Organization

- Group related tables in the same source
- Use separate files for different data systems
- Keep source files in the `sources/` directory
### Documentation

- Always include descriptions for sources and tables
- Document column meanings, especially for cryptic names
- Add metadata for ownership and data classification
### Configuration

- Set database and schema at the source level to avoid repetition
- Use freshness monitoring for critical data sources
- Define tests for primary keys and required columns
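Put together, these configuration practices yield a compact pattern: database and schema set once at the source level, freshness on the source, and key tests on the id column. A sketch reusing names from the earlier examples (`_loaded_at` is a hypothetical load-timestamp column):

```yaml
version: 2

sources:
  - name: raw
    database: analytics_db    # set once, inherited by every table
    schema: raw_data
    freshness:
      warn_after:
        count: 12
        period: hour
    loaded_at_field: _loaded_at
    tables:
      - name: orders
        columns:
          - name: id          # primary key: enforce uniqueness and presence
            tests:
              - unique
              - not_null
```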
## Next Steps

Now that you understand source definitions:

- **Freshness Monitoring** - Set up data freshness checks
- **Schema Documentation** - Document columns and data types
- **Your First Model** - Use sources in SQL models