Defining Sources

For Data Analysts

Sources are the foundation of your Olytix Core project. They define the raw data tables that feed into your models and semantic layer, enabling Olytix Core to track lineage from warehouse tables through to business metrics.

What is a Source?

A source is a YAML declaration that describes an existing table in your data warehouse. Sources do not create or modify tables; they document what already exists and enable the source() function in your models.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Data Source   │     │     Sources     │     │     Models      │
│   (Warehouse)   │────►│   (YAML Defs)   │────►│  (SQL + Jinja)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Sources provide:

  • Documentation for raw data tables and columns
  • Lineage tracking from source systems to metrics
  • Freshness monitoring to detect stale data
  • Validation of table and column existence
  • Centralized configuration for database and schema settings

Source File Structure

Source definitions live in the sources/ directory as YAML files. Each file can contain multiple sources, and each source can contain multiple tables.

Basic Source Definition

Create sources/raw.yml:

version: 2

sources:
  - name: raw
    description: "Raw e-commerce data from the production database"
    database: analytics_db
    schema: raw_data

    tables:
      - name: orders
        description: "Customer order transactions"

      - name: customers
        description: "Customer master data"

      - name: products
        description: "Product catalog"

Complete Source Definition

Here is a comprehensive example with all available options:

version: 2

sources:
  - name: salesforce
    description: "Salesforce CRM data synced via Fivetran"
    database: raw_salesforce
    schema: salesforce

    # Default freshness for all tables in this source
    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour

    # Default loaded_at field for freshness checks
    loaded_at_field: _fivetran_synced

    # Source-level metadata
    meta:
      owner: sales-ops
      sync_tool: fivetran
      sync_schedule: "every 30 minutes"

    tables:
      - name: opportunity
        description: "Sales opportunities and deal pipeline"

        # Override source-level freshness
        freshness:
          warn_after:
            count: 6
            period: hour

        # Override loaded_at field
        loaded_at_field: last_modified_date

        # Table-level metadata
        meta:
          priority: high
          pii: false

        columns:
          - name: id
            description: "Salesforce opportunity ID"
            data_type: varchar
            tests:
              - unique
              - not_null

          - name: account_id
            description: "Foreign key to account"
            data_type: varchar

          - name: amount
            description: "Opportunity value in USD"
            data_type: decimal

          - name: stage
            description: "Sales stage (Prospecting, Qualification, etc.)"
            data_type: varchar
            tests:
              - not_null

          - name: close_date
            description: "Expected close date"
            data_type: date
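The warn_after and error_after thresholds above amount to a simple staleness check: compare the newest value of the loaded_at_field column against the current time. A minimal Python sketch of that logic (the function name and status strings are illustrative, not Olytix Core internals):

```python
from datetime import datetime, timedelta, timezone

# Period units as they appear in the freshness config
PERIODS = {"minute": "minutes", "hour": "hours", "day": "days"}

def freshness_status(max_loaded_at, warn_after, error_after, now=None):
    """Return 'pass', 'warn', or 'error' for a table's staleness.

    warn_after / error_after mirror the YAML shape,
    e.g. {"count": 12, "period": "hour"}.
    """
    now = now or datetime.now(timezone.utc)
    age = now - max_loaded_at

    def to_delta(threshold):
        return timedelta(**{PERIODS[threshold["period"]]: threshold["count"]})

    if error_after and age > to_delta(error_after):
        return "error"
    if warn_after and age > to_delta(warn_after):
        return "warn"
    return "pass"

# Example: data last synced 18 hours ago, with the source-level thresholds
now = datetime(2024, 1, 2, 18, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
status = freshness_status(loaded,
                          {"count": 12, "period": "hour"},
                          {"count": 24, "period": "hour"},
                          now=now)
print(status)  # warn: older than the 12h warning, younger than the 24h error
```

Note that the table-level freshness block above replaces the source-level one for that table; it is an override, not a merge.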

Using the source() Function

Once sources are defined, reference them in your models using the source() function:

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    id AS order_id,
    customer_id,
    total_amount,
    status,
    created_at AS order_date
FROM {{ source('raw', 'orders') }}

source() Syntax

{{ source('source_name', 'table_name') }}
Parameter      Description
source_name    The name property of the source in your YAML
table_name     The name property of the table within that source

How source() Resolves

Olytix Core resolves source('raw', 'orders') to the fully qualified table name:

-- If database: analytics_db, schema: raw_data
SELECT * FROM analytics_db.raw_data.orders

This resolution is automatic and consistent across all models, so a change to a database or schema name only requires updating the source YAML, not every model that reads from it.
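Conceptually, resolution is a lookup from the (source_name, table_name) pair to the database and schema declared in YAML. A minimal sketch of that mapping (the dict layout and function name are illustrative, not Olytix Core internals):

```python
# Source definitions as they would be parsed from sources/raw.yml
SOURCES = {
    "raw": {
        "database": "analytics_db",
        "schema": "raw_data",
        "tables": {"orders", "customers", "products"},
    },
}

def resolve_source(source_name, table_name):
    """Resolve source('raw', 'orders') to a fully qualified table name."""
    src = SOURCES[source_name]
    if table_name not in src["tables"]:
        raise KeyError(f"Table {table_name!r} not declared in source {source_name!r}")
    return f'{src["database"]}.{src["schema"]}.{table_name}'

print(resolve_source("raw", "orders"))  # analytics_db.raw_data.orders
```

Referencing an undeclared table fails at compile time rather than in the warehouse, which is one reason to declare every raw table you query.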

Warehouse-Specific Configurations

PostgreSQL

sources:
  - name: production
    database: production_db
    schema: public

    tables:
      - name: users
        description: "Application users table"
        columns:
          - name: id
            data_type: bigint
          - name: email
            data_type: varchar
          - name: created_at
            data_type: timestamp with time zone

Snowflake

sources:
  - name: raw
    database: RAW_DATABASE
    schema: PUBLIC

    # Snowflake-specific metadata
    meta:
      warehouse: TRANSFORM_WH
      role: TRANSFORMER_ROLE

    tables:
      - name: EVENTS
        description: "Raw event stream from application"
        columns:
          - name: EVENT_ID
            data_type: VARCHAR
          - name: EVENT_TYPE
            data_type: VARCHAR
          - name: PAYLOAD
            data_type: VARIANT
          - name: CREATED_AT
            data_type: TIMESTAMP_NTZ

BigQuery

sources:
  - name: analytics
    # BigQuery uses project as database
    database: my-gcp-project
    schema: analytics_dataset

    tables:
      - name: page_views
        description: "Website page view events"
        columns:
          - name: event_id
            data_type: STRING
          - name: user_id
            data_type: STRING
          - name: page_url
            data_type: STRING
          - name: event_timestamp
            data_type: TIMESTAMP
          - name: event_params
            data_type: RECORD
            meta:
              mode: REPEATED

Multiple Sources

You can define multiple sources in a single file or across multiple files:

Single File with Multiple Sources

# sources/external.yml
version: 2

sources:
  - name: stripe
    database: raw_stripe
    schema: stripe
    description: "Payment data from Stripe"

    tables:
      - name: charges
      - name: customers
      - name: subscriptions

  - name: segment
    database: raw_segment
    schema: segment
    description: "Event data from Segment"

    tables:
      - name: tracks
      - name: identifies
      - name: pages

Multiple Source Files

sources/
├── raw.yml          # Internal database sources
├── salesforce.yml   # Salesforce CRM data
├── stripe.yml       # Payment platform data
└── segment.yml      # Analytics event data

Source Inheritance

Sources support YAML anchors and merge keys, which reduce duplication by sharing a common configuration block across sources:

version: 2

# Define common configuration
x-defaults: &source_defaults
  meta:
    team: data-engineering
    sync_frequency: hourly
  freshness:
    warn_after:
      count: 6
      period: hour

sources:
  - name: crm
    <<: *source_defaults
    database: raw_crm
    schema: salesforce
    tables:
      - name: accounts
      - name: contacts

  - name: marketing
    <<: *source_defaults
    database: raw_marketing
    schema: hubspot
    tables:
      - name: campaigns
      - name: emails
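The <<: *source_defaults merge key is standard YAML: the anchored mapping's top-level keys are merged into each source, and keys declared on the source itself win. The effect is the same as a dictionary merge, sketched here with plain Python dicts rather than a YAML parser:

```python
# The anchored defaults from x-defaults above
source_defaults = {
    "meta": {"team": "data-engineering", "sync_frequency": "hourly"},
    "freshness": {"warn_after": {"count": 6, "period": "hour"}},
}

# <<: *source_defaults merges the defaults in; explicit keys take precedence
crm = {
    **source_defaults,
    "name": "crm",
    "database": "raw_crm",
    "schema": "salesforce",
}

print(crm["meta"]["team"])  # data-engineering (inherited from the anchor)
print(crm["database"])      # raw_crm (declared on the source itself)
```

Note that merging is shallow: if a source declares its own freshness key, it replaces the anchored one entirely rather than deep-merging with it.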

Validating Sources

After defining sources, validate them using the CLI:

# Compile project and validate sources
olytix-core compile

# List all sources
olytix-core list --resource-type source

Expected output:

✓ Loaded 3 sources with 12 tables
- raw (4 tables)
- salesforce (5 tables)
- stripe (3 tables)

Source Properties Reference

Source-Level Properties

Property         Type    Required  Description
name             string  Yes       Unique identifier for the source
description      string  No        Human-readable description
database         string  No        Database name (uses project default if omitted)
schema           string  No        Schema name (uses project default if omitted)
freshness        object  No        Default freshness configuration for all tables
loaded_at_field  string  No        Default timestamp column for freshness checks
meta             object  No        Custom metadata key-value pairs
tables           list    Yes       List of table definitions

Table-Level Properties

Property         Type    Required  Description
name             string  Yes       Table name as it exists in the warehouse
description      string  No        Human-readable description
freshness        object  No        Override source-level freshness
loaded_at_field  string  No        Override timestamp column
columns          list    No        Column definitions
meta             object  No        Custom metadata

Column-Level Properties

Property     Type    Required  Description
name         string  Yes       Column name
description  string  No        Human-readable description
data_type    string  No        Column data type
tests        list    No        Data tests to apply
meta         object  No        Custom metadata

Best Practices

Naming Conventions

  • Use lowercase, snake_case for source names
  • Match source names to the system they represent (stripe, salesforce, segment)
  • Use descriptive names that indicate data origin

Organization

  • Group related tables in the same source
  • Use separate files for different data systems
  • Keep source files in the sources/ directory

Documentation

  • Always include descriptions for sources and tables
  • Document column meanings, especially for cryptic names
  • Add metadata for ownership and data classification

Configuration

  • Set database and schema at the source level to avoid repetition
  • Use freshness monitoring for critical data sources
  • Define tests for primary keys and required columns

Next Steps

Now that you understand source definitions:

  1. Freshness Monitoring - Set up data freshness checks
  2. Schema Documentation - Document columns and data types
  3. Your First Model - Use sources in SQL models