Defining Sources

For Data Analysts

Sources are the foundation of your Olytix Core project. They define the raw data tables that feed into your models and semantic layer, enabling Olytix Core to track lineage from warehouse tables through to business metrics.

What is a Source?

A source is a YAML declaration that describes an existing table in your data warehouse. Sources do not create or modify tables; they document what already exists and enable the source() function in your models.

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Data Source   │     │     Sources     │     │     Models      │
│   (Warehouse)   │────►│   (YAML Defs)   │────►│  (SQL + Jinja)  │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Sources provide:

  • Documentation for raw data tables and columns
  • Lineage tracking from source systems to metrics
  • Freshness monitoring to detect stale data
  • Validation of table and column existence
  • Centralized configuration for database and schema settings

Source File Structure

Source definitions live in the sources/ directory as YAML files. Each file can contain multiple sources, and each source can contain multiple tables.

Basic Source Definition

Create sources/raw.yml:

version: 2

sources:
  - name: raw
    description: "Raw e-commerce data from the production database"
    database: analytics_db
    schema: raw_data

    tables:
      - name: orders
        description: "Customer order transactions"

      - name: customers
        description: "Customer master data"

      - name: products
        description: "Product catalog"

Complete Source Definition

Here is a comprehensive example with all available options:

version: 2

sources:
  - name: salesforce
    description: "Salesforce CRM data synced via Fivetran"
    database: raw_salesforce
    schema: salesforce

    # Default freshness for all tables in this source
    freshness:
      warn_after:
        count: 12
        period: hour
      error_after:
        count: 24
        period: hour

    # Default loaded_at field for freshness checks
    loaded_at_field: _fivetran_synced

    # Source-level metadata
    meta:
      owner: sales-ops
      sync_tool: fivetran
      sync_schedule: "every 30 minutes"

    tables:
      - name: opportunity
        description: "Sales opportunities and deal pipeline"

        # Override source-level freshness
        freshness:
          warn_after:
            count: 6
            period: hour

        # Override loaded_at field
        loaded_at_field: last_modified_date

        # Table-level metadata
        meta:
          priority: high
          pii: false

        columns:
          - name: id
            description: "Salesforce opportunity ID"
            data_type: varchar
            tests:
              - unique
              - not_null

          - name: account_id
            description: "Foreign key to account"
            data_type: varchar

          - name: amount
            description: "Opportunity value in USD"
            data_type: decimal

          - name: stage
            description: "Sales stage (Prospecting, Qualification, etc.)"
            data_type: varchar
            tests:
              - not_null

          - name: close_date
            description: "Expected close date"
            data_type: date
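The warn_after and error_after thresholds above amount to a simple staleness check: compare the newest value of the loaded_at_field column against the current time. A minimal Python sketch of that logic (the function name and status strings are illustrative, not Olytix Core internals):

```python
from datetime import datetime, timedelta, timezone

# Period units as they appear in the freshness config
PERIODS = {"minute": "minutes", "hour": "hours", "day": "days"}

def freshness_status(max_loaded_at, warn_after, error_after, now=None):
    """Return 'pass', 'warn', or 'error' for a table's staleness.

    warn_after / error_after mirror the YAML shape,
    e.g. {"count": 12, "period": "hour"}.
    """
    now = now or datetime.now(timezone.utc)
    age = now - max_loaded_at

    def to_delta(threshold):
        return timedelta(**{PERIODS[threshold["period"]]: threshold["count"]})

    if error_after and age > to_delta(error_after):
        return "error"
    if warn_after and age > to_delta(warn_after):
        return "warn"
    return "pass"

# Example: data last synced 18 hours ago, with the source-level thresholds
now = datetime(2024, 1, 2, 18, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
status = freshness_status(loaded,
                          {"count": 12, "period": "hour"},
                          {"count": 24, "period": "hour"},
                          now=now)
print(status)  # warn: older than the 12h warning, younger than the 24h error
```

Note that the table-level freshness block above replaces the source-level one for that table; it is an override, not a merge.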

Using the source() Function

Once sources are defined, reference them in your models using the source() function:

-- models/staging/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    id AS order_id,
    customer_id,
    total_amount,
    status,
    created_at AS order_date
FROM {{ source('raw', 'orders') }}

source() Syntax

{{ source('source_name', 'table_name') }}
Parameter      Description
source_name    The name property of the source in your YAML
table_name     The name property of the table within that source

How source() Resolves

Olytix Core resolves source('raw', 'orders') to the fully qualified table name:

-- If database: analytics_db, schema: raw_data
SELECT * FROM analytics_db.raw_data.orders

This resolution is automatic and consistent across all models, so a change to a database or schema name only requires updating the source YAML, not every model that reads from it.
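Conceptually, resolution is a lookup from the (source_name, table_name) pair to the database and schema declared in YAML. A minimal sketch of that mapping (the dict layout and function name are illustrative, not Olytix Core internals):

```python
# Source definitions as they would be parsed from sources/raw.yml
SOURCES = {
    "raw": {
        "database": "analytics_db",
        "schema": "raw_data",
        "tables": {"orders", "customers", "products"},
    },
}

def resolve_source(source_name, table_name):
    """Resolve source('raw', 'orders') to a fully qualified table name."""
    src = SOURCES[source_name]
    if table_name not in src["tables"]:
        raise KeyError(f"Table {table_name!r} not declared in source {source_name!r}")
    return f'{src["database"]}.{src["schema"]}.{table_name}'

print(resolve_source("raw", "orders"))  # analytics_db.raw_data.orders
```

Referencing an undeclared table fails at compile time rather than in the warehouse, which is one reason to declare every raw table you query.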

Warehouse-Specific Configurations

PostgreSQL

sources:
  - name: production
    database: production_db
    schema: public

    tables:
      - name: users
        description: "Application users table"
        columns:
          - name: id
            data_type: bigint
          - name: email
            data_type: varchar
          - name: created_at
            data_type: timestamp with time zone

Snowflake

sources:
  - name: raw
    database: RAW_DATABASE
    schema: PUBLIC

    # Snowflake-specific metadata
    meta:
      warehouse: TRANSFORM_WH
      role: TRANSFORMER_ROLE

    tables:
      - name: EVENTS
        description: "Raw event stream from application"
        columns:
          - name: EVENT_ID
            data_type: VARCHAR
          - name: EVENT_TYPE
            data_type: VARCHAR
          - name: PAYLOAD
            data_type: VARIANT
          - name: CREATED_AT
            data_type: TIMESTAMP_NTZ

BigQuery

sources:
  - name: analytics
    # BigQuery uses project as database
    database: my-gcp-project
    schema: analytics_dataset

    tables:
      - name: page_views
        description: "Website page view events"
        columns:
          - name: event_id
            data_type: STRING
          - name: user_id
            data_type: STRING
          - name: page_url
            data_type: STRING
          - name: event_timestamp
            data_type: TIMESTAMP
          - name: event_params
            data_type: RECORD
            meta:
              mode: REPEATED

Multiple Sources

You can define multiple sources in a single file or across multiple files:

Single File with Multiple Sources

# sources/external.yml
version: 2

sources:
  - name: stripe
    database: raw_stripe
    schema: stripe
    description: "Payment data from Stripe"

    tables:
      - name: charges
      - name: customers
      - name: subscriptions

  - name: segment
    database: raw_segment
    schema: segment
    description: "Event data from Segment"

    tables:
      - name: tracks
      - name: identifies
      - name: pages

Multiple Source Files

sources/
├── raw.yml          # Internal database sources
├── salesforce.yml   # Salesforce CRM data
├── stripe.yml       # Payment platform data
└── segment.yml      # Analytics event data

Source Inheritance

Sources support YAML anchors and merge keys, which reduce duplication by sharing a common configuration block across sources:

version: 2

# Define common configuration
x-defaults: &source_defaults
  meta:
    team: data-engineering
    sync_frequency: hourly
  freshness:
    warn_after:
      count: 6
      period: hour

sources:
  - name: crm
    <<: *source_defaults
    database: raw_crm
    schema: salesforce
    tables:
      - name: accounts
      - name: contacts

  - name: marketing
    <<: *source_defaults
    database: raw_marketing
    schema: hubspot
    tables:
      - name: campaigns
      - name: emails
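The <<: *source_defaults merge key is standard YAML: the anchored mapping's top-level keys are merged into each source, and keys declared on the source itself win. The effect is the same as a dictionary merge, sketched here with plain Python dicts rather than a YAML parser:

```python
# The anchored defaults from x-defaults above
source_defaults = {
    "meta": {"team": "data-engineering", "sync_frequency": "hourly"},
    "freshness": {"warn_after": {"count": 6, "period": "hour"}},
}

# <<: *source_defaults merges the defaults in; explicit keys take precedence
crm = {
    **source_defaults,
    "name": "crm",
    "database": "raw_crm",
    "schema": "salesforce",
}

print(crm["meta"]["team"])  # data-engineering (inherited from the anchor)
print(crm["database"])      # raw_crm (declared on the source itself)
```

Note that merging is shallow: if a source declares its own freshness key, it replaces the anchored one entirely rather than deep-merging with it.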

Validating Sources

After defining sources, validate them using the CLI:

# Compile project and validate sources
olytix-core compile

# List all sources
olytix-core list --resource-type source

Expected output:

✓ Loaded 3 sources with 12 tables
- raw (4 tables)
- salesforce (5 tables)
- stripe (3 tables)

Source Properties Reference

Source-Level Properties

Property         Type    Required  Description
name             string  Yes       Unique identifier for the source
description      string  No        Human-readable description
database         string  No        Database name (uses project default if omitted)
schema           string  No        Schema name (uses project default if omitted)
freshness        object  No        Default freshness configuration for all tables
loaded_at_field  string  No        Default timestamp column for freshness checks
meta             object  No        Custom metadata key-value pairs
tables           list    Yes       List of table definitions

Table-Level Properties

Property         Type    Required  Description
name             string  Yes       Table name as it exists in the warehouse
description      string  No        Human-readable description
freshness        object  No        Override source-level freshness
loaded_at_field  string  No        Override timestamp column
columns          list    No        Column definitions
meta             object  No        Custom metadata

Column-Level Properties

Property     Type    Required  Description
name         string  Yes       Column name
description  string  No        Human-readable description
data_type    string  No        Column data type
tests        list    No        Data tests to apply
meta         object  No        Custom metadata

Best Practices

Naming Conventions

  • Use lowercase, snake_case for source names
  • Match source names to the system they represent (stripe, salesforce, segment)
  • Use descriptive names that indicate data origin

Organization

  • Group related tables in the same source
  • Use separate files for different data systems
  • Keep source files in the sources/ directory

Documentation

  • Always include descriptions for sources and tables
  • Document column meanings, especially for cryptic names
  • Add metadata for ownership and data classification

Configuration

  • Set database and schema at the source level to avoid repetition
  • Use freshness monitoring for critical data sources
  • Define tests for primary keys and required columns

Next Steps

Now that you understand source definitions:

  1. Freshness Monitoring - Set up data freshness checks
  2. Schema Documentation - Document columns and data types
  3. Your First Model - Use sources in SQL models