Skip to main content

Schema Documentation

For Data Analysts

Comprehensive schema documentation transforms your sources from opaque database tables into a self-service data catalog. Well-documented schemas improve data discoverability, reduce onboarding time, and establish a shared understanding of your data.

Why Document Schemas?

Schema documentation provides:

  • Discoverability - Help analysts find the right data
  • Understanding - Explain what columns mean in business terms
  • Data Quality - Enable automated testing and validation
  • Lineage - Power column-level lineage tracking
  • Governance - Document ownership, sensitivity, and compliance requirements

Column Documentation

Basic Column Definition

Document columns within each table:

version: 2

sources:
- name: raw
database: analytics_db
schema: raw_data

tables:
- name: orders
description: "Customer order transactions from the e-commerce platform"

columns:
- name: id
description: "Unique order identifier (primary key)"

- name: customer_id
description: "Foreign key to customers table"

- name: total_amount
description: "Order total in USD before tax and shipping"

- name: status
description: "Order fulfillment status"

- name: created_at
description: "Timestamp when the order was placed"

Complete Column Definition

Include all available properties for comprehensive documentation:

columns:
- name: customer_id
description: |
Unique identifier for the customer who placed the order.

This is a foreign key reference to the customers.id column.
For guest checkouts, this will be NULL.

# Data type as it exists in the warehouse
data_type: bigint

# Data quality tests
tests:
- not_null
- relationships:
to: source('raw', 'customers')
field: id

# Custom metadata
meta:
pii: false
business_owner: customer-success
source_system: shopify

Data Types

Documenting data types improves understanding and enables type checking.

Common Data Types by Warehouse

PostgreSQL

columns:
- name: id
data_type: bigint

- name: email
data_type: varchar(255)

- name: amount
data_type: numeric(12,2)

- name: is_active
data_type: boolean

- name: created_at
data_type: timestamp with time zone

- name: metadata
data_type: jsonb

- name: tags
data_type: text[]

Snowflake

columns:
- name: ID
data_type: NUMBER(38,0)

- name: EMAIL
data_type: VARCHAR(16777216)

- name: AMOUNT
data_type: NUMBER(12,2)

- name: IS_ACTIVE
data_type: BOOLEAN

- name: CREATED_AT
data_type: TIMESTAMP_NTZ

- name: METADATA
data_type: VARIANT

- name: TAGS
data_type: ARRAY

BigQuery

columns:
- name: id
data_type: INT64

- name: email
data_type: STRING

- name: amount
data_type: NUMERIC

- name: is_active
data_type: BOOL

- name: created_at
data_type: TIMESTAMP

- name: metadata
data_type: JSON

- name: address
data_type: STRUCT
meta:
fields:
- street: STRING
- city: STRING
- country: STRING

- name: tags
data_type: ARRAY<STRING>

Column Metadata

The meta property allows custom key-value pairs for additional documentation.

PII and Data Classification

columns:
- name: email
description: "Customer email address"
data_type: varchar
meta:
pii: true
data_classification: confidential
retention_days: 365
gdpr_relevant: true

- name: phone_number
description: "Customer phone number"
data_type: varchar
meta:
pii: true
data_classification: confidential
masking_policy: partial_mask

- name: order_total
description: "Order total amount"
data_type: decimal
meta:
pii: false
data_classification: internal

Business Context

columns:
- name: status
description: "Order fulfillment status"
data_type: varchar
meta:
business_owner: fulfillment-team
allowed_values:
- pending
- processing
- shipped
- delivered
- cancelled
- returned
update_frequency: real-time

- name: revenue
description: "Net revenue after discounts and refunds"
data_type: decimal
meta:
business_owner: finance-team
calculation: "gross_amount - discount_amount - refund_amount"
currency: USD
certified: true

Technical Metadata

columns:
- name: event_id
description: "Unique event identifier"
data_type: uuid
meta:
source_system: segment
original_name: anonymousId
indexed: true
partition_key: false

- name: event_timestamp
description: "When the event occurred"
data_type: timestamp
meta:
timezone: UTC
indexed: true
partition_key: true
cluster_order: 1

Column Tests

Add tests to validate data quality at the source level.

Built-in Tests

columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null

- name: email
description: "Customer email"
tests:
- not_null
- unique

- name: status
description: "Order status"
tests:
- not_null
- accepted_values:
values: ['pending', 'completed', 'cancelled']

- name: customer_id
description: "Foreign key to customers"
tests:
- not_null
- relationships:
to: source('raw', 'customers')
field: id

- name: amount
description: "Order amount (must be positive)"
tests:
- not_null

Test Reference

TestDescriptionParameters
uniqueColumn values must be uniqueNone
not_nullColumn cannot contain NULL valuesNone
accepted_valuesColumn must contain only specified valuesvalues: list of allowed values
relationshipsValues must exist in another tableto: target table, field: target column

Documenting Complex Types

JSON/JSONB Columns

columns:
- name: properties
description: |
Event properties stored as JSON.

Common fields:
- `page_url`: URL of the page where the event occurred
- `referrer`: Referring URL
- `utm_source`: Marketing campaign source
- `utm_medium`: Marketing campaign medium
data_type: jsonb
meta:
schema:
type: object
properties:
page_url:
type: string
referrer:
type: string
utm_source:
type: string
utm_medium:
type: string

Array Columns

columns:
- name: tags
description: |
Product tags as an array of strings.

Examples: ["electronics", "sale", "featured"]
data_type: text[]
meta:
element_type: string
max_length: 50

Nested Structures (BigQuery)

columns:
- name: address
description: "Customer shipping address"
data_type: STRUCT
meta:
fields:
street:
type: STRING
description: "Street address"
city:
type: STRING
description: "City name"
state:
type: STRING
description: "State/province code"
postal_code:
type: STRING
description: "ZIP or postal code"
country:
type: STRING
description: "ISO 3166-1 alpha-2 country code"

Complete Table Example

Here is a comprehensive example documenting a production orders table:

version: 2

sources:
- name: ecommerce
description: "E-commerce platform data from Shopify"
database: raw_shopify
schema: shopify

meta:
owner: data-engineering
sync_tool: fivetran
documentation_url: https://wiki.company.com/data/shopify

tables:
- name: orders
description: |
## Orders Table

Contains all customer orders from the Shopify e-commerce platform.

### Data Source
- **System**: Shopify
- **Sync**: Fivetran (every 15 minutes)
- **Historical Data**: Since 2020-01-01

### Key Relationships
- `customer_id` -> `customers.id`
- `billing_address_id` -> `addresses.id`
- `shipping_address_id` -> `addresses.id`

### Business Rules
- Orders with status='cancelled' should be excluded from revenue calculations
- Refunded orders have negative `total_amount` adjustment rows

meta:
owner: ecommerce-team
pii_columns: [email, phone, billing_address_id, shipping_address_id]
update_frequency: 15 minutes
row_count_estimate: 5000000

freshness:
warn_after:
count: 30
period: minute
error_after:
count: 1
period: hour

loaded_at_field: _fivetran_synced

columns:
- name: id
description: "Unique Shopify order ID (primary key)"
data_type: bigint
tests:
- unique
- not_null
meta:
primary_key: true

- name: order_number
description: "Human-readable order number shown to customers (e.g., #1001)"
data_type: varchar
tests:
- not_null

- name: customer_id
description: "Foreign key to the customer who placed the order"
data_type: bigint
tests:
- relationships:
to: source('ecommerce', 'customers')
field: id
meta:
foreign_key: customers.id

- name: email
description: "Customer email address at time of order"
data_type: varchar
tests:
- not_null
meta:
pii: true
masking_policy: email_mask

- name: total_amount
description: |
Order total in the shop's base currency (USD).
Includes line items, taxes, and shipping.
Does not include post-order refunds.
data_type: numeric(12,2)
tests:
- not_null
meta:
currency: USD
aggregation_type: sum

- name: subtotal_amount
description: "Sum of line item prices before tax and shipping"
data_type: numeric(12,2)
meta:
currency: USD

- name: tax_amount
description: "Total tax charged on the order"
data_type: numeric(12,2)
meta:
currency: USD

- name: shipping_amount
description: "Shipping charges for the order"
data_type: numeric(12,2)
meta:
currency: USD

- name: discount_amount
description: "Total discounts applied to the order"
data_type: numeric(12,2)
meta:
currency: USD

- name: status
description: |
Current order fulfillment status:
- `pending`: Order placed, not yet processed
- `processing`: Order being prepared
- `shipped`: Order shipped to customer
- `delivered`: Order delivered
- `cancelled`: Order cancelled before fulfillment
- `refunded`: Order refunded after delivery
data_type: varchar
tests:
- not_null
- accepted_values:
values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded']
meta:
allowed_values:
pending: Order placed, not yet processed
processing: Order being prepared
shipped: Order shipped to customer
delivered: Order delivered
cancelled: Order cancelled before fulfillment
refunded: Order refunded after delivery

- name: payment_status
description: "Payment processing status"
data_type: varchar
tests:
- accepted_values:
values: ['pending', 'authorized', 'captured', 'refunded', 'failed']

- name: currency_code
description: "ISO 4217 currency code for the order"
data_type: varchar(3)
meta:
format: ISO 4217

- name: created_at
description: "Timestamp when the order was placed"
data_type: timestamp with time zone
tests:
- not_null
meta:
timezone: UTC
indexed: true

- name: updated_at
description: "Timestamp of last order update"
data_type: timestamp with time zone
meta:
timezone: UTC

- name: _fivetran_synced
description: "Fivetran sync timestamp"
data_type: timestamp with time zone
meta:
system_column: true

Generating Documentation

View Documentation

# Generate and serve documentation site
olytix-core docs generate
olytix-core docs serve

# Export as JSON
olytix-core docs generate --format json --output docs.json

API Access

Query documentation via the Olytix Core API:

# Get source documentation
curl http://localhost:8000/api/v1/sources/ecommerce

# Get table columns
curl http://localhost:8000/api/v1/sources/ecommerce/tables/orders/columns

Documentation Best Practices

Write for Your Audience

  • Use business terms, not just technical jargon
  • Explain what the data represents, not just its structure
  • Include examples where helpful

Be Specific About Edge Cases

- name: discount_amount
description: |
Total discount applied to the order.

- Returns 0 for orders without discounts (not NULL)
- Multiple discount codes are summed
- Does not include loyalty point redemptions (see `points_redeemed`)

Document Business Logic

- name: is_first_order
description: |
Boolean flag indicating if this is the customer's first order.

**Calculation:** TRUE if no prior orders exist for this customer_id
with created_at < this order's created_at.

**Note:** Guest checkout orders (customer_id IS NULL) are always FALSE.

Include Data Quality Notes

- name: phone_number
description: |
Customer phone number.

**Data Quality Notes:**
- Format varies (some include country code, some don't)
- ~15% of records have NULL values
- Historical data before 2022 may have invalid formats

Reference External Documentation

meta:
documentation_url: https://wiki.company.com/data/orders
api_docs: https://shopify.dev/api/admin-rest/2024-01/resources/order
data_dictionary: https://docs.google.com/spreadsheets/d/abc123

Next Steps

Now that you understand schema documentation:

  1. Defining Sources - Review source configuration basics
  2. Freshness Monitoring - Set up data freshness checks
  3. Column-Level Lineage - Track data flow through columns
  4. Data Tests - Add quality tests to your sources