Schema Documentation
Comprehensive schema documentation transforms your sources from opaque database tables into a self-service data catalog. Well-documented schemas improve data discoverability, reduce onboarding time, and establish a shared understanding of your data.
Why Document Schemas?
Schema documentation provides:
- Discoverability - Help analysts find the right data
- Understanding - Explain what columns mean in business terms
- Data Quality - Enable automated testing and validation
- Lineage - Power column-level lineage tracking
- Governance - Document ownership, sensitivity, and compliance requirements
Column Documentation
Basic Column Definition
Document columns within each table:
version: 2
sources:
- name: raw
database: analytics_db
schema: raw_data
tables:
- name: orders
description: "Customer order transactions from the e-commerce platform"
columns:
- name: id
description: "Unique order identifier (primary key)"
- name: customer_id
description: "Foreign key to customers table"
- name: total_amount
description: "Order total in USD before tax and shipping"
- name: status
description: "Order fulfillment status"
- name: created_at
description: "Timestamp when the order was placed"
Complete Column Definition
Include all available properties for comprehensive documentation:
columns:
- name: customer_id
description: |
Unique identifier for the customer who placed the order.
This is a foreign key reference to the customers.id column.
For guest checkouts, this will be NULL.
# Data type as it exists in the warehouse
data_type: bigint
# Data quality tests
tests:
- not_null
- relationships:
to: source('raw', 'customers')
field: id
# Custom metadata
meta:
pii: false
business_owner: customer-success
source_system: shopify
Data Types
Documenting data types improves understanding and enables type checking.
Common Data Types by Warehouse
PostgreSQL
columns:
- name: id
data_type: bigint
- name: email
data_type: varchar(255)
- name: amount
data_type: numeric(12,2)
- name: is_active
data_type: boolean
- name: created_at
data_type: timestamp with time zone
- name: metadata
data_type: jsonb
- name: tags
data_type: text[]
Snowflake
columns:
- name: ID
data_type: NUMBER(38,0)
- name: EMAIL
data_type: VARCHAR(16777216)
- name: AMOUNT
data_type: NUMBER(12,2)
- name: IS_ACTIVE
data_type: BOOLEAN
- name: CREATED_AT
data_type: TIMESTAMP_NTZ
- name: METADATA
data_type: VARIANT
- name: TAGS
data_type: ARRAY
BigQuery
columns:
- name: id
data_type: INT64
- name: email
data_type: STRING
- name: amount
data_type: NUMERIC
- name: is_active
data_type: BOOL
- name: created_at
data_type: TIMESTAMP
- name: metadata
data_type: JSON
- name: address
data_type: STRUCT
meta:
fields:
- street: STRING
- city: STRING
- country: STRING
- name: tags
data_type: ARRAY<STRING>
Column Metadata
The meta property allows custom key-value pairs for additional documentation.
PII and Data Classification
columns:
- name: email
description: "Customer email address"
data_type: varchar
meta:
pii: true
data_classification: confidential
retention_days: 365
gdpr_relevant: true
- name: phone_number
description: "Customer phone number"
data_type: varchar
meta:
pii: true
data_classification: confidential
masking_policy: partial_mask
- name: order_total
description: "Order total amount"
data_type: decimal
meta:
pii: false
data_classification: internal
Business Context
columns:
- name: status
description: "Order fulfillment status"
data_type: varchar
meta:
business_owner: fulfillment-team
allowed_values:
- pending
- processing
- shipped
- delivered
- cancelled
- returned
update_frequency: real-time
- name: revenue
description: "Net revenue after discounts and refunds"
data_type: decimal
meta:
business_owner: finance-team
calculation: "gross_amount - discount_amount - refund_amount"
currency: USD
certified: true
Technical Metadata
columns:
- name: event_id
description: "Unique event identifier"
data_type: uuid
meta:
source_system: segment
original_name: anonymousId
indexed: true
partition_key: false
- name: event_timestamp
description: "When the event occurred"
data_type: timestamp
meta:
timezone: UTC
indexed: true
partition_key: true
cluster_order: 1
Column Tests
Add tests to validate data quality at the source level.
Built-in Tests
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: email
description: "Customer email"
tests:
- not_null
- unique
- name: status
description: "Order status"
tests:
- not_null
- accepted_values:
values: ['pending', 'completed', 'cancelled']
- name: customer_id
description: "Foreign key to customers"
tests:
- not_null
- relationships:
to: source('raw', 'customers')
field: id
- name: amount
description: "Order amount (must be positive)"
tests:
- not_null
Test Reference
| Test | Description | Parameters |
|---|---|---|
unique | Column values must be unique | None |
not_null | Column cannot contain NULL values | None |
accepted_values | Column must contain only specified values | values: list of allowed values |
relationships | Values must exist in another table | to: target table, field: target column |
Documenting Complex Types
JSON/JSONB Columns
columns:
- name: properties
description: |
Event properties stored as JSON.
Common fields:
- `page_url`: URL of the page where the event occurred
- `referrer`: Referring URL
- `utm_source`: Marketing campaign source
- `utm_medium`: Marketing campaign medium
data_type: jsonb
meta:
schema:
type: object
properties:
page_url:
type: string
referrer:
type: string
utm_source:
type: string
utm_medium:
type: string
Array Columns
columns:
- name: tags
description: |
Product tags as an array of strings.
Examples: ["electronics", "sale", "featured"]
data_type: text[]
meta:
element_type: string
max_length: 50
Nested Structures (BigQuery)
columns:
- name: address
description: "Customer shipping address"
data_type: STRUCT
meta:
fields:
street:
type: STRING
description: "Street address"
city:
type: STRING
description: "City name"
state:
type: STRING
description: "State/province code"
postal_code:
type: STRING
description: "ZIP or postal code"
country:
type: STRING
description: "ISO 3166-1 alpha-2 country code"
Complete Table Example
Here is a comprehensive example documenting a production orders table:
version: 2
sources:
- name: ecommerce
description: "E-commerce platform data from Shopify"
database: raw_shopify
schema: shopify
meta:
owner: data-engineering
sync_tool: fivetran
documentation_url: https://wiki.company.com/data/shopify
tables:
- name: orders
description: |
## Orders Table
Contains all customer orders from the Shopify e-commerce platform.
### Data Source
- **System**: Shopify
- **Sync**: Fivetran (every 15 minutes)
- **Historical Data**: Since 2020-01-01
### Key Relationships
- `customer_id` -> `customers.id`
- `billing_address_id` -> `addresses.id`
- `shipping_address_id` -> `addresses.id`
### Business Rules
- Orders with status='cancelled' should be excluded from revenue calculations
- Refunded orders have negative `total_amount` adjustment rows
meta:
owner: ecommerce-team
pii_columns: [email, phone, billing_address_id, shipping_address_id]
update_frequency: 15 minutes
row_count_estimate: 5000000
freshness:
warn_after:
count: 30
period: minute
error_after:
count: 1
period: hour
loaded_at_field: _fivetran_synced
columns:
- name: id
description: "Unique Shopify order ID (primary key)"
data_type: bigint
tests:
- unique
- not_null
meta:
primary_key: true
- name: order_number
description: "Human-readable order number shown to customers (e.g., #1001)"
data_type: varchar
tests:
- not_null
- name: customer_id
description: "Foreign key to the customer who placed the order"
data_type: bigint
tests:
- relationships:
to: source('ecommerce', 'customers')
field: id
meta:
foreign_key: customers.id
- name: email
description: "Customer email address at time of order"
data_type: varchar
tests:
- not_null
meta:
pii: true
masking_policy: email_mask
- name: total_amount
description: |
Order total in the shop's base currency (USD).
Includes line items, taxes, and shipping.
Does not include post-order refunds.
data_type: numeric(12,2)
tests:
- not_null
meta:
currency: USD
aggregation_type: sum
- name: subtotal_amount
description: "Sum of line item prices before tax and shipping"
data_type: numeric(12,2)
meta:
currency: USD
- name: tax_amount
description: "Total tax charged on the order"
data_type: numeric(12,2)
meta:
currency: USD
- name: shipping_amount
description: "Shipping charges for the order"
data_type: numeric(12,2)
meta:
currency: USD
- name: discount_amount
description: "Total discounts applied to the order"
data_type: numeric(12,2)
meta:
currency: USD
- name: status
description: |
Current order fulfillment status:
- `pending`: Order placed, not yet processed
- `processing`: Order being prepared
- `shipped`: Order shipped to customer
- `delivered`: Order delivered
- `cancelled`: Order cancelled before fulfillment
- `refunded`: Order refunded after delivery
data_type: varchar
tests:
- not_null
- accepted_values:
values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled', 'refunded']
meta:
allowed_values:
pending: Order placed, not yet processed
processing: Order being prepared
shipped: Order shipped to customer
delivered: Order delivered
cancelled: Order cancelled before fulfillment
refunded: Order refunded after delivery
- name: payment_status
description: "Payment processing status"
data_type: varchar
tests:
- accepted_values:
values: ['pending', 'authorized', 'captured', 'refunded', 'failed']
- name: currency_code
description: "ISO 4217 currency code for the order"
data_type: varchar(3)
meta:
format: ISO 4217
- name: created_at
description: "Timestamp when the order was placed"
data_type: timestamp with time zone
tests:
- not_null
meta:
timezone: UTC
indexed: true
- name: updated_at
description: "Timestamp of last order update"
data_type: timestamp with time zone
meta:
timezone: UTC
- name: _fivetran_synced
description: "Fivetran sync timestamp"
data_type: timestamp with time zone
meta:
system_column: true
Generating Documentation
View Documentation
# Generate and serve documentation site
olytix-core docs generate
olytix-core docs serve
# Export as JSON
olytix-core docs generate --format json --output docs.json
API Access
Query documentation via the Olytix Core API:
# Get source documentation
curl http://localhost:8000/api/v1/sources/ecommerce
# Get table columns
curl http://localhost:8000/api/v1/sources/ecommerce/tables/orders/columns
Documentation Best Practices
Write for Your Audience
- Use business terms, not just technical jargon
- Explain what the data represents, not just its structure
- Include examples where helpful
Be Specific About Edge Cases
- name: discount_amount
description: |
Total discount applied to the order.
- Returns 0 for orders without discounts (not NULL)
- Multiple discount codes are summed
- Does not include loyalty point redemptions (see `points_redeemed`)
Document Business Logic
- name: is_first_order
description: |
Boolean flag indicating if this is the customer's first order.
**Calculation:** TRUE if no prior orders exist for this customer_id
with created_at < this order's created_at.
**Note:** Guest checkout orders (customer_id IS NULL) are always FALSE.
Include Data Quality Notes
- name: phone_number
description: |
Customer phone number.
**Data Quality Notes:**
- Format varies (some include country code, some don't)
- ~15% of records have NULL values
- Historical data before 2022 may have invalid formats
Reference External Documentation
meta:
documentation_url: https://wiki.company.com/data/orders
api_docs: https://shopify.dev/api/admin-rest/2024-01/resources/order
data_dictionary: https://docs.google.com/spreadsheets/d/abc123
Next Steps
Now that you understand schema documentation:
- Defining Sources - Review source configuration basics
- Freshness Monitoring - Set up data freshness checks
- Column-Level Lineage - Track data flow through columns
- Data Tests - Add quality tests to your sources