BigQuery Integration

For Data Analysts

Google BigQuery is a serverless, highly scalable data warehouse with built-in machine learning capabilities. Olytix Core integrates with BigQuery using the official Google Cloud Python client library.

Prerequisites

  • Google Cloud project with BigQuery API enabled
  • Service account with BigQuery permissions (or user credentials for development)
  • Network access to Google Cloud APIs

Installation

Install the BigQuery connector:

pip install olytix-core[bigquery]

# Or install the client directly
pip install google-cloud-bigquery

Configuration

Basic Configuration

Add the warehouse configuration to your olytix-core_project.yml:

name: my_analytics
version: 1.0.0

warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_dataset
  credentials_path: ${OLYTIX_GCP_CREDENTIALS_PATH}

Environment Variables

Set your credentials path:

export OLYTIX_GCP_CREDENTIALS_PATH=/path/to/service-account.json

# Or use Google Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
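The `${OLYTIX_GCP_CREDENTIALS_PATH}` placeholder in the configuration above is resolved from the environment. As an illustration only (not necessarily Olytix Core's actual loader), Python's standard `os.path.expandvars` performs the same kind of substitution:

```python
import os

# Illustrative sketch of ${VAR} interpolation in config values;
# Olytix Core's own config loader may differ.
os.environ["OLYTIX_GCP_CREDENTIALS_PATH"] = "/path/to/service-account.json"

raw_value = "${OLYTIX_GCP_CREDENTIALS_PATH}"
resolved = os.path.expandvars(raw_value)
print(resolved)  # /path/to/service-account.json
```

If the variable is unset, `expandvars` leaves the placeholder untouched, which is a useful signal that the environment is misconfigured.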

Complete Configuration Options

warehouse:
  type: bigquery

  # Project settings
  project_id: my-gcp-project

  # Dataset (equivalent to a schema in other warehouses)
  database: analytics_dataset

  # Authentication
  credentials_path: /path/to/service-account.json

  # Query settings
  query_timeout: 300  # seconds

  # Optional settings
  extra:
    location: US                      # Dataset location
    maximum_bytes_billed: 1073741824  # 1 GB limit
    use_legacy_sql: false
    priority: INTERACTIVE             # or BATCH

Authentication Methods

Service Account Key File

Create a service account and download the JSON key file:

warehouse:
  type: bigquery
  project_id: my-gcp-project
  credentials_path: /etc/olytix-core/service-account.json

Required roles for the service account:

  • roles/bigquery.dataViewer - Read access to datasets
  • roles/bigquery.jobUser - Run queries
  • roles/bigquery.dataEditor - Create tables (for pre-aggregations)

# Create service account
gcloud iam service-accounts create olytix-core-analytics \
  --description="Olytix Core Analytics Service Account" \
  --display-name="Olytix Core Analytics"

# Grant permissions
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Needed only if Olytix Core will create tables (e.g. pre-aggregations)
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Create and download key
gcloud iam service-accounts keys create service-account.json \
  --iam-account=olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com

Application Default Credentials

For development environments or GCE/GKE:

warehouse:
  type: bigquery
  project_id: my-gcp-project
  # credentials_path omitted - uses ADC

Set up ADC:

# For local development
gcloud auth application-default login

# For GCE/GKE - uses metadata server automatically
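ADC checks credential sources in a fixed order: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, then gcloud's well-known credentials file, then the GCE/GKE metadata server. A simplified pure-Python sketch of that precedence, for illustration only (the real logic lives in the `google.auth` library):

```python
import os

def resolve_credentials_source(env):
    """Illustrative sketch of ADC precedence; google.auth implements the real thing."""
    # 1. Explicit key file via environment variable
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return ("key_file", env["GOOGLE_APPLICATION_CREDENTIALS"])
    # 2. gcloud's application-default credentials file
    adc_file = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if os.path.exists(adc_file):
        return ("gcloud_adc", adc_file)
    # 3. GCE/GKE metadata server (probed over the network in reality)
    return ("metadata_server", None)

print(resolve_credentials_source({"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/sa.json"}))
```

This is why omitting `credentials_path` works on GCE/GKE: the chain falls through to the metadata server without any files on disk.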

Workload Identity (GKE)

For Kubernetes deployments with Workload Identity:

# kubernetes/deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: olytix-core-sa
  annotations:
    iam.gke.io/gcp-service-account: olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com

# olytix-core_project.yml - no credentials needed
warehouse:
  type: bigquery
  project_id: my-gcp-project

Testing the Connection

Verify your connection is working:

olytix-core test-connection

Expected output:

Testing connection to bigquery://my-gcp-project
✓ Connection successful
✓ Project 'my-gcp-project' accessible
✓ Dataset 'analytics_dataset' found
✓ Found 28 tables

Using Python:

import asyncio

from olytix_core.engine.connectors.bigquery import BigQueryConnector
from olytix_core.engine.connectors.base import ConnectionConfig, ConnectorType

config = ConnectionConfig(
    type=ConnectorType.BIGQUERY,
    project_id="my-gcp-project",
    database="analytics_dataset",
    credentials_path="/path/to/service-account.json",
)

async def test():
    connector = BigQueryConnector(config)
    async with connector:
        result = await connector.execute("SELECT 1 as test")
        print(f"Connection successful: {result.data}")

        # Check query cost
        estimate = await connector.get_query_cost_estimate(
            "SELECT * FROM `my-gcp-project.analytics.orders`"
        )
        print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")

asyncio.run(test())

Defining Sources

Once connected, define your data sources:

# sources/bigquery_raw.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: ecommerce
    tables:
      - name: orders
        description: Raw e-commerce orders
        identifier: "my-gcp-project.raw_data.orders"
        columns:
          - name: order_id
            description: Order identifier
          - name: customer_id
            description: Customer reference
          - name: order_total
            description: Order total in USD
          - name: order_date
            description: Order timestamp

      - name: customers
        description: Customer master data
        freshness:
          warn_after: { count: 24, period: hour }
          error_after: { count: 48, period: hour }

BigQuery-Specific Features

Query Cost Estimation

Estimate query cost before execution:

estimate = await connector.get_query_cost_estimate(
    "SELECT * FROM `analytics.orders` WHERE order_date > '2024-01-01'"
)

print(f"Bytes to process: {estimate['bytes_processed']:,}")
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Cached: {estimate['is_cached']}")
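The arithmetic behind the estimate is simple: on-demand BigQuery billing charges per TiB scanned (the classic rate was $5/TiB; current rates vary by region and pricing edition). A tiny pure-Python helper that replicates the calculation:

```python
def bytes_to_usd(bytes_processed: int, usd_per_tib: float = 5.0) -> float:
    """Convert a dry-run byte estimate to an approximate on-demand cost.

    The default $5/TiB rate is illustrative; check current pricing for
    your region and edition before relying on the number.
    """
    return bytes_processed / 1024**4 * usd_per_tib

print(f"${bytes_to_usd(10 * 1024**3):.4f}")  # 10 GiB ≈ $0.0488
```

Because dry runs are free, estimating before executing costs nothing and can catch accidental full-table scans early.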

Query Metadata

Olytix Core captures BigQuery-specific metadata:

result = await connector.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
)

print(f"Job ID: {result.metadata.get('job_id')}")
print(f"Bytes processed: {result.bytes_processed:,}")
print(f"Bytes billed: {result.bytes_billed:,}")
print(f"Cache hit: {result.cache_hit}")
print(f"Slot milliseconds: {result.metadata.get('slot_millis')}")

DataFrame Loading

Load data directly from pandas DataFrames:

import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "revenue": [50000, 52000],
})

rows = await connector.load_from_dataframe(
    table="daily_revenue",
    dataframe=df,
    write_disposition="WRITE_TRUNCATE",  # or WRITE_APPEND
)
print(f"Loaded {rows} rows")

Create Table from Query

Create tables directly from SQL:

await connector.create_table_from_query(
    table="aggregated_orders",
    sql="""
        SELECT
            DATE_TRUNC(order_date, DAY) as date,
            region,
            SUM(order_total) as revenue
        FROM `analytics.orders`
        GROUP BY 1, 2
    """,
    write_disposition="WRITE_TRUNCATE",
)

Performance Optimization

Partitioning

Use table partitioning for large datasets:

-- Create partitioned table
CREATE TABLE `analytics.orders_partitioned`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, region
AS SELECT * FROM `analytics.orders`;

Reference partitioned tables in Olytix Core:

sources:
  - name: analytics
    tables:
      - name: orders_partitioned
        description: Partitioned orders table
        meta:
          partition_by: order_date
          partition_type: DAY

Clustering

Add clustering for frequently filtered columns:

-- Cluster by commonly filtered columns
CREATE TABLE `analytics.orders_clustered`
PARTITION BY DATE(order_date)
CLUSTER BY region, product_category
AS SELECT * FROM `analytics.orders`;

Materialized Views

Create materialized views for common aggregations:

CREATE MATERIALIZED VIEW `analytics.mv_daily_revenue`
PARTITION BY date
CLUSTER BY region
AS
SELECT
  DATE(order_date) as date,
  region,
  SUM(order_total) as revenue,
  COUNT(*) as order_count
FROM `analytics.orders`
GROUP BY 1, 2;

BI Engine Acceleration

Enable BI Engine for sub-second query response:

# Reserve BI Engine capacity (via gcloud or Console)
gcloud alpha bq bi-reservations create \
  --location=US \
  --size=10GB

Cost Management

Byte Limits

Set maximum bytes billed to prevent runaway queries:

warehouse:
  type: bigquery
  extra:
    maximum_bytes_billed: 10737418240  # 10 GB limit
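The limit is specified in raw bytes, so the binary multiples are worth double-checking. A hypothetical client-side pre-flight guard combining a dry-run byte estimate with the configured ceiling (BigQuery enforces `maximum_bytes_billed` server-side anyway, but a local check gives a friendlier error; the function names here are illustrative, not Olytix Core API):

```python
MAX_BYTES_BILLED = 10 * 1024**3  # 10 GiB, i.e. 10737418240 bytes

class QueryTooExpensive(Exception):
    pass

def check_budget(bytes_estimated: int, limit: int = MAX_BYTES_BILLED) -> int:
    """Raise before running a query whose dry-run estimate exceeds the limit.

    Illustrative sketch only: bytes_estimated would come from a real
    dry-run estimate such as get_query_cost_estimate().
    """
    if bytes_estimated > limit:
        raise QueryTooExpensive(
            f"estimated {bytes_estimated:,} bytes exceeds limit {limit:,}"
        )
    return bytes_estimated

check_budget(5 * 1024**3)  # within budget, returns the estimate
```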

Query Priority

Use BATCH priority for non-urgent queries (lower cost):

warehouse:
  type: bigquery
  extra:
    priority: BATCH

Slot Reservations

For predictable pricing, use flat-rate slots:

# Create slot commitment
gcloud alpha bq reservations create \
  --location=US \
  --slots=100 \
  --commitment-plan=FLEX

Cost Monitoring

Monitor query costs:

-- Query cost analysis (last 30 days)
SELECT
  user_email,
  DATE(creation_time) as query_date,
  COUNT(*) as query_count,
  SUM(total_bytes_billed) / (1024*1024*1024*1024) as tb_billed,
  -- Assumes the classic $5-per-TiB on-demand rate; adjust for your region and edition
  SUM(total_bytes_billed) / (1024*1024*1024*1024) * 5 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY estimated_cost_usd DESC;

Pre-aggregation Tables

Olytix Core can create and manage pre-aggregation tables in BigQuery:

# cubes/orders.yml
cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"

    pre_aggregations:
      - name: daily_revenue_by_region
        measures: [total_revenue, order_count]
        dimensions: [region]
        time_dimension: order_date
        granularity: day
        partition_granularity: month
        build_range_start: "2023-01-01"
        refresh_key:
          every: "1 hour"
        external:
          type: bigquery
          dataset: preagg_analytics

Troubleshooting

Permission Denied

Error: Access Denied: Project my-gcp-project: User does not have permission

Solutions:

  1. Verify service account has required roles
  2. Check project ID is correct
  3. Ensure BigQuery API is enabled

Dataset Not Found

Error: Not found: Dataset my-gcp-project:analytics_dataset

Solutions:

  1. Verify dataset name and location
  2. Check service account has access to dataset
  3. Ensure project ID is correct

Quota Exceeded

Error: Quota exceeded: Your project exceeded quota for concurrent queries

Solutions:

  1. Implement query queuing in Olytix Core
  2. Request quota increase from Google Cloud
  3. Use slot reservations for predictable capacity

Query Too Large

Error: Query exceeds resource limits

Solutions:

  1. Add maximum_bytes_billed limit
  2. Optimize query to scan less data
  3. Use partitioned tables with partition filters
  4. Consider pre-aggregations

Authentication Failed

Error: Could not automatically determine credentials

Solutions:

  1. Set GOOGLE_APPLICATION_CREDENTIALS environment variable
  2. Provide explicit credentials_path in configuration
  3. Run gcloud auth application-default login for development

Multi-Region Configuration

For global deployments:

# US region
warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_us
  extra:
    location: US

# EU region (separate configuration)
warehouse_eu:
  type: bigquery
  project_id: my-gcp-project-eu
  database: analytics_eu
  extra:
    location: EU
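With two warehouse blocks like these, something has to pick one at runtime. One hypothetical approach is a selector keyed on an environment variable (`OLYTIX_REGION` is an assumed name for illustration, not a documented Olytix Core setting):

```python
import os

WAREHOUSES = {
    "us": {"project_id": "my-gcp-project", "database": "analytics_us", "location": "US"},
    "eu": {"project_id": "my-gcp-project-eu", "database": "analytics_eu", "location": "EU"},
}

def pick_warehouse(env=os.environ):
    """Choose a warehouse config by region; OLYTIX_REGION is hypothetical."""
    region = env.get("OLYTIX_REGION", "us").lower()
    if region not in WAREHOUSES:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(WAREHOUSES)}")
    return WAREHOUSES[region]

print(pick_warehouse({"OLYTIX_REGION": "eu"})["location"])  # EU
```

Keeping each region's data in a separate project and dataset also keeps BigQuery's rule that a query can only reference datasets in a single location from becoming a runtime surprise.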

Next Steps