BigQuery Integration

For Data Analysts

Google BigQuery is a serverless, highly scalable data warehouse with built-in machine learning capabilities. Olytix Core integrates with BigQuery using the official Google Cloud Python client library.

Prerequisites

  • Google Cloud project with BigQuery API enabled
  • Service account with BigQuery permissions (or user credentials for development)
  • Network access to Google Cloud APIs

Installation

Install the BigQuery connector:

pip install olytix-core[bigquery]

# Or install the client directly
pip install google-cloud-bigquery

Configuration

Basic Configuration

Add the warehouse configuration to your olytix-core_project.yml:

name: my_analytics
version: 1.0.0

warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_dataset
  credentials_path: ${OLYTIX_GCP_CREDENTIALS_PATH}

Environment Variables

Set your credentials path:

export OLYTIX_GCP_CREDENTIALS_PATH=/path/to/service-account.json

# Or use Google Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
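The `${OLYTIX_GCP_CREDENTIALS_PATH}` placeholder in the configuration above is resolved from the environment. As an illustration only (not necessarily Olytix Core's actual loader), Python's standard `os.path.expandvars` performs the same kind of substitution:

```python
import os

# Illustrative sketch of ${VAR} interpolation in config values;
# Olytix Core's own config loader may differ.
os.environ["OLYTIX_GCP_CREDENTIALS_PATH"] = "/path/to/service-account.json"

raw_value = "${OLYTIX_GCP_CREDENTIALS_PATH}"
resolved = os.path.expandvars(raw_value)
print(resolved)  # /path/to/service-account.json
```

If the variable is unset, `expandvars` leaves the placeholder untouched, which is a useful signal that the environment is misconfigured.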

Complete Configuration Options

warehouse:
  type: bigquery

  # Project settings
  project_id: my-gcp-project

  # Dataset (equivalent to a schema in other warehouses)
  database: analytics_dataset

  # Authentication
  credentials_path: /path/to/service-account.json

  # Query settings
  query_timeout: 300  # seconds

  # Optional settings
  extra:
    location: US                      # Dataset location
    maximum_bytes_billed: 1073741824  # 1 GB limit
    use_legacy_sql: false
    priority: INTERACTIVE             # or BATCH

Authentication Methods

Service Account Key File

Create a service account and download the JSON key file:

warehouse:
  type: bigquery
  project_id: my-gcp-project
  credentials_path: /etc/olytix-core/service-account.json

Required roles for the service account:

  • roles/bigquery.dataViewer - Read access to datasets
  • roles/bigquery.jobUser - Run queries
  • roles/bigquery.dataEditor - Create tables (for pre-aggregations)

# Create service account
gcloud iam service-accounts create olytix-core-analytics \
  --description="Olytix Core Analytics Service Account" \
  --display-name="Olytix Core Analytics"

# Grant permissions
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Needed only if Olytix Core will create tables (e.g. pre-aggregations)
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

# Create and download key
gcloud iam service-accounts keys create service-account.json \
  --iam-account=olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com

Application Default Credentials

For development environments or GCE/GKE:

warehouse:
  type: bigquery
  project_id: my-gcp-project
  # credentials_path omitted - uses ADC

Set up ADC:

# For local development
gcloud auth application-default login

# For GCE/GKE - uses metadata server automatically
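ADC checks credential sources in a fixed order: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, then gcloud's well-known credentials file, then the GCE/GKE metadata server. A simplified pure-Python sketch of that precedence, for illustration only (the real logic lives in the `google.auth` library):

```python
import os

def resolve_credentials_source(env):
    """Illustrative sketch of ADC precedence; google.auth implements the real thing."""
    # 1. Explicit key file via environment variable
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return ("key_file", env["GOOGLE_APPLICATION_CREDENTIALS"])
    # 2. gcloud's application-default credentials file
    adc_file = os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    if os.path.exists(adc_file):
        return ("gcloud_adc", adc_file)
    # 3. GCE/GKE metadata server (probed over the network in reality)
    return ("metadata_server", None)

print(resolve_credentials_source({"GOOGLE_APPLICATION_CREDENTIALS": "/tmp/sa.json"}))
```

This is why omitting `credentials_path` works on GCE/GKE: the chain falls through to the metadata server without any files on disk.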

Workload Identity (GKE)

For Kubernetes deployments with Workload Identity:

# kubernetes/deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: olytix-core-sa
  annotations:
    iam.gke.io/gcp-service-account: olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com

# olytix-core_project.yml - no credentials needed
warehouse:
  type: bigquery
  project_id: my-gcp-project

Testing the Connection

Verify your connection is working:

olytix-core test-connection

Expected output:

Testing connection to bigquery://my-gcp-project
✓ Connection successful
✓ Project 'my-gcp-project' accessible
✓ Dataset 'analytics_dataset' found
✓ Found 28 tables

Using Python:

import asyncio

from olytix_core.engine.connectors.bigquery import BigQueryConnector
from olytix_core.engine.connectors.base import ConnectionConfig, ConnectorType

config = ConnectionConfig(
    type=ConnectorType.BIGQUERY,
    project_id="my-gcp-project",
    database="analytics_dataset",
    credentials_path="/path/to/service-account.json",
)

async def test():
    connector = BigQueryConnector(config)
    async with connector:
        result = await connector.execute("SELECT 1 as test")
        print(f"Connection successful: {result.data}")

        # Check query cost
        estimate = await connector.get_query_cost_estimate(
            "SELECT * FROM `my-gcp-project.analytics.orders`"
        )
        print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")

asyncio.run(test())

Defining Sources

Once connected, define your data sources:

# sources/bigquery_raw.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: ecommerce
    tables:
      - name: orders
        description: Raw e-commerce orders
        identifier: "my-gcp-project.raw_data.orders"
        columns:
          - name: order_id
            description: Order identifier
          - name: customer_id
            description: Customer reference
          - name: order_total
            description: Order total in USD
          - name: order_date
            description: Order timestamp

      - name: customers
        description: Customer master data
        freshness:
          warn_after: { count: 24, period: hour }
          error_after: { count: 48, period: hour }

BigQuery-Specific Features

Query Cost Estimation

Estimate query cost before execution:

estimate = await connector.get_query_cost_estimate(
    "SELECT * FROM `analytics.orders` WHERE order_date > '2024-01-01'"
)

print(f"Bytes to process: {estimate['bytes_processed']:,}")
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Cached: {estimate['is_cached']}")
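The arithmetic behind the estimate is simple: on-demand BigQuery billing charges per TiB scanned (the classic rate was $5/TiB; current rates vary by region and pricing edition). A tiny pure-Python helper that replicates the calculation:

```python
def bytes_to_usd(bytes_processed: int, usd_per_tib: float = 5.0) -> float:
    """Convert a dry-run byte estimate to an approximate on-demand cost.

    The default $5/TiB rate is illustrative; check current pricing for
    your region and edition before relying on the number.
    """
    return bytes_processed / 1024**4 * usd_per_tib

print(f"${bytes_to_usd(10 * 1024**3):.4f}")  # 10 GiB ≈ $0.0488
```

Because dry runs are free, estimating before executing costs nothing and can catch accidental full-table scans early.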

Query Metadata

Olytix Core captures BigQuery-specific metadata:

result = await connector.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
)

print(f"Job ID: {result.metadata.get('job_id')}")
print(f"Bytes processed: {result.bytes_processed:,}")
print(f"Bytes billed: {result.bytes_billed:,}")
print(f"Cache hit: {result.cache_hit}")
print(f"Slot milliseconds: {result.metadata.get('slot_millis')}")

DataFrame Loading

Load data directly from pandas DataFrames:

import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "revenue": [50000, 52000],
})

rows = await connector.load_from_dataframe(
    table="daily_revenue",
    dataframe=df,
    write_disposition="WRITE_TRUNCATE",  # or WRITE_APPEND
)
print(f"Loaded {rows} rows")

Create Table from Query

Create tables directly from SQL:

await connector.create_table_from_query(
    table="aggregated_orders",
    sql="""
        SELECT
            DATE_TRUNC(order_date, DAY) as date,
            region,
            SUM(order_total) as revenue
        FROM `analytics.orders`
        GROUP BY 1, 2
    """,
    write_disposition="WRITE_TRUNCATE",
)

Performance Optimization

Partitioning

Use table partitioning for large datasets:

-- Create partitioned table
CREATE TABLE `analytics.orders_partitioned`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, region
AS SELECT * FROM `analytics.orders`;

Reference partitioned tables in Olytix Core:

sources:
  - name: analytics
    tables:
      - name: orders_partitioned
        description: Partitioned orders table
        meta:
          partition_by: order_date
          partition_type: DAY

Clustering

Add clustering for frequently filtered columns:

-- Cluster by commonly filtered columns
CREATE TABLE `analytics.orders_clustered`
PARTITION BY DATE(order_date)
CLUSTER BY region, product_category
AS SELECT * FROM `analytics.orders`;

Materialized Views

Create materialized views for common aggregations:

CREATE MATERIALIZED VIEW `analytics.mv_daily_revenue`
PARTITION BY date
CLUSTER BY region
AS
SELECT
  DATE(order_date) as date,
  region,
  SUM(order_total) as revenue,
  COUNT(*) as order_count
FROM `analytics.orders`
GROUP BY 1, 2;

BI Engine Acceleration

Enable BI Engine for sub-second query response:

# Reserve BI Engine capacity (via gcloud or Console)
gcloud alpha bq bi-reservations create \
  --location=US \
  --size=10GB

Cost Management

Byte Limits

Set maximum bytes billed to prevent runaway queries:

warehouse:
  type: bigquery
  extra:
    maximum_bytes_billed: 10737418240  # 10 GB limit
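The limit is specified in raw bytes, so the binary multiples are worth double-checking. A hypothetical client-side pre-flight guard combining a dry-run byte estimate with the configured ceiling (BigQuery enforces `maximum_bytes_billed` server-side anyway, but a local check gives a friendlier error; the function names here are illustrative, not Olytix Core API):

```python
MAX_BYTES_BILLED = 10 * 1024**3  # 10 GiB, i.e. 10737418240 bytes

class QueryTooExpensive(Exception):
    pass

def check_budget(bytes_estimated: int, limit: int = MAX_BYTES_BILLED) -> int:
    """Raise before running a query whose dry-run estimate exceeds the limit.

    Illustrative sketch only: bytes_estimated would come from a real
    dry-run estimate such as get_query_cost_estimate().
    """
    if bytes_estimated > limit:
        raise QueryTooExpensive(
            f"estimated {bytes_estimated:,} bytes exceeds limit {limit:,}"
        )
    return bytes_estimated

check_budget(5 * 1024**3)  # within budget, returns the estimate
```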

Query Priority

Use BATCH priority for non-urgent queries (lower cost):

warehouse:
  type: bigquery
  extra:
    priority: BATCH

Slot Reservations

For predictable pricing, use flat-rate slots:

# Create slot commitment
gcloud alpha bq reservations create \
  --location=US \
  --slots=100 \
  --commitment-plan=FLEX

Cost Monitoring

Monitor query costs:

-- Query cost analysis (last 30 days)
SELECT
  user_email,
  DATE(creation_time) as query_date,
  COUNT(*) as query_count,
  SUM(total_bytes_billed) / (1024*1024*1024*1024) as tb_billed,
  -- Assumes the classic $5-per-TiB on-demand rate; adjust for your region and edition
  SUM(total_bytes_billed) / (1024*1024*1024*1024) * 5 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY estimated_cost_usd DESC;

Pre-aggregation Tables

Olytix Core can create and manage pre-aggregation tables in BigQuery:

# cubes/orders.yml
cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"

    pre_aggregations:
      - name: daily_revenue_by_region
        measures: [total_revenue, order_count]
        dimensions: [region]
        time_dimension: order_date
        granularity: day
        partition_granularity: month
        build_range_start: "2023-01-01"
        refresh_key:
          every: "1 hour"
        external:
          type: bigquery
          dataset: preagg_analytics

Troubleshooting

Permission Denied

Error: Access Denied: Project my-gcp-project: User does not have permission

Solutions:

  1. Verify service account has required roles
  2. Check project ID is correct
  3. Ensure BigQuery API is enabled

Dataset Not Found

Error: Not found: Dataset my-gcp-project:analytics_dataset

Solutions:

  1. Verify dataset name and location
  2. Check service account has access to dataset
  3. Ensure project ID is correct

Quota Exceeded

Error: Quota exceeded: Your project exceeded quota for concurrent queries

Solutions:

  1. Implement query queuing in Olytix Core
  2. Request quota increase from Google Cloud
  3. Use slot reservations for predictable capacity

Query Too Large

Error: Query exceeds resource limits

Solutions:

  1. Add maximum_bytes_billed limit
  2. Optimize query to scan less data
  3. Use partitioned tables with partition filters
  4. Consider pre-aggregations

Authentication Failed

Error: Could not automatically determine credentials

Solutions:

  1. Set GOOGLE_APPLICATION_CREDENTIALS environment variable
  2. Provide explicit credentials_path in configuration
  3. Run gcloud auth application-default login for development

Multi-Region Configuration

For global deployments:

# US region
warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_us
  extra:
    location: US

# EU region (separate configuration)
warehouse_eu:
  type: bigquery
  project_id: my-gcp-project-eu
  database: analytics_eu
  extra:
    location: EU
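With two warehouse blocks like these, something has to pick one at runtime. One hypothetical approach is a selector keyed on an environment variable (`OLYTIX_REGION` is an assumed name for illustration, not a documented Olytix Core setting):

```python
import os

WAREHOUSES = {
    "us": {"project_id": "my-gcp-project", "database": "analytics_us", "location": "US"},
    "eu": {"project_id": "my-gcp-project-eu", "database": "analytics_eu", "location": "EU"},
}

def pick_warehouse(env=os.environ):
    """Choose a warehouse config by region; OLYTIX_REGION is hypothetical."""
    region = env.get("OLYTIX_REGION", "us").lower()
    if region not in WAREHOUSES:
        raise ValueError(f"unknown region {region!r}; expected one of {sorted(WAREHOUSES)}")
    return WAREHOUSES[region]

print(pick_warehouse({"OLYTIX_REGION": "eu"})["location"])  # EU
```

Keeping each region's data in a separate project and dataset also keeps BigQuery's rule that a query can only reference datasets in a single location from becoming a runtime surprise.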

Next Steps