BigQuery Integration
Google BigQuery is a serverless, highly scalable data warehouse with built-in machine learning capabilities. Olytix Core integrates with BigQuery using the official Google Cloud Python client library.
Prerequisites
- Google Cloud project with BigQuery API enabled
- Service account with BigQuery permissions (or user credentials for development)
- Network access to Google Cloud APIs
Installation
Install the BigQuery connector:
pip install olytix-core[bigquery]
# Or install the client directly
pip install google-cloud-bigquery
Configuration
Basic Configuration
Add the warehouse configuration to your olytix-core_project.yml:
name: my_analytics
version: 1.0.0

warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_dataset
  credentials_path: ${OLYTIX_GCP_CREDENTIALS_PATH}
Environment Variables
Set your credentials path:
export OLYTIX_GCP_CREDENTIALS_PATH=/path/to/service-account.json
# Or use Google Application Default Credentials
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
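When both variables can be set, resolution order matters. A minimal sketch of one plausible precedence (the helper and its ordering are illustrative assumptions, not Olytix Core's actual behavior):

```python
import os
from typing import Optional

def resolve_credentials_path() -> Optional[str]:
    """Return an explicit key file path, or None to fall back to ADC.

    Assumed precedence: the Olytix-specific variable wins, then the
    standard Google variable; None means "let ADC figure it out".
    """
    return (
        os.environ.get("OLYTIX_GCP_CREDENTIALS_PATH")
        or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        or None
    )

# Example: the Olytix-specific variable takes priority.
os.environ["OLYTIX_GCP_CREDENTIALS_PATH"] = "/path/to/service-account.json"
print(resolve_credentials_path())  # /path/to/service-account.json
```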
Complete Configuration Options
warehouse:
  type: bigquery

  # Project settings
  project_id: my-gcp-project

  # Dataset (equivalent to schema in other warehouses)
  database: analytics_dataset

  # Authentication
  credentials_path: /path/to/service-account.json

  # Query settings
  query_timeout: 300  # seconds

  # Optional settings
  extra:
    location: US                      # Dataset location
    maximum_bytes_billed: 1073741824  # 1GB limit
    use_legacy_sql: false
    priority: INTERACTIVE             # or BATCH
Authentication Methods
Service Account (Recommended for Production)
Create a service account and download the JSON key file:
warehouse:
  type: bigquery
  project_id: my-gcp-project
  credentials_path: /etc/olytix-core/service-account.json
Required roles for the service account:
- roles/bigquery.dataViewer - Read access to datasets
- roles/bigquery.jobUser - Run queries
- roles/bigquery.dataEditor - Create tables (for pre-aggregations)
# Create service account
gcloud iam service-accounts create olytix-core-analytics \
  --description="Olytix Core Analytics Service Account" \
  --display-name="Olytix Core Analytics"

# Grant permissions
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"

gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.jobUser"

# Create and download key
gcloud iam service-accounts keys create service-account.json \
  --iam-account=olytix-core-analytics@my-gcp-project.iam.gserviceaccount.com
Application Default Credentials
For development environments or GCE/GKE:
warehouse:
  type: bigquery
  project_id: my-gcp-project
  # credentials_path omitted - uses ADC
Set up ADC:
# For local development
gcloud auth application-default login
# For GCE/GKE - uses metadata server automatically
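A quick way to sanity-check ADC locally is to look in the well-known credential locations. This is a heuristic helper of my own, not part of any library: it covers the environment variable and the gcloud-managed file (Linux/macOS path), but not the GCE/GKE metadata server, which also provides ADC:

```python
import os
from pathlib import Path

def adc_available() -> bool:
    """Heuristic: True if ADC can likely be resolved on this machine.

    Checks GOOGLE_APPLICATION_CREDENTIALS and the well-known file written
    by `gcloud auth application-default login`. Does NOT detect the
    GCE/GKE metadata server.
    """
    if os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return True
    well_known = Path.home() / ".config" / "gcloud" / "application_default_credentials.json"
    return well_known.exists()

print(adc_available())
```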
Workload Identity (GKE)
For Kubernetes deployments with Workload Identity:
# kubernetes/deployment.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: olytix-core-sa
  annotations:
    iam.gke.io/gcp-service-account: olytix-core-analytics@my-project.iam.gserviceaccount.com

# olytix-core_project.yml - no credentials needed
warehouse:
  type: bigquery
  project_id: my-gcp-project
Testing the Connection
Verify your connection is working:
olytix-core test-connection
Expected output:
Testing connection to bigquery://my-gcp-project
✓ Connection successful
✓ Project 'my-gcp-project' accessible
✓ Dataset 'analytics_dataset' found
✓ Found 28 tables
Using Python:
import asyncio

from olytix_core.engine.connectors.bigquery import BigQueryConnector
from olytix_core.engine.connectors.base import ConnectionConfig, ConnectorType

config = ConnectionConfig(
    type=ConnectorType.BIGQUERY,
    project_id="my-gcp-project",
    database="analytics_dataset",
    credentials_path="/path/to/service-account.json",
)

async def test():
    connector = BigQueryConnector(config)
    async with connector:
        result = await connector.execute("SELECT 1 as test")
        print(f"Connection successful: {result.data}")

        # Check query cost
        estimate = await connector.get_query_cost_estimate(
            "SELECT * FROM `my-gcp-project.analytics.orders`"
        )
        print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")

asyncio.run(test())
Defining Sources
Once connected, define your data sources:
# sources/bigquery_raw.yml
version: 2

sources:
  - name: raw
    database: raw_data
    schema: ecommerce
    tables:
      - name: orders
        description: Raw e-commerce orders
        identifier: "my-gcp-project.raw_data.orders"
        columns:
          - name: order_id
            description: Order identifier
          - name: customer_id
            description: Customer reference
          - name: order_total
            description: Order total in USD
          - name: order_date
            description: Order timestamp
      - name: customers
        description: Customer master data
        freshness:
          warn_after: { count: 24, period: hour }
          error_after: { count: 48, period: hour }
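The freshness thresholds above boil down to a simple age check. A sketch of the classification logic (a hypothetical helper, assuming warn_after and error_after are compared against the source's last load time):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def freshness_status(last_loaded: datetime,
                     warn_after: timedelta,
                     error_after: timedelta,
                     now: Optional[datetime] = None) -> str:
    """Classify a source as 'pass', 'warn', or 'error' by data age."""
    age = (now or datetime.now(timezone.utc)) - last_loaded
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

# Data loaded exactly 48h ago: past warn_after (24h), at the error boundary.
now = datetime(2024, 1, 3, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(freshness_status(loaded, timedelta(hours=24), timedelta(hours=48), now))  # warn
```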
BigQuery-Specific Features
Query Cost Estimation
Estimate query cost before execution:
estimate = await connector.get_query_cost_estimate(
    "SELECT * FROM `analytics.orders` WHERE order_date > '2024-01-01'"
)
print(f"Bytes to process: {estimate['bytes_processed']:,}")
print(f"Estimated cost: ${estimate['estimated_cost_usd']:.4f}")
print(f"Cached: {estimate['is_cached']}")
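The arithmetic behind estimated_cost_usd is plain on-demand pricing: bytes scanned divided by a TiB, times the per-TiB rate. A sketch of that math (the ~$5/TiB rate and the 10 MB per-query billing minimum are assumptions based on published on-demand pricing; check current rates):

```python
def estimate_cost_usd(bytes_processed: int, usd_per_tib: float = 5.0) -> float:
    """On-demand cost estimate: billed bytes / 1 TiB * rate.

    Assumes a 10 MB minimum billed per query, as in BigQuery's
    on-demand model; the default rate is illustrative.
    """
    TIB = 1024 ** 4
    billed = max(bytes_processed, 10 * 1024 * 1024)
    return billed / TIB * usd_per_tib

# 1 GiB scanned at $5/TiB:
print(f"{estimate_cost_usd(1073741824):.6f}")  # 0.004883
```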
Query Metadata
Olytix Core captures BigQuery-specific metadata:
result = await connector.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
)
print(f"Job ID: {result.metadata.get('job_id')}")
print(f"Bytes processed: {result.bytes_processed:,}")
print(f"Bytes billed: {result.bytes_billed:,}")
print(f"Cache hit: {result.cache_hit}")
print(f"Slot milliseconds: {result.metadata.get('slot_millis')}")
DataFrame Loading
Load data directly from pandas DataFrames:
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02"],
    "revenue": [50000, 52000],
})

rows = await connector.load_from_dataframe(
    table="daily_revenue",
    dataframe=df,
    write_disposition="WRITE_TRUNCATE",  # or WRITE_APPEND
)
print(f"Loaded {rows} rows")
Create Table from Query
Create tables directly from SQL:
await connector.create_table_from_query(
    table="aggregated_orders",
    sql="""
        SELECT
            DATE_TRUNC(order_date, DAY) as date,
            region,
            SUM(order_total) as revenue
        FROM `analytics.orders`
        GROUP BY 1, 2
    """,
    write_disposition="WRITE_TRUNCATE",
)
Performance Optimization
Partitioning
Use table partitioning for large datasets:
-- Create partitioned table
CREATE TABLE `analytics.orders_partitioned`
PARTITION BY DATE(order_date)
CLUSTER BY customer_id, region
AS SELECT * FROM `analytics.orders`;
Reference partitioned tables in Olytix Core:
sources:
  - name: analytics
    tables:
      - name: orders_partitioned
        description: Partitioned orders table
        meta:
          partition_by: order_date
          partition_type: DAY
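Partitioning only pays off when queries actually filter on the partition column, so BigQuery can prune partitions instead of scanning the whole table. As an illustration of what a query layer could do with the partition_by metadata, this hypothetical helper appends a pruning predicate (not Olytix Core's real API):

```python
def with_partition_filter(sql: str, partition_by: str, start: str, end: str) -> str:
    """Append a partition-pruning predicate to a generated query.

    Naive sketch: assumes a single-statement query and detects an
    existing WHERE clause by simple string matching.
    """
    clause = f"DATE({partition_by}) BETWEEN '{start}' AND '{end}'"
    joiner = " AND " if " WHERE " in sql.upper() else " WHERE "
    return sql + joiner + clause

print(with_partition_filter(
    "SELECT region, SUM(order_total) FROM `analytics.orders_partitioned` GROUP BY region",
    "order_date", "2024-01-01", "2024-01-31",
))
```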
Clustering
Add clustering for frequently filtered columns:
-- Cluster by commonly filtered columns
CREATE TABLE `analytics.orders_clustered`
PARTITION BY DATE(order_date)
CLUSTER BY region, product_category
AS SELECT * FROM `analytics.orders`;
Materialized Views
Create materialized views for common aggregations:
CREATE MATERIALIZED VIEW `analytics.mv_daily_revenue`
PARTITION BY date
CLUSTER BY region
AS
SELECT
  DATE(order_date) as date,
  region,
  SUM(order_total) as revenue,
  COUNT(*) as order_count
FROM `analytics.orders`
GROUP BY 1, 2;
BI Engine Acceleration
Enable BI Engine for sub-second query response:
# Reserve BI Engine capacity (via gcloud or Console)
gcloud alpha bq bi-reservations create \
  --location=US \
  --size=10GB
Cost Management
Byte Limits
Set maximum bytes billed to prevent runaway queries:
warehouse:
  type: bigquery
  extra:
    maximum_bytes_billed: 10737418240  # 10GB limit
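Since maximum_bytes_billed is specified in raw bytes, a tiny conversion helper avoids order-of-magnitude mistakes when writing the config (illustrative only; "GB" here is treated as GiB, i.e. 1024^3 bytes, matching the values in the examples above):

```python
def gib_to_bytes(gib: int) -> int:
    """Convert GiB to the raw byte count maximum_bytes_billed expects."""
    return gib * 1024 ** 3

print(gib_to_bytes(10))  # 10737418240, the 10GB limit shown above
```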
Query Priority
Use BATCH priority for non-urgent queries; batch jobs queue until idle slots are available and do not count toward the concurrent query limit (the on-demand cost per byte scanned is the same):
warehouse:
  type: bigquery
  extra:
    priority: BATCH
Slot Reservations
For predictable pricing, use flat-rate slots:
# Create slot commitment
gcloud alpha bq reservations create \
  --location=US \
  --slots=100 \
  --commitment-plan=FLEX
Cost Monitoring
Monitor query costs:
-- Query cost analysis (last 30 days)
SELECT
  user_email,
  DATE(creation_time) as query_date,
  COUNT(*) as query_count,
  SUM(total_bytes_billed) / (1024*1024*1024*1024) as tb_billed,
  SUM(total_bytes_billed) / (1024*1024*1024*1024) * 5 as estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY estimated_cost_usd DESC;
Pre-aggregation Tables
Olytix Core can create and manage pre-aggregation tables in BigQuery:
# cubes/orders.yml
cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"
    pre_aggregations:
      - name: daily_revenue_by_region
        measures: [total_revenue, order_count]
        dimensions: [region]
        time_dimension: order_date
        granularity: day
        partition_granularity: month
        build_range_start: "2023-01-01"
        refresh_key:
          every: "1 hour"
        external:
          type: bigquery
          dataset: preagg_analytics
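To make the mapping concrete, here is a rough sketch of the kind of SQL such a pre-aggregation definition could expand to; the helper and the generated SQL shape are my assumptions, not the engine's actual output:

```python
from typing import Dict, List

def preagg_sql(table: str, measures: Dict[str, str], dimensions: List[str],
               time_dimension: str, granularity: str) -> str:
    """Expand a pre-aggregation spec into a grouped rollup query.

    Truncates the time dimension to the configured granularity and
    groups by the time bucket plus each dimension, by position.
    """
    measure_cols = ", ".join(f"{expr} AS {name}" for name, expr in measures.items())
    dim_cols = ", ".join(dimensions)
    group_by = ", ".join(["1"] + [str(i + 2) for i in range(len(dimensions))])
    return (
        f"SELECT DATE_TRUNC({time_dimension}, {granularity.upper()}) AS {time_dimension}, "
        f"{dim_cols}, {measure_cols} "
        f"FROM {table} GROUP BY {group_by}"
    )

sql = preagg_sql(
    "`analytics.fct_orders`",
    {"total_revenue": "SUM(order_total)", "order_count": "COUNT(*)"},
    ["region"], "order_date", "day",
)
print(sql)
```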
Troubleshooting
Permission Denied
Error: Access Denied: Project my-gcp-project: User does not have permission
Solutions:
- Verify service account has required roles
- Check project ID is correct
- Ensure BigQuery API is enabled
Dataset Not Found
Error: Not found: Dataset my-gcp-project:analytics_dataset
Solutions:
- Verify dataset name and location
- Check service account has access to dataset
- Ensure project ID is correct
Quota Exceeded
Error: Quota exceeded: Your project exceeded quota for concurrent queries
Solutions:
- Implement query queuing in Olytix Core
- Request quota increase from Google Cloud
- Use slot reservations for predictable capacity
Query Too Large
Error: Query exceeds resource limits
Solutions:
- Add a maximum_bytes_billed limit
- Optimize the query to scan less data
- Use partitioned tables with partition filters
- Consider pre-aggregations
Authentication Failed
Error: Could not automatically determine credentials
Solutions:
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable
- Provide an explicit credentials_path in configuration
- Run gcloud auth application-default login for development
Multi-Region Configuration
For global deployments:
# US region
warehouse:
  type: bigquery
  project_id: my-gcp-project
  database: analytics_us
  extra:
    location: US

# EU region (separate configuration)
warehouse_eu:
  type: bigquery
  project_id: my-gcp-project-eu
  database: analytics_eu
  extra:
    location: EU