Data Profiling
Olytix Core's data profiling service helps you understand your data deeply through automated statistical analysis, quality scoring, and insight detection.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│                       DATA PROFILING PIPELINE                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   SAMPLING   │───▶│  STATISTICS  │───▶│ DISTRIBUTIONS│           │
│  │              │    │              │    │              │           │
│  │ Random       │    │ Numeric      │    │ Histograms   │           │
│  │ Stratified   │    │ String       │    │ Frequencies  │           │
│  │ Systematic   │    │ Temporal     │    │ Percentiles  │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│                                                 │                   │
│                                                 ▼                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   PROFILE    │◀───│   QUALITY    │◀───│   INSIGHTS   │           │
│  │    OUTPUT    │    │   SCORING    │    │  DETECTION   │           │
│  │              │    │              │    │              │           │
│  │ CubeProfile  │    │ Completeness │    │ Patterns     │           │
│  │ Dimension    │    │ Validity     │    │ Outliers     │           │
│  │ Measure      │    │ Consistency  │    │ Anomalies    │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key Features
Statistical Profiling
Comprehensive statistics for different data types:
Numeric Data:
- Count, min, max, mean, median
- Standard deviation, variance
- Percentiles (25th, 50th, 75th, 90th, 95th, 99th)
- Skewness and kurtosis
- Zero count, negative count
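The numeric statistics above can be sketched with the standard library. This is an illustrative helper (the service computes these internally; `numeric_stats` is a hypothetical name):

```python
import statistics

def numeric_stats(values):
    """Compute the core numeric statistics listed above."""
    clean = [v for v in values if v is not None]
    qs = statistics.quantiles(clean, n=100)  # 99 cut points: qs[k-1] approximates the kth percentile
    return {
        "count": len(clean),
        "min": min(clean),
        "max": max(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "std_dev": statistics.stdev(clean),
        "percentiles": {p: qs[p - 1] for p in (25, 50, 75, 90, 95, 99)},
        "zero_count": sum(1 for v in clean if v == 0),
        "negative_count": sum(1 for v in clean if v < 0),
    }

stats = numeric_stats([10, 20, 20, 30, 40, 50, 0, -5])
```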
String Data:
- Count, unique count, cardinality ratio
- Min/max/average length
- Empty count, null count
- Pattern detection (emails, phones, etc.)
- Most frequent values
Temporal Data:
- Count, min date, max date
- Date range and span
- Gaps in time series
- Day of week distribution
- Seasonality indicators
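Gap detection in a time series reduces to comparing consecutive dates against an expected step. A minimal sketch (`find_gaps` is a hypothetical helper, not the service's internal implementation):

```python
from datetime import date, timedelta

def find_gaps(dates, max_step=timedelta(days=1)):
    """Return (start, end) pairs where consecutive dates are further apart than max_step."""
    ordered = sorted(dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_step]

gaps = find_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
# one gap: Jan 2 -> Jan 5
```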
Distribution Analysis
Distribution Types:
├── Histogram: Bucket numeric values into ranges
├── Frequency: Count occurrences of each value
├── Percentile: Distribution across percentiles
└── Time Series: Distribution over time periods
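The Histogram distribution buckets numeric values into equal-width ranges. A minimal sketch of the bucketing logic (illustrative; the service's binning may differ):

```python
def histogram(values, bins=20):
    """Bucket numeric values into equal-width (start, end, count) ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero-width buckets for constant data
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bucket
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

buckets = histogram([1, 2, 3, 4, 10], bins=3)
```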
Quality Scoring
Each profile includes a quality score with components:
| Component | Description | Weight |
|---|---|---|
| Completeness | Percentage of non-null values | 30% |
| Validity | Values matching expected format/range | 25% |
| Uniqueness | Appropriate cardinality for the field | 20% |
| Consistency | Values following consistent patterns | 15% |
| Accuracy | Values within expected bounds | 10% |
Insight Detection
Automatically detect patterns and issues:
- Outliers: Values far from the norm
- Gaps: Missing periods in time series
- Skew: Highly skewed distributions
- High Cardinality: Dimensions with too many unique values
- Low Cardinality: Potential for optimization
- Patterns: Detected formats (dates, IDs, etc.)
Usage
Profile a Cube
from olytix_core.profiling.service import ProfilingService
from olytix_core.profiling.profiles.models import ProfileConfig
service = ProfilingService()
# Profile a cube with sample data
profile = await service.profile_cube(
cube_name="Orders",
data=orders_data, # List of dicts
dimension_names=["region", "product_category", "customer_type"],
measure_names=["revenue", "quantity", "discount"],
dimension_types={
"region": "STRING",
"product_category": "STRING",
"customer_type": "STRING"
},
measure_types={
"revenue": "NUMBER",
"quantity": "NUMBER",
"discount": "NUMBER"
},
config=ProfileConfig(
sample_size=10000,
compute_histograms=True,
compute_frequencies=True,
detect_insights=True
)
)
Profile Output
CubeProfile:
├── cube_name: "Orders"
├── row_count: 150000
├── profiled_at: "2024-01-15T10:30:00Z"
├── sample_size: 10000
│
├── dimension_profiles:
│ ├── region:
│ │ ├── type: "STRING"
│ │ ├── count: 10000
│ │ ├── unique_count: 5
│ │ ├── null_count: 0
│ │ ├── top_values: [("North", 3500), ("South", 2800), ...]
│ │ └── quality_score: 0.95
│ │
│ └── product_category:
│ ├── type: "STRING"
│ ├── count: 10000
│ ├── unique_count: 25
│ └── ...
│
├── measure_profiles:
│ └── revenue:
│ ├── type: "NUMBER"
│ ├── count: 10000
│ ├── null_count: 0
│ ├── min: 10.50
│ ├── max: 15000.00
│ ├── mean: 450.25
│ ├── median: 325.00
│ ├── std_dev: 380.50
│ ├── percentiles: {25: 150, 50: 325, 75: 600, 90: 950, 99: 3500}
│ ├── histogram: [Bucket(0-100, 1500), Bucket(100-500, 5500), ...]
│ └── quality_score: 0.92
│
├── overall_quality_score: 0.91
│
└── insights:
├── Insight("revenue has positive skew (2.3)", severity="INFO")
├── Insight("15 outliers detected in revenue", severity="WARNING")
└── Insight("product_category has high cardinality", severity="INFO")
Profile a Single Dimension
profile = await service.profile_dimension(
dimension_name="region",
values=region_values,
dim_type="STRING"
)
# Access profile data
print(f"Unique values: {profile.unique_count}")
print(f"Top values: {profile.top_values}")
print(f"Quality score: {profile.quality_score}")
Profile a Single Measure
profile = await service.profile_measure(
measure_name="revenue",
values=revenue_values,
measure_type="NUMBER",
config=ProfileConfig(
compute_histograms=True,
histogram_bins=20
)
)
# Access statistics
print(f"Mean: {profile.mean}")
print(f"Median: {profile.median}")
print(f"Std Dev: {profile.std_dev}")
print(f"P99: {profile.percentiles[99]}")
API Endpoints
Profile Cube
POST /api/v1/profiling/cubes/<cube_name>/profile
Content-Type: application/json
{
"config": {
"sample_size": 10000,
"compute_histograms": true,
"histogram_bins": 20,
"compute_frequencies": true,
"top_n_values": 10,
"detect_insights": true
},
"filters": {
"date_range": {
"start": "2024-01-01",
"end": "2024-01-31"
}
}
}
Get Profile
GET /api/v1/profiling/cubes/<cube_name>/profile?
include_histograms=true&
include_insights=true
Profile Dimension
POST /api/v1/profiling/dimensions/<cube>/<dimension>/profile
Get Quality Score
GET /api/v1/profiling/cubes/<cube_name>/quality-score
Sampling Strategies
For large datasets, profiling uses intelligent sampling:
Random Sampling (Default)
from olytix_core.profiling.sampling.strategies import RandomSampler
sampler = RandomSampler(seed=42) # Reproducible sampling
service = ProfilingService(sampler=sampler)
Stratified Sampling
Ensures representation across dimension values:
from olytix_core.profiling.sampling.strategies import StratifiedSampler
sampler = StratifiedSampler(
stratify_by="region",
min_per_stratum=100
)
Systematic Sampling
Every nth record:
from olytix_core.profiling.sampling.strategies import SystematicSampler
sampler = SystematicSampler(interval=10) # Every 10th record
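The systematic strategy reduces to slicing every nth record, which can be sketched in one line (illustrative only, not the sampler's internals):

```python
def systematic_sample(records, interval):
    """Take every `interval`-th record, starting from the first."""
    return records[::interval]

sample = systematic_sample(list(range(100)), interval=10)
# 10 records: 0, 10, 20, ..., 90
```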
Quality Scoring Details
Completeness Score
completeness = (total_count - null_count) / total_count
# Example: 9500 non-null out of 10000
# Completeness = 0.95
Validity Score
For numeric measures:
validity = values_in_expected_range / total_count
# Example: Revenue should be > 0
# 9800 positive values out of 10000 = 0.98
For string dimensions:
validity = values_matching_pattern / total_count
# Example: Email format validation
# 9200 valid emails out of 10000 = 0.92
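The email-format example above could be computed with a simple regex check. The pattern and `string_validity` helper below are illustrative assumptions; the service's pattern detectors may differ:

```python
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose email pattern

def string_validity(values, pattern=EMAIL):
    """Fraction of values that are non-null and match the expected pattern."""
    matching = sum(1 for v in values if v and pattern.match(v))
    return matching / len(values)

score = string_validity(["a@b.com", "c@d.org", "not-an-email", None])
# 2 valid out of 4 -> 0.5
```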
Uniqueness Score
# For dimensions that should be unique (like IDs):
uniqueness = unique_count / total_count
# For dimensions with expected cardinality:
uniqueness = 1 - abs(expected_cardinality - actual_cardinality) / expected_cardinality
Overall Score
overall = (
completeness * 0.30 +
validity * 0.25 +
uniqueness * 0.20 +
consistency * 0.15 +
accuracy * 0.10
)
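Putting the component scores together as a runnable function, with the weights from the table above:

```python
def overall_score(completeness, validity, uniqueness, consistency, accuracy):
    """Weighted average of the five quality components (weights sum to 1.0)."""
    return (completeness * 0.30 + validity * 0.25 + uniqueness * 0.20
            + consistency * 0.15 + accuracy * 0.10)

score = overall_score(0.95, 0.98, 0.85, 0.90, 0.88)
# 0.285 + 0.245 + 0.17 + 0.135 + 0.088 = 0.923
```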
Insight Detection
Outlier Detection
# IQR method, with quartiles from the standard library
import statistics

q1, _, q3 = statistics.quantiles(values, n=4)  # n=4 yields the three quartile cut points
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]
Skewness Detection
if abs(skewness) > 2:
insight = Insight(
message=f"Distribution is highly skewed ({skewness:.2f})",
severity="WARNING",
recommendation="Consider log transformation for analysis"
)
Cardinality Analysis
cardinality_ratio = unique_count / total_count
if cardinality_ratio > 0.9:
insight = Insight(
message="High cardinality - may not be suitable as a dimension",
severity="WARNING"
)
elif cardinality_ratio < 0.001:
insight = Insight(
message="Very low cardinality - consider as a filter",
severity="INFO"
)
Configuration Options
ProfileConfig:
├── sample_size: int = 10000 # Max rows to sample
├── compute_histograms: bool = True
├── histogram_bins: int = 20
├── compute_frequencies: bool = True
├── top_n_values: int = 10
├── compute_percentiles: bool = True
├── percentile_list: list = [25, 50, 75, 90, 95, 99]
├── detect_insights: bool = True
├── insight_sensitivity: float = 0.8
└── quality_thresholds: dict = {
"completeness_warning": 0.95,
"validity_warning": 0.90,
"uniqueness_warning": 0.80
}
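For example, a stricter configuration for a quality-sensitive cube might look like the following (field names as listed above; the exact values are illustrative):

```python
from olytix_core.profiling.profiles.models import ProfileConfig

config = ProfileConfig(
    sample_size=50000,           # larger sample for tighter estimates
    histogram_bins=50,
    insight_sensitivity=0.9,     # flag insights more aggressively
    quality_thresholds={
        "completeness_warning": 0.99,
        "validity_warning": 0.95,
        "uniqueness_warning": 0.90,
    },
)
```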
Best Practices
When to Profile
- New data sources: Profile before integrating
- Schema changes: Profile after modifications
- Data quality issues: Profile to diagnose problems
- Regular monitoring: Schedule periodic profiling
Interpreting Results
- Quality scores below 0.8: Investigate data issues
- High outlier counts: Review data validation rules
- Unexpected cardinality: Check for data integrity issues
- Pattern mismatches: Verify data transformations
Performance Tips
- Use sampling for large datasets (>100k rows)
- Profile incrementally for time-series data
- Cache profiles for frequently accessed cubes
- Schedule off-peak for full profiles
Next Steps
- Query Assistant - Explore profiled data
- Anomaly Detection - Monitor data quality
- Data Tests - Automate quality checks