Data Profiling

For Data Analysts

Olytix Core's data profiling service helps you understand your data deeply through automated statistical analysis, quality scoring, and insight detection.

Overview

┌─────────────────────────────────────────────────────────────────────┐
│ DATA PROFILING PIPELINE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ SAMPLING │───▶│ STATISTICS │───▶│ DISTRIBUTIONS │ │
│ │ │ │ │ │ │ │
│ │ Random │ │ Numeric │ │ Histograms │ │
│ │ Stratified │ │ String │ │ Frequencies │ │
│ │ Systematic │ │ Temporal │ │ Percentiles │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PROFILE │◀───│ QUALITY │◀───│ INSIGHTS │ │
│ │ OUTPUT │ │ SCORING │ │ DETECTION │ │
│ │ │ │ │ │ │ │
│ │ CubeProfile │ │ Completeness │ │ Patterns │ │
│ │ Dimension │ │ Validity │ │ Outliers │ │
│ │ Measure │ │ Consistency │ │ Anomalies │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘

Key Features

Statistical Profiling

Comprehensive statistics for different data types:

Numeric Data:

  • Count, min, max, mean, median
  • Standard deviation, variance
  • Percentiles (25th, 50th, 75th, 90th, 95th, 99th)
  • Skewness and kurtosis
  • Zero count, negative count

String Data:

  • Count, unique count, cardinality ratio
  • Min/max/average length
  • Empty count, null count
  • Pattern detection (emails, phones, etc.)
  • Most frequent values

Temporal Data:

  • Count, min date, max date
  • Date range and span
  • Gaps in time series
  • Day of week distribution
  • Seasonality indicators
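
As a sketch of how the numeric statistics above fit together, the following pure-Python helper computes most of them with the standard library. `profile_numeric` is a hypothetical illustration, not part of the Olytix Core API, and uses a simple nearest-rank percentile for brevity:

```python
import statistics

def profile_numeric(values):
    """Compute the core numeric statistics listed above (illustrative sketch)."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank percentile: index into the sorted values.
    pct = lambda p: ordered[min(n - 1, int(n * p / 100))]
    return {
        "count": n,
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "std_dev": statistics.pstdev(ordered),
        "percentiles": {p: pct(p) for p in (25, 50, 75, 90, 95, 99)},
        "zero_count": sum(1 for v in values if v == 0),
        "negative_count": sum(1 for v in values if v < 0),
    }

stats = profile_numeric([10, 20, 0, -5, 30, 40, 50])
```

Skewness and kurtosis need higher moments and are typically delegated to a numeric library.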

Distribution Analysis

Distribution Types:
├── Histogram: Bucket numeric values into ranges
├── Frequency: Count occurrences of each value
├── Percentile: Distribution across percentiles
└── Time Series: Distribution over time periods
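
The Histogram type, for example, buckets numeric values into equal-width ranges. A minimal sketch (the `histogram` helper below is hypothetical, not the library's implementation):

```python
def histogram(values, bins=5):
    """Bucket numeric values into equal-width ranges (illustrative sketch)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # guard against a single-valued column
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max into the last bucket
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), c) for i, c in enumerate(counts)]

buckets = histogram([1, 2, 2, 3, 8, 9, 10], bins=3)
```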

Quality Scoring

Each profile includes a quality score with components:

| Component    | Description                           | Weight |
|--------------|---------------------------------------|--------|
| Completeness | Percentage of non-null values         | 30%    |
| Validity     | Values matching expected format/range | 25%    |
| Uniqueness   | Appropriate cardinality for the field | 20%    |
| Consistency  | Values following consistent patterns  | 15%    |
| Accuracy     | Values within expected bounds         | 10%    |

Insight Detection

Automatically detect patterns and issues:

  • Outliers: Values far from the norm
  • Gaps: Missing periods in time series
  • Skew: Highly skewed distributions
  • High Cardinality: Dimensions with too many unique values
  • Low Cardinality: Potential for optimization
  • Patterns: Detected formats (dates, IDs, etc.)

Usage

Profile a Cube

from olytix_core.profiling.service import ProfilingService
from olytix_core.profiling.profiles.models import ProfileConfig

service = ProfilingService()

# Profile a cube with sample data
profile = await service.profile_cube(
    cube_name="Orders",
    data=orders_data,  # List of dicts
    dimension_names=["region", "product_category", "customer_type"],
    measure_names=["revenue", "quantity", "discount"],
    dimension_types={
        "region": "STRING",
        "product_category": "STRING",
        "customer_type": "STRING"
    },
    measure_types={
        "revenue": "NUMBER",
        "quantity": "NUMBER",
        "discount": "NUMBER"
    },
    config=ProfileConfig(
        sample_size=10000,
        compute_histograms=True,
        compute_frequencies=True,
        detect_insights=True
    )
)

Profile Output

CubeProfile:
├── cube_name: "Orders"
├── row_count: 150000
├── profiled_at: "2024-01-15T10:30:00Z"
├── sample_size: 10000

├── dimension_profiles:
│ ├── region:
│ │ ├── type: "STRING"
│ │ ├── count: 10000
│ │ ├── unique_count: 5
│ │ ├── null_count: 0
│ │ ├── top_values: [("North", 3500), ("South", 2800), ...]
│ │ └── quality_score: 0.95
│ │
│ └── product_category:
│ ├── type: "STRING"
│ ├── count: 10000
│ ├── unique_count: 25
│ └── ...

├── measure_profiles:
│ └── revenue:
│ ├── type: "NUMBER"
│ ├── count: 10000
│ ├── null_count: 0
│ ├── min: 10.50
│ ├── max: 15000.00
│ ├── mean: 450.25
│ ├── median: 325.00
│ ├── std_dev: 380.50
│ ├── percentiles: {25: 150, 50: 325, 75: 600, 90: 950, 99: 3500}
│ ├── histogram: [Bucket(0-100, 1500), Bucket(100-500, 5500), ...]
│ └── quality_score: 0.92

├── overall_quality_score: 0.91

└── insights:
├── Insight("revenue has positive skew (2.3)", severity="WARNING")
├── Insight("15 outliers detected in revenue", severity="WARNING")
└── Insight("region has very low cardinality - consider as a filter", severity="INFO")

Profile a Single Dimension

profile = await service.profile_dimension(
    dimension_name="region",
    values=region_values,
    dim_type="STRING"
)

# Access profile data
print(f"Unique values: {profile.unique_count}")
print(f"Top values: {profile.top_values}")
print(f"Quality score: {profile.quality_score}")

Profile a Single Measure

profile = await service.profile_measure(
    measure_name="revenue",
    values=revenue_values,
    measure_type="NUMBER",
    config=ProfileConfig(
        compute_histograms=True,
        histogram_bins=20
    )
)

# Access statistics
print(f"Mean: {profile.mean}")
print(f"Median: {profile.median}")
print(f"Std Dev: {profile.std_dev}")
print(f"P99: {profile.percentiles[99]}")

API Endpoints

Profile Cube

POST /api/v1/profiling/cubes/<cube_name>/profile
Content-Type: application/json

{
  "config": {
    "sample_size": 10000,
    "compute_histograms": true,
    "histogram_bins": 20,
    "compute_frequencies": true,
    "top_n_values": 10,
    "detect_insights": true
  },
  "filters": {
    "date_range": {
      "start": "2024-01-01",
      "end": "2024-01-31"
    }
  }
}
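
As a sketch, this endpoint can be called with the standard library alone. The base URL is a hypothetical local deployment; substitute your own host and any auth headers your installation requires:

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # hypothetical deployment URL

payload = {
    "config": {"sample_size": 10000, "detect_insights": True},
    "filters": {"date_range": {"start": "2024-01-01", "end": "2024-01-31"}},
}

# Build the POST request against the cube-profile endpoint.
req = urllib.request.Request(
    f"{BASE}/api/v1/profiling/cubes/Orders/profile",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment to send the request against a live server:
# profile = json.load(urllib.request.urlopen(req))
```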

Get Profile

GET /api/v1/profiling/cubes/<cube_name>/profile?
include_histograms=true&
include_insights=true

Profile Dimension

POST /api/v1/profiling/dimensions/<cube>/<dimension>/profile

Get Quality Score

GET /api/v1/profiling/cubes/<cube_name>/quality-score

Sampling Strategies

For large datasets, profiling samples the data rather than scanning every row, using a configurable strategy:

Random Sampling (Default)

from olytix_core.profiling.sampling.strategies import RandomSampler

sampler = RandomSampler(seed=42) # Reproducible sampling
service = ProfilingService(sampler=sampler)

Stratified Sampling

Ensures representation across dimension values:

from olytix_core.profiling.sampling.strategies import StratifiedSampler

sampler = StratifiedSampler(
    stratify_by="region",
    min_per_stratum=100
)

Systematic Sampling

Every nth record:

from olytix_core.profiling.sampling.strategies import SystematicSampler

sampler = SystematicSampler(interval=10) # Every 10th record
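
The underlying logic is just a stride over the records, usually starting at a random offset so the sample is not biased toward the first row. A sketch (`systematic_sample` is illustrative, not the library's code):

```python
import random

def systematic_sample(records, interval, start=None):
    """Take every `interval`-th record, beginning at a random offset (sketch)."""
    if start is None:
        start = random.randrange(interval)
    return records[start::interval]

# Fixed start=0 for a deterministic example.
sample = systematic_sample(list(range(100)), interval=10, start=0)
```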

Quality Scoring Details

Completeness Score

completeness = (total_count - null_count) / total_count

# Example: 9500 non-null out of 10000
# Completeness = 0.95

Validity Score

For numeric measures:

validity = values_in_expected_range / total_count

# Example: Revenue should be > 0
# 9800 positive values out of 10000 = 0.98

For string dimensions:

validity = values_matching_pattern / total_count

# Example: Email format validation
# 9200 valid emails out of 10000 = 0.92

Uniqueness Score

# For dimensions that should be unique (like IDs):
uniqueness = unique_count / total_count

# For dimensions with expected cardinality:
uniqueness = 1 - abs(expected_cardinality - actual_cardinality) / expected_cardinality

Overall Score

overall = (
    completeness * 0.30 +
    validity * 0.25 +
    uniqueness * 0.20 +
    consistency * 0.15 +
    accuracy * 0.10
)
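
Plugging in illustrative component scores (the values below are hypothetical, not from a real profile) shows how the weights combine:

```python
# Hypothetical component scores for a single field.
scores = {"completeness": 0.95, "validity": 0.92, "uniqueness": 0.85,
          "consistency": 0.90, "accuracy": 0.88}
weights = {"completeness": 0.30, "validity": 0.25, "uniqueness": 0.20,
           "consistency": 0.15, "accuracy": 0.10}

# Weighted sum: 0.285 + 0.23 + 0.17 + 0.135 + 0.088 = 0.908
overall = sum(scores[k] * weights[k] for k in weights)
```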

Insight Detection

Outlier Detection

# Using the IQR method; statistics.quantiles returns the three quartiles
import statistics

q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_bound or v > upper_bound]

Skewness Detection

if abs(skewness) > 2:
    insight = Insight(
        message=f"Distribution is highly skewed ({skewness:.2f})",
        severity="WARNING",
        recommendation="Consider log transformation for analysis"
    )

Cardinality Analysis

cardinality_ratio = unique_count / total_count

if cardinality_ratio > 0.9:
    insight = Insight(
        message="High cardinality - may not be suitable as a dimension",
        severity="WARNING"
    )
elif cardinality_ratio < 0.001:
    insight = Insight(
        message="Very low cardinality - consider as a filter",
        severity="INFO"
    )

Configuration Options

ProfileConfig:
├── sample_size: int = 10000 # Max rows to sample
├── compute_histograms: bool = True
├── histogram_bins: int = 20
├── compute_frequencies: bool = True
├── top_n_values: int = 10
├── compute_percentiles: bool = True
├── percentile_list: list = [25, 50, 75, 90, 95, 99]
├── detect_insights: bool = True
├── insight_sensitivity: float = 0.8
└── quality_thresholds: dict = {
        "completeness_warning": 0.95,
        "validity_warning": 0.90,
        "uniqueness_warning": 0.80
    }

Best Practices

When to Profile

  1. New data sources: Profile before integrating
  2. Schema changes: Profile after modifications
  3. Data quality issues: Profile to diagnose problems
  4. Regular monitoring: Schedule periodic profiling

Interpreting Results

  1. Quality scores below 0.8: Investigate data issues
  2. High outlier counts: Review data validation rules
  3. Unexpected cardinality: Check for data integrity issues
  4. Pattern mismatches: Verify data transformations

Performance Tips

  1. Use sampling for large datasets (>100k rows)
  2. Profile incrementally for time-series data
  3. Cache profiles for frequently accessed cubes
  4. Schedule off-peak for full profiles

Next Steps