Data Profiling
Olytix Core's data profiling service helps you understand your data deeply through automated statistical analysis, quality scoring, and insight detection.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│                       DATA PROFILING PIPELINE                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   SAMPLING   │───▶│  STATISTICS  │───▶│ DISTRIBUTIONS│           │
│  │              │    │              │    │              │           │
│  │ Random       │    │ Numeric      │    │ Histograms   │           │
│  │ Stratified   │    │ String       │    │ Frequencies  │           │
│  │ Systematic   │    │ Temporal     │    │ Percentiles  │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│                                                 │                   │
│                                                 ▼                   │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│  │   PROFILE    │◀───│   QUALITY    │◀───│   INSIGHTS   │           │
│  │    OUTPUT    │    │   SCORING    │    │  DETECTION   │           │
│  │              │    │              │    │              │           │
│  │ CubeProfile  │    │ Completeness │    │ Patterns     │           │
│  │ Dimension    │    │ Validity     │    │ Outliers     │           │
│  │ Measure      │    │ Consistency  │    │ Anomalies    │           │
│  └──────────────┘    └──────────────┘    └──────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key Features
Statistical Profiling
Comprehensive statistics for different data types:
Numeric Data:
- Count, min, max, mean, median
- Standard deviation, variance
- Percentiles (25th, 50th, 75th, 90th, 95th, 99th)
- Skewness and kurtosis
- Zero count, negative count
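The numeric statistics above can be sketched with the standard library. This is an illustrative helper (the service computes these internally; `numeric_stats` is a hypothetical name):

```python
import statistics

def numeric_stats(values):
    """Compute the core numeric statistics listed above."""
    clean = [v for v in values if v is not None]
    qs = statistics.quantiles(clean, n=100)  # 99 cut points: qs[k-1] approximates the kth percentile
    return {
        "count": len(clean),
        "min": min(clean),
        "max": max(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "std_dev": statistics.stdev(clean),
        "percentiles": {p: qs[p - 1] for p in (25, 50, 75, 90, 95, 99)},
        "zero_count": sum(1 for v in clean if v == 0),
        "negative_count": sum(1 for v in clean if v < 0),
    }

stats = numeric_stats([10, 20, 20, 30, 40, 50, 0, -5])
```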
String Data:
- Count, unique count, cardinality ratio
- Min/max/average length
- Empty count, null count
- Pattern detection (emails, phones, etc.)
- Most frequent values
Temporal Data:
- Count, min date, max date
- Date range and span
- Gaps in time series
- Day of week distribution
- Seasonality indicators
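Gap detection in a time series reduces to comparing consecutive dates against an expected step. A minimal sketch (`find_gaps` is a hypothetical helper, not the service's internal implementation):

```python
from datetime import date, timedelta

def find_gaps(dates, max_step=timedelta(days=1)):
    """Return (start, end) pairs where consecutive dates are further apart than max_step."""
    ordered = sorted(dates)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_step]

gaps = find_gaps([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)])
# one gap: Jan 2 -> Jan 5
```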
Distribution Analysis
Distribution Types:
├── Histogram: Bucket numeric values into ranges
├── Frequency: Count occurrences of each value
├── Percentile: Distribution across percentiles
└── Time Series: Distribution over time periods
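The Histogram distribution buckets numeric values into equal-width ranges. A minimal sketch of the bucketing logic (illustrative; the service's binning may differ):

```python
def histogram(values, bins=20):
    """Bucket numeric values into equal-width (start, end, count) ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid zero-width buckets for constant data
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bucket
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]

buckets = histogram([1, 2, 3, 4, 10], bins=3)
```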
Quality Scoring
Each profile includes a quality score with components:
| Component | Description | Weight |
|---|---|---|
| Completeness | Percentage of non-null values | 30% |
| Validity | Values matching expected format/range | 25% |
| Uniqueness | Appropriate cardinality for the field | 20% |
| Consistency | Values following consistent patterns | 15% |
| Accuracy | Values within expected bounds | 10% |
Insight Detection
Automatically detect patterns and issues:
- Outliers: Values far from the norm
- Gaps: Missing periods in time series
- Skew: Highly skewed distributions
- High Cardinality: Dimensions with too many unique values
- Low Cardinality: Potential for optimization
- Patterns: Detected formats (dates, IDs, etc.)
Usage
Profile a Cube
from olytix_core.profiling.service import ProfilingService
from olytix_core.profiling.profiles.models import ProfileConfig
service = ProfilingService()
# Profile a cube with sample data
profile = await service.profile_cube(
cube_name="Orders",
data=orders_data, # List of dicts
dimension_names=["region", "product_category", "customer_type"],
measure_names=["revenue", "quantity", "discount"],
dimension_types={
"region": "STRING",
"product_category": "STRING",
"customer_type": "STRING"
},
measure_types={
"revenue": "NUMBER",
"quantity": "NUMBER",
"discount": "NUMBER"
},
config=ProfileConfig(
sample_size=10000,
compute_histograms=True,
compute_frequencies=True,
detect_insights=True
)
)
Profile Output
CubeProfile:
├── cube_name: "Orders"
├── row_count: 150000
├── profiled_at: "2024-01-15T10:30:00Z"
├── sample_size: 10000
│
├── dimension_profiles:
│ ├── region:
│ │ ├── type: "STRING"
│ │ ├── count: 10000
│ │ ├── unique_count: 5
│ │ ├── null_count: 0
│ │ ├── top_values: [("North", 3500), ("South", 2800), ...]
│ │ └── quality_score: 0.95
│ │
│ └── product_category:
│ ├── type: "STRING"
│ ├── count: 10000
│ ├── unique_count: 25
│ └── ...
│
├── measure_profiles:
│ └── revenue:
│ ├── type: "NUMBER"
│ ├── count: 10000
│ ├── null_count: 0
│ ├── min: 10.50
│ ├── max: 15000.00
│ ├── mean: 450.25
│ ├── median: 325.00
│ ├── std_dev: 380.50
│ ├── percentiles: {25: 150, 50: 325, 75: 600, 90: 950, 99: 3500}
│ ├── histogram: [Bucket(0-100, 1500), Bucket(100-500, 5500), ...]
│ └── quality_score: 0.92
│
├── overall_quality_score: 0.91
│
└── insights:
├── Insight("revenue has positive skew (2.3)", severity="INFO")
├── Insight("15 outliers detected in revenue", severity="WARNING")
└── Insight("product_category has high cardinality", severity="INFO")
Profile a Single Dimension
profile = await service.profile_dimension(
dimension_name="region",
values=region_values,
dim_type="STRING"
)
# Access profile data
print(f"Unique values: {profile.unique_count}")
print(f"Top values: {profile.top_values}")
print(f"Quality score: {profile.quality_score}")
Profile a Single Measure
profile = await service.profile_measure(
measure_name="revenue",
values=revenue_values,
measure_type="NUMBER",
config=ProfileConfig(
compute_histograms=True,
histogram_bins=20
)
)
# Access statistics
print(f"Mean: {profile.mean}")
print(f"Median: {profile.median}")
print(f"Std Dev: {profile.std_dev}")
print(f"P99: {profile.percentiles[99]}")
API Endpoints
Profile Cube
POST /api/v1/profiling/cubes/<cube_name>/profile
Content-Type: application/json
{
"config": {
"sample_size": 10000,
"compute_histograms": true,
"histogram_bins": 20,
"compute_frequencies": true,
"top_n_values": 10,
"detect_insights": true
},
"filters": {
"date_range": {
"start": "2024-01-01",
"end": "2024-01-31"
}
}
}
Get Profile
GET /api/v1/profiling/cubes/<cube_name>/profile?
include_histograms=true&
include_insights=true
Profile Dimension
POST /api/v1/profiling/dimensions/<cube>/<dimension>/profile
Get Quality Score
GET /api/v1/profiling/cubes/<cube_name>/quality-score
Sampling Strategies
For large datasets, profiling uses intelligent sampling:
Random Sampling (Default)
from olytix_core.profiling.sampling.strategies import RandomSampler
sampler = RandomSampler(seed=42) # Reproducible sampling
service = ProfilingService(sampler=sampler)
Stratified Sampling
Ensures representation across dimension values:
from olytix_core.profiling.sampling.strategies import StratifiedSampler
sampler = StratifiedSampler(
stratify_by="region",
min_per_stratum=100
)
Systematic Sampling
Every nth record:
from olytix_core.profiling.sampling.strategies import SystematicSampler
sampler = SystematicSampler(interval=10) # Every 10th record
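The systematic strategy reduces to slicing every nth record, which can be sketched in one line (illustrative only, not the sampler's internals):

```python
def systematic_sample(records, interval):
    """Take every `interval`-th record, starting from the first."""
    return records[::interval]

sample = systematic_sample(list(range(100)), interval=10)
# 10 records: 0, 10, 20, ..., 90
```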
Quality Scoring Details
Completeness Score
completeness = (total_count - null_count) / total_count
# Example: 9500 non-null out of 10000
# Completeness = 0.95
Validity Score
For numeric measures:
validity = values_in_expected_range / total_count
# Example: Revenue should be > 0
# 9800 positive values out of 10000 = 0.98
For string dimensions:
validity = values_matching_pattern / total_count
# Example: Email format validation
# 9200 valid emails out of 10000 = 0.92
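The email-format example above could be computed with a simple regex check. The pattern and `string_validity` helper below are illustrative assumptions; the service's pattern detectors may differ:

```python
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose email pattern

def string_validity(values, pattern=EMAIL):
    """Fraction of values that are non-null and match the expected pattern."""
    matching = sum(1 for v in values if v and pattern.match(v))
    return matching / len(values)

score = string_validity(["a@b.com", "c@d.org", "not-an-email", None])
# 2 valid out of 4 -> 0.5
```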
Uniqueness Score
# For dimensions that should be unique (like IDs):
uniqueness = unique_count / total_count
# For dimensions with expected cardinality:
uniqueness = 1 - abs(expected_cardinality - actual_cardinality) / expected_cardinality
Overall Score
overall = (
completeness * 0.30 +
validity * 0.25 +
uniqueness * 0.20 +
consistency * 0.15 +
accuracy * 0.10
)
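Putting the component scores together as a runnable function, with the weights from the table above:

```python
def overall_score(completeness, validity, uniqueness, consistency, accuracy):
    """Weighted average of the five quality components (weights sum to 1.0)."""
    return (completeness * 0.30 + validity * 0.25 + uniqueness * 0.20
            + consistency * 0.15 + accuracy * 0.10)

score = overall_score(0.95, 0.98, 0.85, 0.90, 0.88)
# 0.285 + 0.245 + 0.17 + 0.135 + 0.088 = 0.923
```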
Insight Detection
Outlier Detection
# IQR method, with quartiles from the standard library
import statistics

q1, _, q3 = statistics.quantiles(values, n=4)  # n=4 yields the three quartile cut points
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]
Skewness Detection
if abs(skewness) > 2:
insight = Insight(
message=f"Distribution is highly skewed ({skewness:.2f})",
severity="WARNING",
recommendation="Consider log transformation for analysis"
)
Cardinality Analysis
cardinality_ratio = unique_count / total_count
if cardinality_ratio > 0.9:
insight = Insight(
message="High cardinality - may not be suitable as a dimension",
severity="WARNING"
)
elif cardinality_ratio < 0.001:
insight = Insight(
message="Very low cardinality - consider as a filter",
severity="INFO"
)
Configuration Options
ProfileConfig:
├── sample_size: int = 10000 # Max rows to sample
├── compute_histograms: bool = True
├── histogram_bins: int = 20
├── compute_frequencies: bool = True
├── top_n_values: int = 10
├── compute_percentiles: bool = True
├── percentile_list: list = [25, 50, 75, 90, 95, 99]
├── detect_insights: bool = True
├── insight_sensitivity: float = 0.8
└── quality_thresholds: dict = {
"completeness_warning": 0.95,
"validity_warning": 0.90,
"uniqueness_warning": 0.80
}
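For example, a stricter configuration for a quality-sensitive cube might look like the following (field names as listed above; the exact values are illustrative):

```python
from olytix_core.profiling.profiles.models import ProfileConfig

config = ProfileConfig(
    sample_size=50000,           # larger sample for tighter estimates
    histogram_bins=50,
    insight_sensitivity=0.9,     # flag insights more aggressively
    quality_thresholds={
        "completeness_warning": 0.99,
        "validity_warning": 0.95,
        "uniqueness_warning": 0.90,
    },
)
```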
Best Practices
When to Profile
- New data sources: Profile before integrating
- Schema changes: Profile after modifications
- Data quality issues: Profile to diagnose problems
- Regular monitoring: Schedule periodic profiling
Interpreting Results
- Quality scores below 0.8: Investigate data issues
- High outlier counts: Review data validation rules
- Unexpected cardinality: Check for data integrity issues
- Pattern mismatches: Verify data transformations
Performance Tips
- Use sampling for large datasets (>100k rows)
- Profile incrementally for time-series data
- Cache profiles for frequently accessed cubes
- Schedule off-peak for full profiles
Next Steps
- Query Assistant - Explore profiled data
- Anomaly Detection - Monitor data quality
- Data Tests - Automate quality checks