Representativeness Metric
The Representativeness metric compares the distribution of your data to a target reference distribution (such as Normal or Uniform). Use it to check that a dataset conforms to a given specification or requirement.
What It Measures
Representativeness checks if your data follows a known distribution. Use it to:
- Validate synthetic data — Ensure generated data matches expected patterns
- Check for data drift — Detect changes in data distribution over time
- Ensure balanced datasets — Verify class distributions in classification tasks
Available Statistical Tests
| Test | What It Does | Best For |
|---|---|---|
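| Chi-Square (χ²) | Goodness-of-fit test comparing observed bin counts with the counts expected under the reference distribution | Categorical/binned data |
| Kolmogorov-Smirnov (KS) | Non-parametric test comparing the empirical CDF with the reference CDF | Continuous distributions |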
| Shannon Entropy | Measures information diversity | General diversity |
| GRTE | Granular Relative Theoretical Entropy | Complex distributions (Confiance.ai) |
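As a rough illustration of what two of these tests compute, the sketch below implements a one-sample KS statistic and a histogram-based Shannon entropy using only the Python standard library. It is not the dqm-ml-core implementation: the function names `ks_statistic` and `shannon_entropy` are hypothetical, and p-values are left out because they normally come from a statistics library (for example `scipy.stats.kstest`).

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """One-sample KS statistic: largest gap between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a histogram given as a list of bin counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(2000)]
uniform_sample = [rng.uniform(-3.0, 3.0) for _ in range(2000)]

reference = NormalDist(mu=0.0, sigma=1.0)
d_normal = ks_statistic(normal_sample, reference.cdf)
d_uniform = ks_statistic(uniform_sample, reference.cdf)

print(f"KS vs Normal(0, 1): normal sample={d_normal:.3f}, uniform sample={d_uniform:.3f}")
print(f"Entropy of a flat 4-bin histogram: {shannon_entropy([5, 5, 5, 5])} bits")
```

The KS statistic stays small for the sample drawn from the reference distribution and grows for the Uniform sample. Entropy is highest when the bins are equally filled and drops to zero when all values fall in one bin.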
Use Cases
- Validate that training data matches expected distributions
- Detect distribution shifts in production data
- Check if data augmentation preserved original distribution
- Verify synthetic data quality
Processor Information
- Class: RepresentativenessProcessor
- Package: dqm-ml-core
- Type Name: representativeness
Configuration Parameters
- input_columns: Numeric columns to analyze.
- distribution: Reference distribution name (normal, uniform).
- metrics: List of statistical tests to run (chi-square, kolmogorov-smirnov, shannon-entropy, grte).
- bins: Number of bins for histogram-based tests (default: 10).
- distribution_params: (Optional) Parameters for the reference distribution (e.g., mean, std for normal).
Example YAML Configuration
metrics_processor:
  dist_check:
    type: representativeness
    input_columns: ["feature_x"]
    distribution: "normal"
    metrics: ["chi-square", "kolmogorov-smirnov"]
    bins: 30
    distribution_params:
      mean: 0.0
      std: 1.0
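To make the roles of bins and distribution_params more concrete, here is a minimal sketch of a binned chi-square statistic against a Normal reference, written with the standard library only. It is an assumption about how such a test can be set up, not the processor's actual code: `chi_square_statistic` is a hypothetical helper, the bins are equal-width over the sample range, and the expected counts are renormalised to that range.

```python
import random
from statistics import NormalDist


def chi_square_statistic(sample, dist, bins=10):
    """Pearson chi-square statistic of `sample` against `dist` over equal-width bins."""
    lo, hi = min(sample), max(sample)
    if hi == lo:
        raise ValueError("sample has zero range; cannot build bins")
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]

    # Observed counts per bin; the maximum value is clamped into the last bin.
    observed = [0] * bins
    for x in sample:
        idx = min(int((x - lo) / width), bins - 1)
        observed[idx] += 1

    # Expected counts from the reference CDF, renormalised to the sampled range
    # so that observed and expected totals match.
    mass = [dist.cdf(edges[i + 1]) - dist.cdf(edges[i]) for i in range(bins)]
    total_mass = sum(mass)
    n = len(sample)
    expected = [n * m / total_mass for m in mass]

    # Bins with zero expected count are skipped to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


params = {"mean": 0.0, "std": 1.0}  # mirrors distribution_params in the YAML
reference = NormalDist(mu=params["mean"], sigma=params["std"])

rng = random.Random(1)  # fixed seed so the sketch is reproducible
matching = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

stat_matching = chi_square_statistic(matching, reference, bins=30)
stat_shifted = chi_square_statistic(shifted, reference, bins=30)
print(f"chi-square: matching={stat_matching:.1f}, shifted={stat_shifted:.1f}")
```

A sample drawn from the reference distribution gives a statistic on the order of the number of bins, while a sample whose mean is shifted by one standard deviation gives a much larger value. More bins resolve finer differences but leave fewer points per bin, which makes the tail bins noisier.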
Output
The processor returns a dictionary with scores for each requested metric and column:
- <column_name>_<metric_name>_score: The statistical test score.
- <column_name>_<metric_name>_pvalue: The p-value associated with the test.
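The key pattern above can be used to pick results out of the returned dictionary. The sketch below is illustrative only: `result_key` and `rejected_tests` are hypothetical helpers, the exact spelling of the metric name inside the keys (hyphens versus underscores) is an assumption, and the numbers are placeholders rather than real processor output.

```python
def result_key(column, metric, kind):
    """Build an output key following the <column_name>_<metric_name>_<kind> pattern."""
    if kind not in ("score", "pvalue"):
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{column}_{metric}_{kind}"


def rejected_tests(results, column, metrics, alpha=0.05):
    """Return the metrics whose p-value is present in `results` and below `alpha`."""
    flagged = []
    for metric in metrics:
        pvalue = results.get(result_key(column, metric, "pvalue"))
        if pvalue is not None and pvalue < alpha:
            flagged.append(metric)
    return flagged


# Placeholder values for illustration only; not real processor output.
example_results = {
    "feature_x_chi-square_score": 41.7,
    "feature_x_chi-square_pvalue": 0.03,
    "feature_x_kolmogorov-smirnov_score": 0.02,
    "feature_x_kolmogorov-smirnov_pvalue": 0.61,
}

flagged = rejected_tests(
    example_results, "feature_x", ["chi-square", "kolmogorov-smirnov"]
)
print(flagged)
```

Looking keys up with `dict.get` means that metrics without a p-value entry in the results are skipped rather than raising an error.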