Representativeness Metric

The Representativeness metric compares the distribution of your data to a reference distribution (such as Normal or Uniform). Use it to check whether a dataset conforms to a given distributional specification or requirement.

What It Measures

Representativeness checks if your data follows a known distribution. Use it to:

  • Validate synthetic data — Ensure generated data matches expected patterns
  • Check for data drift — Detect changes in data distribution over time
  • Ensure balanced datasets — Verify class distributions in classification tasks

Available Statistical Tests

Test | What It Does | Best For
Chi-Square (χ²) | Goodness-of-fit test | Categorical/binned data
Kolmogorov-Smirnov (KS) | Non-parametric test | Continuous distributions
Shannon Entropy | Measures information diversity | General diversity
GRTE | Granular Relative Theoretical Entropy | Complex distributions (Confiance.ai)
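
To make the first two tests in the table concrete, here is a minimal standard-library sketch of the Kolmogorov-Smirnov and chi-square statistics against a Normal reference. It is not the processor's implementation: the function names, the bin edges, and the sample sizes are illustrative assumptions. In practice, `scipy.stats.kstest` and `scipy.stats.chisquare` return both the statistic and a p-value.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def chi_square_statistic(sample, cdf, edges):
    """Sum of (observed - expected)^2 / expected over the bins given by `edges`."""
    n = len(sample)
    stat = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        observed = sum(1 for x in sample if lo <= x < hi)
        expected = n * (cdf(hi) - cdf(lo))
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat


# Illustrative data: one sample that matches the reference, one that does not.
random.seed(0)
reference = NormalDist(mu=0.0, sigma=1.0)
edges = [-math.inf, -2, -1, 0, 1, 2, math.inf]  # assumed bin edges
normal_sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
uniform_sample = [random.uniform(-3.0, 3.0) for _ in range(2000)]

ks_normal = ks_statistic(normal_sample, reference.cdf)
ks_uniform = ks_statistic(uniform_sample, reference.cdf)
chi_normal = chi_square_statistic(normal_sample, reference.cdf, edges)
chi_uniform = chi_square_statistic(uniform_sample, reference.cdf, edges)
```

Both statistics grow as the data move away from the reference, so the uniform sample scores markedly higher than the normal one on each test.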

Use Cases

  • Validate that training data matches expected distributions
  • Detect distribution shifts in production data
  • Check if data augmentation preserved original distribution
  • Verify synthetic data quality

Processor Information

  • Class: RepresentativenessProcessor
  • Package: dqm-ml-core
  • Type Name: representativeness

Configuration Parameters

  • input_columns: Numeric columns to analyze.
  • distribution: Reference distribution name (normal, uniform).
  • metrics: List of statistical tests to run (chi-square, kolmogorov-smirnov, shannon-entropy, grte).
  • bins: Number of bins for histogram-based tests (default: 10).
  • distribution_params: (Optional) Parameters for the reference distribution (e.g., mean, std for normal).
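
The effect of `bins` on a histogram-based measure can be sketched with Shannon entropy. The example below assumes equal-width bins spanning the observed range; how the processor actually bins data is not stated here, and `shannon_entropy` is a hypothetical helper, not a library function.

```python
import math
import random


def shannon_entropy(sample, bins=10):
    """Entropy (in bits) of an equal-width histogram of `sample`."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data
    counts = [0] * bins
    for x in sample:
        # Clamp so that the maximum value falls in the last bin.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(sample)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


random.seed(0)
flat = [random.random() for _ in range(10_000)]           # spread evenly
peaked = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # concentrated

h_flat = shannon_entropy(flat, bins=10)
h_peaked = shannon_entropy(peaked, bins=10)
h_max = math.log2(10)  # upper bound for 10 bins
```

Evenly spread data approach the upper bound of log2(bins) bits, while data concentrated in a few bins score lower. Changing `bins` changes both the bound and the resolution of the comparison.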

Example YAML Configuration

metrics_processor:
  dist_check:
    type: representativeness
    input_columns: ["feature_x"]
    distribution: "normal"
    metrics: ["chi-square", "kolmogorov-smirnov"]
    bins: 30
    distribution_params:
      mean: 0.0
      std: 1.0
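
Before running a pipeline, it can help to sanity-check an entry like the one above. The following sketch mirrors the `dist_check` entry as a Python dictionary and validates it against the values listed under Configuration Parameters. `check_representativeness_config` is a hypothetical helper, not part of dqm-ml-core, and the library may accept values beyond those documented here.

```python
ALLOWED_DISTRIBUTIONS = {"normal", "uniform"}
ALLOWED_METRICS = {"chi-square", "kolmogorov-smirnov", "shannon-entropy", "grte"}


def check_representativeness_config(cfg):
    """Return a list of problems found in one processor entry (empty if none)."""
    problems = []
    if cfg.get("type") != "representativeness":
        problems.append("type must be 'representativeness'")
    if not cfg.get("input_columns"):
        problems.append("input_columns must list at least one column")
    if cfg.get("distribution") not in ALLOWED_DISTRIBUTIONS:
        problems.append(f"unknown distribution: {cfg.get('distribution')!r}")
    unknown = set(cfg.get("metrics", [])) - ALLOWED_METRICS
    if unknown:
        problems.append(f"unknown metrics: {sorted(unknown)}")
    bins = cfg.get("bins", 10)  # documented default is 10
    if not isinstance(bins, int) or bins < 1:
        problems.append("bins must be a positive integer")
    return problems


# The YAML entry above, written as a plain dictionary.
dist_check = {
    "type": "representativeness",
    "input_columns": ["feature_x"],
    "distribution": "normal",
    "metrics": ["chi-square", "kolmogorov-smirnov"],
    "bins": 30,
    "distribution_params": {"mean": 0.0, "std": 1.0},
}

problems = check_representativeness_config(dist_check)
```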

Output

The processor returns a dictionary with scores for each requested metric and column:

  • <column_name>_<metric_name>_score: The statistical test score.
  • <column_name>_<metric_name>_pvalue: The p-value associated with the test.
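
A result dictionary with this key pattern can be scanned for tests that reject the reference distribution. The sketch below is illustrative only: `flag_rejections` is a hypothetical helper, the exact spelling of metric names inside the keys is an assumption, and the numbers are made up.

```python
def flag_rejections(results, alpha=0.05):
    """Return {prefix: pvalue} for every '<...>_pvalue' entry below alpha."""
    suffix = "_pvalue"
    return {
        key[: -len(suffix)]: value
        for key, value in results.items()
        if key.endswith(suffix) and value < alpha
    }


# Made-up values, for illustration only; key spelling is assumed.
example_results = {
    "feature_x_chi-square_score": 41.7,
    "feature_x_chi-square_pvalue": 0.003,
    "feature_x_kolmogorov-smirnov_score": 0.021,
    "feature_x_kolmogorov-smirnov_pvalue": 0.48,
}

flagged = flag_rejections(example_results)
```

A p-value below the chosen threshold `alpha` suggests that the column departs from the reference distribution under that test. A larger p-value means only that the test found no evidence against the reference.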