# Metrics Guide
DQM-ML provides several types of metrics to assess different aspects of data quality. This guide helps you choose the right metric for your needs.
## Quick Decision Guide

Not sure which metric you need? Use this table:

| If you need to... | Use this metric | Complexity |
|---|---|---|
| Find missing values | Completeness | Low (CPU only) |
| Check if data matches a distribution | Representativeness | Low (CPU only) |
| Compare train/test distributions | Domain Gap | High (requires PyTorch) |
| Check image quality | Visual Features | Medium (CPU only) |
## Complexity Guide

- Low (CPU only): Completeness, Representativeness — runs on any machine
- Medium: Visual Features — requires OpenCV, but no GPU needed
- High (GPU recommended): Domain Gap — uses PyTorch, faster with a GPU
## The Math Behind the Metrics

Each metric is based on established statistical methods:

- Completeness: ratio of non-null values
- Representativeness: χ² (Chi-Square), KS (Kolmogorov-Smirnov), Shannon Entropy, GRTE
- Domain Gap: MMD (Maximum Mean Discrepancy), FID (Fréchet Inception Distance), Wasserstein distance
- Visual Features: Laplacian variance, histogram entropy
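To make a few of these concrete, here is a sketch of the underlying statistics on toy data, using NumPy and SciPy directly rather than the DQM-ML API:

```python
# Illustrative computation of some statistics named above (not the DQM-ML API).
import numpy as np
from scipy import stats

# Completeness: ratio of non-null values
data = np.array([1.0, np.nan, 3.0, 4.0])
completeness = np.count_nonzero(~np.isnan(data)) / data.size  # 0.75

# Representativeness: KS test of a sample against a standard normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)
ks_stat, p_value = stats.kstest(sample, "norm")

# Shannon entropy of the sample's histogram (in bits)
counts, _ = np.histogram(sample, bins=20)
probs = counts[counts > 0] / counts.sum()
entropy = -np.sum(probs * np.log2(probs))
```

The KS statistic measures the largest gap between the empirical and reference cumulative distributions; values near zero indicate the sample is consistent with the target distribution.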
## Available Metrics by Package

### Core Metrics (dqm-ml-core)

These are the most commonly used metrics for tabular data quality:

- Completeness - Checks for missing/null values in your data
- Representativeness - Validates that data follows an expected distribution (Normal, Uniform)
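Conceptually, a completeness check reduces to the fraction of non-null values per column. A minimal pandas sketch (illustrative only, not the DQM-ML API):

```python
# Per-column completeness as the fraction of non-null values (illustration).
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [50000, 62000, None, None],
})

# notna() yields booleans; their per-column mean is the completeness ratio
completeness = df.notna().mean()
# completeness["age"] == 0.75, completeness["income"] == 0.5
```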
### Visual Metrics (dqm-ml-images)

For analyzing image datasets:

- Visual Features - Extracts image quality indicators like brightness, contrast, sharpness, and entropy
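The indicators listed above can be sketched with NumPy and SciPy on a synthetic grayscale image; this shows the underlying quantities, not DQM-ML's implementation:

```python
# Illustrative image-quality indicators on a synthetic grayscale image.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

brightness = image.mean()                 # mean pixel intensity
contrast = image.std()                    # spread of intensities
sharpness = ndimage.laplace(image).var()  # Laplacian variance (blur detection)

counts, _ = np.histogram(image, bins=256, range=(0, 256))
probs = counts[counts > 0] / counts.sum()
entropy = -np.sum(probs * np.log2(probs))  # histogram entropy, at most 8 bits
```

Low Laplacian variance flags blurry images, since blurring suppresses the high-frequency edges the Laplacian responds to.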
### Advanced Metrics (dqm-ml-pytorch)

For comparing datasets using deep learning embeddings:

- Domain Gap - Measures statistical distance between two datasets (useful for detecting data drift)
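As a minimal sketch of the idea behind MMD, one of the domain-gap distances, here is a biased RBF-kernel estimator written in NumPy for readability (DQM-ML's domain-gap metrics operate on PyTorch embeddings):

```python
# Biased MMD^2 estimate with an RBF kernel (illustration, not the DQM-ML API).
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise squared distances, then the Gaussian kernel
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# shifted distributions yield a larger MMD than matched ones
```

A near-zero MMD means the two samples are statistically indistinguishable under the kernel; a large value signals drift between, say, training and production data.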
## How Metrics Are Configured

Each metric is configured in the `metrics_processor` section of your YAML config. See the Configuration Guide for details.
Each metric page has:

- Configuration parameters
- Example YAML config
- Output format
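As a purely hypothetical illustration of the shape such a section might take (the field names below are invented; consult the Configuration Guide for the real schema):

```yaml
# Hypothetical example only — see the Configuration Guide for actual keys.
metrics_processor:
  completeness:
    columns: ["age", "income"]
  representativeness:
    distribution: normal
```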