# Metrics Guide
DQM-ML provides several types of metrics to assess different aspects of data quality. This guide helps you choose the right metric for your needs.
## Quick Decision Guide

Not sure which metric you need? Use this table:

| If you need to... | Use this metric | Complexity |
|---|---|---|
| Find missing values | Completeness | Low (CPU only) |
| Check if data matches a distribution | Representativeness | Low (CPU only) |
| Compare train/test distributions | Domain Gap | High (requires PyTorch) |
| Check image quality | Visual Features | Medium (CPU only) |
## Complexity Guide

- Low (CPU only): Completeness, Representativeness — runs on any machine
- Medium: Visual Features — requires OpenCV, but no GPU needed
- High (GPU recommended): Domain Gap — uses PyTorch, faster with a GPU
## The Math Behind the Metrics

Each metric is based on established statistical methods:

- Completeness: ratio of non-null values
- Representativeness: χ² (Chi-Square), KS (Kolmogorov-Smirnov), Shannon Entropy, GRTE
- Domain Gap: MMD (Maximum Mean Discrepancy), FID (Fréchet Inception Distance), Wasserstein distance
- Visual Features: Laplacian variance, histogram entropy
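To make a few of these concrete, here is a sketch of the underlying statistics on toy data, using NumPy and SciPy directly rather than the DQM-ML API:

```python
# Illustrative computation of some statistics named above (not the DQM-ML API).
import numpy as np
from scipy import stats

# Completeness: ratio of non-null values
data = np.array([1.0, np.nan, 3.0, 4.0])
completeness = np.count_nonzero(~np.isnan(data)) / data.size  # 0.75

# Representativeness: KS test of a sample against a standard normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)
ks_stat, p_value = stats.kstest(sample, "norm")

# Shannon entropy of the sample's histogram (in bits)
counts, _ = np.histogram(sample, bins=20)
probs = counts[counts > 0] / counts.sum()
entropy = -np.sum(probs * np.log2(probs))
```

The KS statistic measures the largest gap between the empirical and reference cumulative distributions; values near zero indicate the sample is consistent with the target distribution.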
## Available Metrics by Package

### Core Metrics (dqm-ml-core)

These are the most commonly used metrics for tabular data quality:

- Completeness - Checks for missing/null values in your data
- Representativeness - Validates that data follows an expected distribution (Normal, Uniform)
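Conceptually, a completeness check reduces to the fraction of non-null values per column. A minimal pandas sketch (illustrative only, not the DQM-ML API):

```python
# Per-column completeness as the fraction of non-null values (illustration).
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [50000, 62000, None, None],
})

# notna() yields booleans; their per-column mean is the completeness ratio
completeness = df.notna().mean()
# completeness["age"] == 0.75, completeness["income"] == 0.5
```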
### Visual Metrics (dqm-ml-images)

For analyzing image datasets:

- Visual Features - Extracts image quality indicators like brightness, contrast, sharpness, and entropy
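The indicators listed above can be sketched with NumPy and SciPy on a synthetic grayscale image; this shows the underlying quantities, not DQM-ML's implementation:

```python
# Illustrative image-quality indicators on a synthetic grayscale image.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64)).astype(np.float64)

brightness = image.mean()                 # mean pixel intensity
contrast = image.std()                    # spread of intensities
sharpness = ndimage.laplace(image).var()  # Laplacian variance (blur detection)

counts, _ = np.histogram(image, bins=256, range=(0, 256))
probs = counts[counts > 0] / counts.sum()
entropy = -np.sum(probs * np.log2(probs))  # histogram entropy, at most 8 bits
```

Low Laplacian variance flags blurry images, since blurring suppresses the high-frequency edges the Laplacian responds to.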
### Advanced Metrics (dqm-ml-pytorch)

For comparing datasets using deep learning embeddings:

- Domain Gap - Measures statistical distance between two datasets (useful for detecting data drift)
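As a minimal sketch of the idea behind MMD, one of the domain-gap distances, here is a biased RBF-kernel estimator written in NumPy for readability (DQM-ML's domain-gap metrics operate on PyTorch embeddings):

```python
# Biased MMD^2 estimate with an RBF kernel (illustration, not the DQM-ML API).
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Pairwise squared distances, then the Gaussian kernel
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    # E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
shifted = mmd2(rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2)))
# shifted distributions yield a larger MMD than matched ones
```

A near-zero MMD means the two samples are statistically indistinguishable under the kernel; a large value signals drift between, say, training and production data.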
## How Metrics Are Configured

Each metric is configured in the `metrics_processor` section of your YAML config. See the Configuration Guide for details.
Each metric page has:

- Configuration parameters
- Example YAML config
- Output format
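As a purely hypothetical illustration of the shape such a section might take (the field names below are invented; consult the Configuration Guide for the real schema):

```yaml
# Hypothetical example only — see the Configuration Guide for actual keys.
metrics_processor:
  completeness:
    columns: ["age", "income"]
  representativeness:
    distribution: normal
```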