# DQM-ML Core

Core package for DQM-ML V2, providing the foundational API and standard metrics for data quality assessment.
## Installation

```bash
pip install dqm-ml-core
```
> **Note:** `dqm-ml-core` provides metric processors only; it has no CLI or job orchestration. Use it directly from Python, or with `dqm-ml-job` for YAML config execution.
## Quick Start

### Completeness Example

```python
from dqm_ml_core import CompletenessProcessor

processor = CompletenessProcessor(
    name="my_check",
    config={"input_columns": ["col_a", "col_b"]}
)

result = processor.compute({})
print(f"Completeness: {result['overall_completeness']}")
```
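Conceptually, completeness is just the fraction of non-null values per column. A minimal, library-independent sketch of that calculation with pandas (the toy data and column names are illustrative):

```python
import pandas as pd

# Toy frame with some missing values (illustrative data)
df = pd.DataFrame({
    "col_a": [1.0, None, 3.0, 4.0],
    "col_b": [None, None, 7.0, 8.0],
})

# Per-column completeness: share of non-null cells in each column
per_column = df.notna().mean()

# Overall completeness: share of non-null cells across all selected columns
overall = df.notna().to_numpy().mean()

print(per_column.to_dict())  # {'col_a': 0.75, 'col_b': 0.5}
print(overall)               # 0.625
```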
### Representativeness Example

```python
from dqm_ml_core import RepresentativenessProcessor
import numpy as np

# Create sample data (e.g., 1000 samples from a normal distribution)
data = np.random.randn(1000)

processor = RepresentativenessProcessor(
    name="dist_check",
    config={
        "input_columns": ["feature"],
        "distribution": "normal",
        "metrics": ["chi-square", "kolmogorov-smirnov"],
        "distribution_params": {"mean": 0.0, "std": 1.0}
    }
)

result = processor.compute({})
print(f"Chi-Square p-value: {result['feature_chi-square_pvalue']}")
print(f"KS p-value: {result['feature_kolmogorov-smirnov_pvalue']}")
```
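The goodness-of-fit tests behind these metrics can be cross-checked with scipy, which implements the same Kolmogorov-Smirnov statistic. A quick sketch (the seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

# Kolmogorov-Smirnov test of the sample against N(0, 1)
ks_stat, ks_pvalue = stats.kstest(data, "norm", args=(0.0, 1.0))

# A large p-value means no evidence against the reference distribution
print(f"KS statistic={ks_stat:.4f}, p-value={ks_pvalue:.4f}")
```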
## With dqm-ml-job

For running from a YAML config, install together with `dqm-ml-job`:

```bash
pip install dqm-ml-job dqm-ml-core
```
Then use this config:

```yaml
dataloaders:
  train:
    type: parquet
    path: data/train.parquet

metrics_processor:
  completeness:
    type: completeness
    input_columns: [col_a, col_b]
  representativeness:
    type: representativeness
    input_columns: [feature_x]
    distribution: "normal"
```
## Key Concepts

### DatametricProcessor

The base class for all metrics and feature extractors. It supports a streaming architecture by splitting computation into two phases:

- **Batch level:** `compute_batch_metric()` updates intermediate statistics for a single chunk of data.
- **Dataset level:** `compute()` aggregates these statistics into final scores.
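This two-phase pattern is the same one used for any streaming statistic. A self-contained illustration (not the actual `DatametricProcessor` API) that accumulates null counts per chunk and aggregates them at the end:

```python
import numpy as np

class StreamingCompleteness:
    """Illustrative two-phase accumulator; not the real base class."""

    def __init__(self):
        self.non_null = 0
        self.total = 0

    def compute_batch_metric(self, chunk: np.ndarray) -> None:
        # Batch level: update intermediate statistics for one chunk
        self.non_null += int(np.count_nonzero(~np.isnan(chunk)))
        self.total += chunk.size

    def compute(self) -> float:
        # Dataset level: aggregate the statistics into a final score
        return self.non_null / self.total

acc = StreamingCompleteness()
for chunk in (np.array([1.0, np.nan]), np.array([3.0, 4.0])):
    acc.compute_batch_metric(chunk)

print(acc.compute())  # 0.75 (3 non-null values out of 4)
```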
## Included Metrics
| Metric | Description |
|---|---|
| Completeness | Analyzes null/missing values in your dataset |
| Representativeness | Statistical distribution analysis (Chi-Square, KS, Shannon Entropy, GRTE) |
## For Developers

To create a new metric:

- Subclass `dqm_ml_core.api.data_processor.DatametricProcessor`.
- Define `needed_columns()`, `generated_features()`, and `generated_metrics()`.
- Implement the streaming logic in `compute_batch_metric()` and `compute()`.
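Putting those steps together, a skeleton custom metric might look like the sketch below. The base class is stubbed out so the example is self-contained, the metric itself (a running maximum) is hypothetical, and the real `DatametricProcessor` signatures may differ:

```python
import numpy as np

class DatametricProcessor:
    """Stub standing in for dqm_ml_core.api.data_processor.DatametricProcessor."""
    def __init__(self, name, config):
        self.name = name
        self.config = config

class MaxValueProcessor(DatametricProcessor):
    """Hypothetical metric: running maximum of one column."""

    def __init__(self, name, config):
        super().__init__(name, config)
        self._max = -np.inf

    def needed_columns(self):
        return self.config["input_columns"]

    def generated_features(self):
        return []

    def generated_metrics(self):
        return [f"{col}_max" for col in self.needed_columns()]

    def compute_batch_metric(self, batch):
        # Batch level: fold one chunk into the running maximum
        self._max = max(self._max, float(np.max(batch)))

    def compute(self):
        # Dataset level: emit the final metric values
        return {f"{self.needed_columns()[0]}_max": self._max}

proc = MaxValueProcessor(name="max_check", config={"input_columns": ["feature"]})
for batch in (np.array([1.0, 5.0]), np.array([3.0])):
    proc.compute_batch_metric(batch)

print(proc.compute())  # {'feature_max': 5.0}
```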
## Dependencies

DQM-ML is modular. For core metrics:

```bash
# Minimal: use as library only
pip install dqm-ml-core

# For YAML config execution
pip install dqm-ml-job dqm-ml-core

# Full stack with all metrics
pip install dqm-ml-job dqm-ml-core dqm-ml-images dqm-ml-pytorch
```