DQM-ML V2 Project Overview
This page is for developers who want to understand how DQM-ML is structured and how to work with the codebase. For a general introduction to what DQM-ML does, check out the Home page.
Package Architecture
The project is organized as a Python monorepo managed as a uv workspace.
Directory Structure
dqm-ml-workspace/
├── packages/
│ ├── dqm-ml-core/ # Core API & standard metrics
│ ├── dqm-ml-job/ # Pipeline orchestration & data loaders
│ ├── dqm-ml-images/ # Image feature extraction
│ ├── dqm-ml-pytorch/ # PyTorch-based metrics (Domain Gap)
│ └── dqm-ml/ # CLI wrapper & entry point
├── tests/ # Test suite
├── docs/ # Documentation
└── examples/ # Example configurations
Here's how the packages relate to each other:
flowchart TB
job[dqm-ml-job<br/>Orchestration] --> core[dqm-ml-core<br/>Core API]
images[dqm-ml-images<br/>Visual Features] --> core
pytorch[dqm-ml-pytorch<br/>PyTorch Metrics] --> core
cli[dqm-ml<br/>CLI Wrapper] --> job
cli --> core
cli --> images
cli --> pytorch
What Each Package Does
| Package | Purpose |
|---|---|
| dqm-ml-core | Defines the base DatametricProcessor class and core metrics (Completeness, Representativeness). The foundation everything else builds on. |
| dqm-ml-job | Handles the data pipeline: loading data, processing batches, and writing results. Think of it as the "engine room." |
| dqm-ml-images | Extracts visual features from images (luminosity, contrast, blur, entropy) - useful for checking image dataset quality. |
| dqm-ml-pytorch | Advanced metrics that need PyTorch, like Domain Gap (measuring difference between train/test distributions). |
| dqm-ml | The CLI entry point - what you use from the command line to run jobs. |
Key Technologies
DQM-ML uses these tools to be fast and reliable:
- uv: Fast Python package manager and workspace orchestrator
- PyArrow: Columnar data format enabling memory-efficient batch processing
- nox: Task runner for testing, linting, and documentation
- mkdocs-material: The beautiful documentation you're reading now
Building and Running
Here's how to get started with development:
Setup
# Synchronize workspace and install dependencies
uv sync
Running the CLI
The main entry point is the dqm-ml CLI:
# List available metrics and data loaders
uv run dqm-ml list
# Execute a pipeline from a configuration file
uv run dqm-ml process -p config.yaml
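The process command reads a pipeline configuration file. The actual schema is defined by dqm-ml-job; the sketch below is purely illustrative, and every key name in it is an assumption rather than the real format:

```yaml
# Illustrative only: actual keys and structure are defined by dqm-ml-job.
pipeline:
  loader: parquet          # which data loader to use (hypothetical name)
  source: data/train/      # where the input data lives
  metrics:                 # which metrics to compute
    - completeness
    - representativeness
  output: results/metrics.parquet
```

Run `uv run dqm-ml list` to see the metric and loader names your installation actually supports.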
Testing and Quality Checks
We maintain high code quality with automated checks:
# Run all tests, linting, and type checking
uv run nox
# Run specific checks
uv run nox -s test # Run tests
uv run nox -s lint # Check code style
uv run nox -s type_check # Type checking
uv run nox -s docs # Build documentation
Developing New Metrics
If you want to add a new metric (awesome!), here's how it works:
The Metric Processor Pattern
All metrics inherit from DatametricProcessor and implement the methods below (two required, two optional):
flowchart TB
F[compute_features] --> B[compute_batch_metric]
B --> C[compute]
C --> D[compute_delta]
| Method | Purpose | Required? |
|---|---|---|
| compute_features() | Extract per-sample features from raw data | Optional |
| compute_batch_metric() | Compute intermediate stats for one batch | Yes |
| compute() | Aggregate all batches into final metric | Yes |
| compute_delta() | Compare two datasets (e.g., train vs test) | Optional |
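The pattern above can be sketched as a toy metric. Note that the stand-in base class and all method signatures here are assumptions for illustration; the real DatametricProcessor lives in dqm-ml-core and may differ:

```python
from abc import ABC, abstractmethod

# Stand-in base class for illustration only. The real DatametricProcessor
# is defined in dqm-ml-core and its signatures may differ.
class DatametricProcessor(ABC):
    @abstractmethod
    def compute_batch_metric(self, batch): ...

    @abstractmethod
    def compute(self, batch_results): ...

class MeanValueMetric(DatametricProcessor):
    """Toy metric: mean of a numeric field, aggregated across batches."""

    def compute_features(self, sample):
        # Optional: extract a per-sample feature from raw data.
        return sample["value"]

    def compute_batch_metric(self, batch):
        # Required: intermediate stats for one batch (running sum and count).
        values = [self.compute_features(s) for s in batch]
        return {"sum": sum(values), "count": len(values)}

    def compute(self, batch_results):
        # Required: aggregate all batch results into the final metric.
        total = sum(r["sum"] for r in batch_results)
        count = sum(r["count"] for r in batch_results)
        return total / count if count else 0.0

    def compute_delta(self, metric_a, metric_b):
        # Optional: compare the metric across two datasets (e.g. train vs test).
        return abs(metric_a - metric_b)

metric = MeanValueMetric()
batches = [[{"value": 1.0}, {"value": 2.0}], [{"value": 3.0}]]
results = [metric.compute_batch_metric(b) for b in batches]
print(metric.compute(results))  # 2.0
```

Keeping compute_batch_metric() cheap and stateless is what lets the job runner stream large datasets batch by batch instead of loading everything into memory.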
Data Loading Pattern
Data loaders work in two tiers:
- DataLoader: Factory that discovers what data is available
- DataSelection: Handles iterating through a specific data subset in batches
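A minimal sketch of this two-tier shape, assuming in-memory records; the real interfaces in dqm-ml-job (which are PyArrow-based) will have different method names and signatures:

```python
from typing import Iterator, List

class DataSelection:
    """Second tier: iterates over one data subset in fixed-size batches."""

    def __init__(self, records: List[dict], batch_size: int = 2):
        self.records = records
        self.batch_size = batch_size

    def batches(self) -> Iterator[List[dict]]:
        # Yield slices of the subset so callers never hold it all at once.
        for i in range(0, len(self.records), self.batch_size):
            yield self.records[i:i + self.batch_size]

class DataLoader:
    """First tier: discovers available subsets and hands out selections."""

    def __init__(self, datasets: dict):
        self.datasets = datasets  # e.g. {"train": [...], "test": [...]}

    def available(self) -> List[str]:
        return list(self.datasets)

    def select(self, name: str) -> DataSelection:
        return DataSelection(self.datasets[name])

loader = DataLoader({"train": [{"id": i} for i in range(5)]})
print(loader.available())                                   # ['train']
print([len(b) for b in loader.select("train").batches()])   # [2, 2, 1]
```

Splitting discovery from iteration means a metric pipeline can enumerate subsets once, then stream each subset independently.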
Coding Standards
We keep the codebase consistent with these tools:
- Linting: ruff, with a 120-character line limit
- Type Checking: mypy, with strict mode enabled
- Formatting: ruff format, for a consistent code style
- Testing: pytest, with a 300-second timeout per test
Running Quality Checks
# Fix auto-fixable issues
uv run nox -s lint_fix
# Run type checking
uv run nox -s type_check