Why DQM-ML V2?

Curious about why we rebuilt DQM-ML from scratch? This page explains the design decisions behind V2 and how the streaming architecture works. For a general introduction, check out the Home page.

The Problem with V1

The original dqm-ml library worked well for small datasets, but had limitations:

  • Memory issues: Loading entire datasets into Pandas DataFrames crashed on large files
  • Fixed metrics: Adding new metrics required modifying core code
  • Tight coupling: You needed all dependencies even if using just one metric

How V2 Solves This

V2 was designed around four key principles:

  1. Streaming: Process data in batches without loading everything into memory
  2. Modularity: Install only what you need (don't need PyTorch? Don't install it!)
  3. Extensibility: Add new metrics via plugins without touching core code
  4. Unified API: One consistent interface for all metric types
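The extensibility principle can be sketched as a small plugin registry. The names below (`register_metric`, `Completeness`) are illustrative only, not the actual DQM-ML V2 API:

```python
# Hypothetical sketch of a plugin-style metric registry.
# Names are illustrative, not the real DQM-ML V2 interface.
METRICS = {}

def register_metric(name):
    """Decorator that adds a metric class to the registry."""
    def wrap(cls):
        METRICS[name] = cls
        return cls
    return wrap

@register_metric("completeness")
class Completeness:
    """Fraction of non-missing values, accumulated batch by batch."""
    def __init__(self):
        self.non_missing = 0
        self.total = 0

    def update(self, batch):
        # Intermediate stats: counts only, never the raw data.
        self.non_missing += sum(1 for v in batch if v is not None)
        self.total += len(batch)

    def result(self):
        return self.non_missing / self.total

metric = METRICS["completeness"]()
metric.update([1, None, 3])
metric.update([4, 5, None])
print(metric.result())  # 4 of 6 values present -> ~0.667
```

Because new metrics register themselves, adding one never requires touching core code, which is the point of principle 3.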

DQM-ML V2 Architecture

Here's how data flows through the DQM-ML V2 system:

```mermaid
flowchart LR
    A1[Parquet Files] --> B[DataLoader]
    A2[CSV Files] --> B
    A3[Databases] --> B
    B --> C[Streaming Batches]
    C --> D[Metric Processor]
    D --> E[Intermediate Stats]
    E --> F[Final Metrics]
    F --> G[Output Writer]
    G --> H1[Parquet Files]
    G --> H2[CSV Files]
    G --> H3[Dashboards]
```

How it works:

  1. DataLoader loads your data (Parquet, CSV, etc.)
  2. Streaming Batches split the data into chunks, so the whole dataset is never loaded into memory at once
  3. Metric Processor computes features and intermediate statistics for each batch
  4. Intermediate Stats accumulate as batches are processed
  5. Final Metrics aggregate all intermediate stats into dataset-level scores
  6. Output Writer saves results to your preferred format
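The six steps above can be sketched in miniature. This is illustrative code showing the batch → intermediate stats → final metric flow, not the real DQM-ML V2 internals:

```python
# Illustrative sketch of the streaming pipeline described above
# (not the actual DQM-ML V2 implementation).
def stream_batches(values, batch_size):
    """Yield fixed-size chunks so only one batch is in memory at a time."""
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]

def process(values, batch_size=3):
    # Intermediate stats: a running sum and count, updated per batch.
    total, count = 0.0, 0
    for batch in stream_batches(values, batch_size):
        total += sum(batch)
        count += len(batch)
    # Final metric: aggregate the intermediate stats into one score.
    return total / count

print(process([2, 4, 6, 8, 10, 12]))  # mean computed batch by batch -> 7.0
```

The key property is that each batch is discarded after its contribution to the intermediate stats is recorded, so memory use depends on the batch size rather than the dataset size.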

Memory Efficiency (Why Streaming Matters)

Unlike V1, which loads entire datasets into memory, V2 processes data in batches:

```mermaid
flowchart LR
    subgraph SG1["V1 (old way)"]
        direction TB
        V1["Load entire dataset into RAM"]
        V1c["Process all at once"]
    end
    subgraph SG2["V2 (streaming)"]
        direction TB
        V2a["Load batch 1 (e.g., 10K rows)"]
        V2p1["Process batch 1 → stats"]
        V2b["Load batch 2"]
        V2p2["Process batch 2 → stats"]
        V2c["..."]
        V2agg["Aggregate all batch stats"]
    end
    style SG1 fill:#ffcdd2
    style SG2 fill:#c8e6c9
```

Why This Matters

Key difference: With streaming, you can now process datasets larger than your available RAM. Whether you have a 100MB or 100GB file, memory usage stays constant.

| Dataset Size | V1 Memory | V2 Memory |
|--------------|-----------|-----------|
| 100 MB       | ~300 MB   | ~10 MB    |
| 1 GB         | ~3 GB     | ~10 MB    |
| 100 GB       | Crashes   | ~10 MB    |
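A minimal sketch of why memory stays flat: when the dataset is consumed lazily, only one batch ever exists in RAM, regardless of total size. The names here are illustrative, not DQM-ML V2 API:

```python
# Illustrative: a lazily generated "dataset" of a million rows is reduced
# to a metric while holding at most one batch in memory at a time.
import itertools

def synthetic_rows(n):
    """Lazily yields n rows; never allocates them all at once."""
    return (float(i % 100) for i in range(n))

def batched(iterable, size):
    """Group any iterable into lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def streaming_max(rows, batch_size=10_000):
    best = float("-inf")
    for batch in batched(rows, batch_size):  # RAM ~ batch_size floats
        best = max(best, max(batch))
    return best

print(streaming_max(synthetic_rows(1_000_000)))  # -> 99.0
```

Swap `synthetic_rows` for a real chunked file reader and the memory profile is the same: proportional to the batch size, not the file size, which is what the table above reflects.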

Performance Improvements

V2 shows significant improvements over V1:

| Metric              | V1                  | V2                    | Improvement             |
|---------------------|---------------------|-----------------------|-------------------------|
| Memory usage        | Full dataset in RAM | Constant (batch size) | ~10-100x less           |
| Large Parquet files | Slow / crashes      | Fast streaming        | ~2-5x faster            |
| Adding new metrics  | Modify core code    | Plugin system         | No core changes needed  |

What's Different from V1

| Feature        | V1               | V2                            |
|----------------|------------------|-------------------------------|
| Data handling  | Load into memory | Stream in batches             |
| New metrics    | Modify core      | Plugin system                 |
| Dependencies   | All or nothing   | Install only what you need    |
| API            | Ad-hoc           | Unified `DatametricProcessor` |
| Image features | Separate tool    | Built into pipeline           |

The legacy dqm-ml package is still available for reference, but new development should use the V2 API.