Configuration Guide

DQM-ML pipelines are configured using YAML files. This guide explains how to write configuration files that tell DQM-ML where to find data, which metrics to compute, and where to save results.

Configuration Structure Overview

flowchart TB
  subgraph "config.yaml"
    direction TB
    subgraph "dataloaders"
      DL1["data_1: parquet"]
      DL2["data_2: csv"]
      DLn["..."]
    end
    subgraph "metrics_processor"
      MP1["completeness"]
      MP2["domain_gap"]
      MPn["..."]
    end
    subgraph "outputs"
      O1["metrics"]
      O2["delta_metrics"]
    end
  end
  DL1 --> MP1
  DL2 --> MP1
  DL1 --> MP2
  DLn --> MPn
  MP1 --> O1
  MP2 --> O2

Configuration has 3 main sections:

| Section | Description |
| --- | --- |
| dataloaders | Where to load data from (files, filters, splits) |
| metrics_processor | Which metrics to compute |
| outputs | Where to save results |

Configuration Structure

A DQM-ML config has three main sections:

1. dataloaders - Where's the data?

Defines where your data comes from. You can define multiple data sources.

dataloaders:
  # Give your data a memorable name
  my_training_data:
    # Use "parquet" for .parquet files, "csv" for .csv files
    type: parquet
    # Path to the file (relative paths work too!)
    path: "data/train.parquet"
    # Optional: batch size for processing (default: 10000)
    batch_size: 5000

2. metrics_processor - What to compute

Defines which metrics or feature extractors to run on your data.

metrics_processor:
  # Give each metric a unique name
  null_check:
    # The type of metric (from the registry)
    type: completeness
    # Which columns to analyze
    input_columns: ["age", "income", "email"]
    # Include per-column scores?
    include_per_column: true
    # Include overall score?
    include_overall: true

  image_quality:
    type: visual_metric
    input_columns: ["image_bytes"]
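
The completeness entry above can be illustrated with a short sketch. This is not DQM-ML's actual implementation, only the intuition: per-column completeness is the fraction of non-null values, and the overall score averages the per-column scores (the per_column and overall names here are illustrative).

```python
# Illustrative sketch of a completeness metric: fraction of non-null
# values per column, plus an overall average. Not DQM-ML's real code.
def completeness(rows, input_columns):
    per_column = {}
    for col in input_columns:
        values = [row.get(col) for row in rows]
        non_null = sum(1 for v in values if v is not None)
        per_column[col] = non_null / len(values) if values else 0.0
    overall = sum(per_column.values()) / len(per_column)
    return {"per_column": per_column, "overall": overall}

rows = [
    {"age": 34, "income": 52000, "email": "a@x.com"},
    {"age": None, "income": 61000, "email": None},
    {"age": 29, "income": None, "email": "c@x.com"},
    {"age": 41, "income": 48000, "email": "d@x.com"},
]
scores = completeness(rows, ["age", "income", "email"])
```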

3. outputs - Where to save results

Defines where to save generated features (optional - if omitted, only metrics are saved).

outputs:
  feature_output:
    type: parquet
    # Where to save
    path: "output/enriched_data.parquet"
    # Which columns to include (original + generated)
    columns: ["sample_id", "m_luminosity", "m_blur_level"]
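
The columns list controls which fields end up in the saved file. Conceptually, it is a projection step like the sketch below (illustrative only, not DQM-ML internals):

```python
# Illustrative: keep only the configured columns when writing output rows.
def project_columns(rows, columns):
    return [{col: row.get(col) for col in columns} for row in rows]

enriched = [
    {"sample_id": 1, "m_luminosity": 0.8, "m_blur_level": 0.1, "raw": "x"},
    {"sample_id": 2, "m_luminosity": 0.6, "m_blur_level": 0.3, "raw": "y"},
]
out = project_columns(enriched, ["sample_id", "m_luminosity", "m_blur_level"])
```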

Complete Example

Here's a full configuration file that does something useful:

dataloaders:
  train_data:
    type: parquet
    path: "data/train.parquet"
    batch_size: 10000

metrics_processor:
  # Check for missing values
  completeness:
    type: completeness
    input_columns: ["age", "income", "zip_code"]
    include_per_column: true
    include_overall: true

  # Extract image features
  visual:
    type: visual_metric
    input_columns: ["image_data"]
    grayscale: true

outputs:
  save_results:
    type: parquet
    path: "output/enriched_data.parquet"
    columns: ["sample_id", "m_luminosity", "m_contrast"]

See Data Loaders for detailed documentation on:

  • Single selection (full dataset)
  • Filtered selection (filter: config)
  • Split selection (split_by: + split_values: config)
  • Selection names

Metrics Processor

See Metrics Computation for detailed documentation on:

  • Per-selection metrics
  • Delta metrics (pairwise comparisons)
  • Which metrics support delta
  • Output file formats

Output Files

Defines where to save generated metrics and features.

See Metrics Computation for detailed documentation on:

  • Per-selection output format
  • Delta output format
  • Filename patterns

Running Your Configuration

Once you've written your config file, run it with the -p (or --path-config) flag:

uv run dqm-ml process -p my_config.yaml

That's it! DQM-ML will:

  1. Load your data in batches
  2. Run the metrics you specified
  3. Save results to the output location
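
Those three steps can be sketched in a few lines of plain Python. This is a conceptual outline only — run_metric is a hypothetical stand-in, not DQM-ML's actual API:

```python
# Conceptual outline of the pipeline: batch the data, run a metric on
# each batch, collect results. Hypothetical names, not DQM-ML's API.
def batches(rows, batch_size):
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def run_metric(batch):
    # Stand-in for a registered metric: here it just counts rows.
    return len(batch)

rows = list(range(25))
results = [run_metric(b) for b in batches(rows, batch_size=10)]
# 25 rows at batch_size=10 yields batches of 10, 10, and 5 rows
```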

Choosing Your Data Format

| Format | Best For | Limitations |
| --- | --- | --- |
| Parquet | Large datasets, columnar data, analytics | Not human-readable |
| CSV | Small datasets, interoperability | Slower, larger files |
| Protobuf | Streaming, schema evolution | Requires schema definition |

Recommendation

  • < 1 GB: CSV is fine
  • > 1 GB: Use Parquet for speed and memory efficiency
  • Production pipelines: Parquet with compression

Batch Size Guide

How to choose the right batch size:

| Data Size | Recommended Batch Size | Why |
| --- | --- | --- |
| < 100 MB | 10,000 | Default works well |
| 100 MB - 1 GB | 10,000 - 50,000 | Balance memory/performance |
| 1 - 10 GB | 50,000 - 100,000 | Larger batches are more efficient |
| > 10 GB | 100,000+ | Adjust based on available RAM |

Tip: Start with the default (10,000) and increase if you have plenty of RAM. The smaller the batch, the more overhead.

Production vs Development

For production environments, consider:

  • Environment variables: Set DQM_ML_BATCH_SIZE to override config
  • Monitoring: Add logging to track metric computation time
  • Error handling: Use try/except around execute() for graceful failures
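
Those three points can be combined in a small wrapper. This is a hedged sketch: DQM_ML_BATCH_SIZE is the variable named above, but the execute callable and the surrounding structure are hypothetical, not DQM-ML's real API.

```python
import logging
import os
import time

# Hedged production-wrapper sketch: env-var override, timing, and
# graceful failure. The execute() callable is hypothetical.
def resolve_batch_size(config_value, default=10000):
    # DQM_ML_BATCH_SIZE, when set, overrides the config file value.
    env = os.environ.get("DQM_ML_BATCH_SIZE")
    if env is not None:
        return int(env)
    return config_value if config_value is not None else default

def run_with_monitoring(execute, batch_size):
    start = time.monotonic()
    try:
        execute(batch_size)
    except Exception:
        logging.exception("metric computation failed")
        return False
    logging.info("metrics computed in %.1fs", time.monotonic() - start)
    return True

os.environ["DQM_ML_BATCH_SIZE"] = "20000"  # simulate a production override
effective = resolve_batch_size(5000)
```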

For development:

  • Use smaller sample data to iterate faster
  • Enable debug logging with logging.basicConfig(level=logging.DEBUG)

Common Patterns

Running Multiple Metrics

metrics_processor:
  completeness_check:
    type: completeness
    input_columns: ["col_a", "col_b"]

  representativeness_check:
    type: representativeness
    input_columns: ["feature_x", "feature_y"]
    distribution: "normal"
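
The type field selects an implementation from the registry. Conceptually, that dispatch looks like the sketch below — the metric functions and registry contents here are illustrative stand-ins, not DQM-ML's actual registry:

```python
# Illustrative metric registry: the config's "type" field picks the
# function; the remaining keys become its parameters.
def completeness_metric(rows, input_columns, **params):
    return {"metric": "completeness", "columns": input_columns}

def representativeness_metric(rows, input_columns, distribution="normal", **params):
    return {"metric": "representativeness", "distribution": distribution}

REGISTRY = {
    "completeness": completeness_metric,
    "representativeness": representativeness_metric,
}

config = {
    "completeness_check": {
        "type": "completeness",
        "input_columns": ["col_a", "col_b"],
    },
    "representativeness_check": {
        "type": "representativeness",
        "input_columns": ["feature_x", "feature_y"],
        "distribution": "normal",
    },
}

results = {}
for name, spec in config.items():
    params = {k: v for k, v in spec.items() if k != "type"}
    results[name] = REGISTRY[spec["type"]](rows=[], **params)
```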

Using split_by for Multiple Selections

Instead of defining multiple dataloaders, use split_by to create multiple selections from a single source:

dataloaders:
  data:
    type: parquet
    path: "data/all.parquet"
    split_by: dataset
    split_values: [train, test]

metrics_processor:
  completeness:
    type: completeness
    input_columns: ["col_a"]

This creates data_train and data_test selections, and computes completeness for both in one run.
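
What split_by does is conceptually a group-by on the named column: each value in split_values becomes its own selection, named source_value. A plain-Python sketch of that behavior (illustrative only, not DQM-ML internals):

```python
# Illustrative: split one source into named selections by column value,
# mirroring split_by/split_values behavior described above.
def split_selections(name, rows, split_by, split_values):
    selections = {}
    for value in split_values:
        selections[f"{name}_{value}"] = [r for r in rows if r[split_by] == value]
    return selections

rows = [
    {"dataset": "train", "col_a": 1},
    {"dataset": "train", "col_a": 2},
    {"dataset": "test", "col_a": 3},
]
selections = split_selections("data", rows, split_by="dataset",
                              split_values=["train", "test"])
```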