# dqm_ml_core.api

API modules for DQM ML Core.

This package contains the base API components for data metric processors, including the `DatametricProcessor` base class that all metric processors must inherit from.

`__all__ = ['DatametricProcessor']` (module-attribute)

## DatametricProcessor

Base class for all Data Quality metrics and feature extractors.
The processor follows a streaming lifecycle designed to handle large datasets without loading them entirely into memory:
- Feature Extraction (`compute_features`): transformation of raw data into relevant features (e.g., image -> luminosity).
- Batch Aggregation (`compute_batch_metric`): compression of features into intermediate statistics (e.g., count, partial sum, histogram).
- Global Computation (`compute`): final aggregation of all batch-level statistics into dataset-level scores.
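The three lifecycle stages can be sketched end to end. This is a hedged, self-contained illustration: the luminosity metric, the function bodies, and the driver loop are hypothetical, and plain Python lists stand in for the pyarrow `RecordBatch`/`Array` objects the real API uses.

```python
# Hypothetical sketch of the streaming lifecycle (not dqm-ml-core code).
# Plain lists stand in for pyarrow RecordBatches/Arrays so the sketch is
# self-contained.

def compute_features(batch):
    # Feature extraction: raw pixels -> per-image luminosity.
    return {"luminosity": [sum(img) / len(img) for img in batch["pixels"]]}

def compute_batch_metric(features):
    # Batch aggregation: compress features into compact partial statistics.
    lum = features["luminosity"]
    return {"count": len(lum), "partial_sum": sum(lum)}

def compute(batch_metrics):
    # Global computation: combine all batch-level statistics.
    total = sum(batch_metrics["partial_sum"])
    n = sum(batch_metrics["count"])
    return {"mean_luminosity": total / n}

# Streaming driver: only compact statistics survive across batches, so the
# full dataset is never held in memory.
batches = [{"pixels": [[2, 2], [4, 4]]}, {"pixels": [[6, 6]]}]
partials = {"count": [], "partial_sum": []}
for batch in batches:
    stats = compute_batch_metric(compute_features(batch))
    for key, value in stats.items():
        partials[key].append(value)

result = compute(partials)  # {'mean_luminosity': 4.0}
```

The key property is that memory use scales with the number of batches (one small dict each), not with the number of rows.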
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
- `config = config or {}` (instance-attribute)
- `input_columns = self.config['input_columns']` (instance-attribute)
- `name = name` (instance-attribute)
- `outputs_columns = self.config['output_columns']` (instance-attribute)
### __init__(name: str, config: dict[str, Any] | None)

Initialize the dataset processor.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Unique name of the processor instance. | *required* |
| `config` | `dict[str, Any] \| None` | Configuration dictionary (optional). | *required* |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
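Given the instance attributes read in `__init__` (`self.config['input_columns']` and `self.config['output_columns']`), a configuration dictionary plausibly looks like the following. The column names are illustrative, not part of the library:

```python
# Hypothetical processor configuration; only the 'input_columns' and
# 'output_columns' keys are implied by the attributes shown above.
config = {
    "input_columns": ["image"],        # raw columns the processor reads
    "output_columns": ["luminosity"],  # feature columns it produces
}
```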
### compute(batch_metrics: dict[str, pa.Array]) -> dict[str, Any]

Perform the final dataset-level metric calculation.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch_metrics` | `dict[str, Array]` | The aggregated intermediate statistics from all batches. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary containing the final metrics. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
### compute_batch_metric(features: dict[str, pa.Array]) -> dict[str, pa.Array]

Aggregate features into intermediate statistics for the current batch.

This method is critical for scalability. It should return a compact representation of the data (e.g., partial sums) that can be efficiently combined later.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `features` | `dict[str, Array]` | Dictionary of feature arrays computed on the batch. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Array]` | A dictionary of aggregated statistics. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
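The "compact, combinable" requirement also covers histograms: per-batch counts over fixed bins merge by element-wise addition, whereas raw feature values would grow with dataset size. A minimal illustration (helper names are hypothetical, not library API):

```python
# Illustrative only: why batch-level statistics should be mergeable.

def batch_histogram(values, edges):
    # Count how many values fall into each half-open bin
    # [edges[i], edges[i+1]).
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = [0, 5, 10, 15]
h1 = batch_histogram([1, 6, 7], edges)    # [1, 2, 0]
h2 = batch_histogram([2, 12], edges)      # [1, 0, 1]

# Merging two batches is element-wise addition; size stays fixed.
merged = [a + b for a, b in zip(h1, h2)]  # [2, 2, 1]
```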
### compute_delta(source: dict[str, Any], target: dict[str, Any]) -> dict[str, Any]

Compare metrics between two different data selections.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `source` | `dict[str, Any]` | Final metrics from the source data selection. | *required* |
| `target` | `dict[str, Any]` | Final metrics from the target data selection. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Any]` | A dictionary containing distance or difference scores. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
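One plausible shape for such a comparison, sketched with absolute differences over the metric keys present in both dicts. This is a hedged illustration of the contract, not the library's actual delta logic:

```python
# Hypothetical compute_delta sketch: absolute differences over shared keys.

def compute_delta(source, target):
    shared = source.keys() & target.keys()
    return {f"delta_{k}": abs(source[k] - target[k]) for k in shared}

src = {"mean_luminosity": 4.0, "completeness": 0.98}
tgt = {"mean_luminosity": 5.5, "completeness": 0.95}

delta = compute_delta(src, tgt)
# delta_mean_luminosity == 1.5, delta_completeness ~ 0.03
```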
### compute_features(batch: pa.RecordBatch, prev_features: dict[str, pa.Array]) -> dict[str, pa.Array]

Transform a raw data batch into features.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batch` | `RecordBatch` | The input pyarrow RecordBatch. | *required* |
| `prev_features` | `dict[str, Array]` | Features already computed by preceding processors. | *required* |

Returns:

| Type | Description |
|---|---|
| `dict[str, Array]` | A dictionary mapping feature names to pyarrow Arrays. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
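The `prev_features` argument lets processors chain: a downstream processor can reuse a feature an upstream one already extracted instead of recomputing it from raw columns. A hedged sketch with hypothetical feature names, using plain lists in place of pyarrow Arrays:

```python
# Illustrative chaining of feature extraction (names are hypothetical).

def upstream_features(batch, prev_features):
    # Extracts luminosity from the raw image column.
    return {"luminosity": [sum(img) / len(img) for img in batch["image"]]}

def downstream_features(batch, prev_features):
    # Reuses the upstream 'luminosity' feature rather than re-reading pixels.
    lum = prev_features["luminosity"]
    return {"is_dark": [v < 3 for v in lum]}

batch = {"image": [[1, 1], [8, 8]]}
feats = upstream_features(batch, {})
feats.update(downstream_features(batch, feats))
# feats["is_dark"] == [True, False]
```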
### generated_features() -> list[str]

Return the list of columns generated by this processor during feature extraction.

Returns:

| Type | Description |
|---|---|
| `list[str]` | A list of feature names. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
### generated_metrics() -> list[str]

Return the names of the final metrics produced by this processor.

Returns:

| Type | Description |
|---|---|
| `list[str]` | A list of metric names. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py
### needed_columns() -> list[str]

Return the list of raw input columns required for feature extraction.

Returns:

| Type | Description |
|---|---|
| `list[str]` | A list of column names. |
Source code in packages/dqm-ml-core/src/dqm_ml_core/api/data_processor.py