dqm_ml_job.dataloaders.proto

Protocol definitions for data loaders and selections.

This module contains the DataLoader and DataSelection protocol classes that define the interface for data loading implementations.

DataLoader

Bases: Protocol

Protocol for Data Loader factories.

A DataLoader is responsible for scanning a source (disk, DB, S3) and discovering available DataSelections based on its configuration.
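
To illustrate the factory role described above, here is a minimal sketch of a disk-scanning loader. It is not the package's own implementation: FolderLoader and FolderSelection are hypothetical names, the two protocols are re-declared locally so the snippet runs without dqm_ml_job installed, and plain dicts stand in for pyarrow.RecordBatch.

```python
from pathlib import Path
from typing import Any, Iterator, Protocol, runtime_checkable


@runtime_checkable
class DataSelection(Protocol):
    # Local copy of the protocol so the sketch runs standalone.
    name: str

    def bootstrap(self, columns_list: list[str]) -> None: ...
    def get_nb_batches(self) -> int: ...
    def __iter__(self) -> Any: ...


@runtime_checkable
class DataLoader(Protocol):
    # Local copy of the protocol so the sketch runs standalone.
    def get_selections(self) -> list[DataSelection]: ...


class FolderSelection:
    """Hypothetical selection: one sub-folder, one batch per file."""

    def __init__(self, folder: Path) -> None:
        self.name = folder.name
        self._files = sorted(p for p in folder.iterdir() if p.is_file())
        self._columns: list[str] = []

    def bootstrap(self, columns_list: list[str]) -> None:
        # Remember the requested columns (unused by this stub).
        self._columns = list(columns_list)

    def get_nb_batches(self) -> int:
        return len(self._files)

    def __iter__(self) -> Iterator[dict[str, list[Any]]]:
        # Stand-in batches: a real selection would yield pyarrow.RecordBatch.
        for path in self._files:
            yield {"path": [str(path)]}


class FolderLoader:
    """Hypothetical loader: every sub-folder of `root` becomes a selection."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def get_selections(self) -> list[DataSelection]:
        folders = sorted(p for p in self._root.iterdir() if p.is_dir())
        return [FolderSelection(folder) for folder in folders]
```

In a real project the protocols would be imported from dqm_ml_job.dataloaders.proto rather than re-declared. Neither class inherits from a protocol; they conform structurally.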

Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
@runtime_checkable
class DataLoader(Protocol):
    """
    Protocol for Data Loader factories.

    A DataLoader is responsible for scanning a source (disk, DB, S3) and
    discovering available DataSelections based on its configuration.
    """

    def get_selections(self) -> list[DataSelection]:
        """
        Discover and return the list of available selections for this loader.

        Returns:
            A list of initialized DataSelection instances.
        """

get_selections() -> list[DataSelection]

Discover and return the list of available selections for this loader.

Returns:

Type                 Description
list[DataSelection]  A list of initialized DataSelection instances.

Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
def get_selections(self) -> list[DataSelection]:
    """
    Discover and return the list of available selections for this loader.

    Returns:
        A list of initialized DataSelection instances.
    """

DataSelection

Bases: Protocol

Protocol for a specific subset of data discovered by a DataLoader.

A DataSelection represents a concrete set of samples (e.g., a specific folder, a filtered view of a database, or a single file) and provides an iterator over data batches.

Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
@runtime_checkable
class DataSelection(Protocol):
    """
    Protocol for a specific subset of data discovered by a DataLoader.

    A DataSelection represents a concrete set of samples (e.g., a specific folder,
    a filtered view of a database, or a single file) and provides an iterator
    over data batches.
    """

    name: str

    def bootstrap(self, columns_list: list[str]) -> None:
        """
        Perform initial setup for the selection before iteration starts.

        Args:
            columns_list: List of column names that must be loaded for this selection.
        """

    def get_nb_batches(self) -> int:
        """
        Return the estimated number of batches in this selection.

        Used primarily for progress bar estimation.
        """

    def __iter__(self) -> Any:
        """
        Iterate over the selection, yielding pyarrow.RecordBatch objects.
        """

name: str (instance attribute)
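
Because the protocol is decorated with @runtime_checkable, conformance can be tested with isinstance(). The sketch below assumes a local copy of the protocol, and ListSelection and MissingBatchCount are hypothetical classes. The check only confirms that name and the three methods are present; it does not validate signatures or return types.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataSelection(Protocol):
    # Local copy of the protocol members so the sketch runs standalone.
    name: str

    def bootstrap(self, columns_list: list[str]) -> None: ...
    def get_nb_batches(self) -> int: ...
    def __iter__(self) -> Any: ...


class ListSelection:
    """Hypothetical conforming class; it does not inherit from DataSelection."""

    def __init__(self, name: str, batches: list[Any]) -> None:
        self.name = name
        self._batches = batches

    def bootstrap(self, columns_list: list[str]) -> None:
        pass  # nothing to prepare for in-memory data

    def get_nb_batches(self) -> int:
        return len(self._batches)

    def __iter__(self) -> Any:
        return iter(self._batches)


class MissingBatchCount:
    """Hypothetical non-conforming class: get_nb_batches is absent."""

    name = "broken"

    def bootstrap(self, columns_list: list[str]) -> None:
        pass

    def __iter__(self) -> Any:
        return iter(())


def conforms(obj: object) -> bool:
    # isinstance() only checks that every protocol member is present.
    return isinstance(obj, DataSelection)
```

A static type checker remains the better tool for catching signature mismatches; the runtime check is mainly useful as a guard when selections come from plugins or configuration.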

__iter__() -> Any

Iterate over the selection, yielding pyarrow.RecordBatch objects.
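
A consumer might drive a selection as sketched below: bootstrap first, then iterate. This is an assumption about typical usage, not code from the package. StubBatch is a hypothetical stand-in that exposes only num_rows (an attribute pyarrow.RecordBatch also provides), and StubSelection and count_rows are illustrative names.

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class StubBatch:
    """Stand-in for pyarrow.RecordBatch, exposing only num_rows."""

    num_rows: int


class StubSelection:
    """Hypothetical selection yielding three fixed-size stub batches."""

    name = "stub"

    def __init__(self) -> None:
        self.columns: list[str] = []

    def bootstrap(self, columns_list: list[str]) -> None:
        self.columns = list(columns_list)

    def get_nb_batches(self) -> int:
        return 3

    def __iter__(self) -> Iterator[StubBatch]:
        # A generator function: every for-loop gets a fresh iterator.
        for _ in range(3):
            yield StubBatch(num_rows=4)


def count_rows(selection: Any, columns: list[str]) -> int:
    """Bootstrap the selection, then sum the rows of every batch it yields."""
    selection.bootstrap(columns)  # setup happens before iteration starts
    return sum(batch.num_rows for batch in selection)
```

Writing __iter__ as a generator function lets the same selection be traversed more than once, which a one-shot iterator stored on the instance would not allow.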

Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
def __iter__(self) -> Any:
    """
    Iterate over the selection, yielding pyarrow.RecordBatch objects.
    """

bootstrap(columns_list: list[str]) -> None

Perform initial setup for the selection before iteration starts.

Parameters:

Name          Type       Description                                                   Default
columns_list  list[str]  List of column names that must be loaded for this selection.  required
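
One plausible use of bootstrap is to record the required columns and validate them before any batch is produced. The sketch below assumes that behavior; the protocol itself does not prescribe it. ProjectedSelection is a hypothetical class, and dict batches stand in for pyarrow.RecordBatch.

```python
from typing import Any, Iterator


class ProjectedSelection:
    """Hypothetical selection that keeps only the columns given to bootstrap()."""

    def __init__(self, name: str, batches: list[dict[str, list[Any]]]) -> None:
        self.name = name
        self._batches = batches
        self._columns: list[str] = []

    def bootstrap(self, columns_list: list[str]) -> None:
        # Fail early if a required column is missing from any batch.
        for batch in self._batches:
            missing = [c for c in columns_list if c not in batch]
            if missing:
                raise KeyError(f"{self.name}: missing columns {missing}")
        self._columns = list(columns_list)

    def get_nb_batches(self) -> int:
        return len(self._batches)

    def __iter__(self) -> Iterator[dict[str, list[Any]]]:
        # dict batches stand in for pyarrow.RecordBatch in this sketch.
        for batch in self._batches:
            yield {c: batch[c] for c in self._columns}
```

Raising in bootstrap rather than during iteration is a design choice of this sketch: it surfaces configuration errors before any data is processed.
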
Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
def bootstrap(self, columns_list: list[str]) -> None:
    """
    Perform initial setup for the selection before iteration starts.

    Args:
        columns_list: List of column names that must be loaded for this selection.
    """

get_nb_batches() -> int

Return the estimated number of batches in this selection.

Used primarily for progress bar estimation.
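
When the row count and batch size are known up front, the estimate reduces to a ceiling division. The sketch below assumes an in-memory source. RowSelection and progress_lines are hypothetical names, and lists of rows stand in for pyarrow.RecordBatch.

```python
import math
from typing import Any, Iterator


class RowSelection:
    """Hypothetical selection that slices an in-memory row list into batches."""

    def __init__(self, name: str, rows: list[Any], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.name = name
        self._rows = rows
        self._batch_size = batch_size

    def bootstrap(self, columns_list: list[str]) -> None:
        pass  # no setup needed for in-memory rows

    def get_nb_batches(self) -> int:
        # Ceiling division: a final partial batch still counts as one batch.
        return math.ceil(len(self._rows) / self._batch_size)

    def __iter__(self) -> Iterator[list[Any]]:
        # Lists of rows stand in for pyarrow.RecordBatch in this sketch.
        for start in range(0, len(self._rows), self._batch_size):
            yield self._rows[start:start + self._batch_size]


def progress_lines(selection: RowSelection) -> list[str]:
    """Return one 'done/total' string per batch, as a progress display might."""
    total = selection.get_nb_batches()
    return [
        f"{selection.name}: {done}/{total}"
        for done, _ in enumerate(selection, start=1)
    ]
```

For sources where the count is expensive to compute, such as a filtered database view, an approximate value is consistent with the documented "estimated" wording.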

Source code in packages/dqm-ml-job/src/dqm_ml_job/dataloaders/proto.py
def get_nb_batches(self) -> int:
    """
    Return the estimated number of batches in this selection.

    Used primarily for progress bar estimation.
    """