Tests of Completeness and Representativeness
This notebook replays the V1 example notebook on completeness and representativeness using the new V2 API and its optimized reimplementation of the metrics. The full set of reimplemented metrics will be provided in the next version (2.0.0).
- V1 code is commented out where needed.
- The V2 replacement is included alongside it.
In [1]:
# Common Libraries
import numpy as np
import pandas as pd
Tests on the Welding Challenge data
This dataset is described on the Welding Quality Detection Challenge website: https://confianceai.github.io/Welding-Quality-Detection-Challenge/
In [2]:
# Load the data
# Dataset path
path = "datasets/challenge_welding.csv"
data = pd.read_csv(path, sep=",")
# data.info()  # prints its summary directly and returns None, so it should not be wrapped in print()
# print("The first 5 lines:\n", data.head())
data
Out[2]:
| | sample_id | class | timestamp | welding-seams | labelling_type | resolution | path | sha256 | storage_type | data_origin | blur_level | blur_class | luminosity_level | external_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | data_37966 | OK | 08/07/2022 02:05 | c33 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'0F\\\x12\xa1\x81\xcdL\xc3~\xbe\x03\xc90\x91\... | s3 | real | 1667.425860 | clean | 51.079433 | http://minio-storage.apps.confianceai-public.i... |
| 1 | data_25403 | OK | 19/07/2022 00:51 | c102 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'\tlM\xd7\x83\\\x7fm\xaeV\xc6\xbd\xd6S@\x17S\... | s3 | real | 868.574712 | blur | 32.825601 | http://minio-storage.apps.confianceai-public.i... |
| 2 | data_27038 | OK | 28/07/2022 16:08 | c102 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'\xeb\xba/\xe0-\xe6\x14\xd6\xe0\xf2oT\xd1\xa9... | s3 | real | 1078.402671 | clean | 35.192525 | http://minio-storage.apps.confianceai-public.i... |
| 3 | data_8767 | OK | 31/01/20 09:41 | c33 | expert | [1920 1080] | challenge-welding/datasets/welding-detection-c... | b"\x9e@!\xd3\xdd\xaa'\xb7]\xeb\xf7B\x800\xf1\x... | s3 | real | 1240.198851 | clean | 38.510588 | http://minio-storage.apps.confianceai-public.i... |
| 4 | data_4744 | OK | 25/11/19 21:57 | c20 | expert | [1920 1080] | challenge-welding/datasets/welding-detection-c... | b'\x0c\xd8\xcfb\x95\\\xb9\xa8w\xe7^\xea\rX,8\x... | s3 | real | 3371.897144 | clean | 46.001366 | http://minio-storage.apps.confianceai-public.i... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22748 | data_66456 | OK | 29/10/19 01:33 | c102 | expert | [1920 1080] | challenge-welding/datasets/welding-detection-c... | b'\x9fZ^\xc5j\xf1\xd8JRw\x03\xa4C\x04\xa6\xa1\... | s3 | real | 4463.079210 | clean | 41.554284 | http://minio-storage.apps.confianceai-public.i... |
| 22749 | data_24294 | OK | 07/07/2022 01:40 | c102 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'\x05\x07x\xa4\xb8\x01\xf1a\x18.\x13\xe4\xe6\... | s3 | real | 859.194594 | blur | 31.966749 | http://minio-storage.apps.confianceai-public.i... |
| 22750 | data_25568 | OK | 19/07/2022 11:59 | c102 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'O\x0eT\xa1\x8b\xa0\xf5\xaei\x81\xe8\x16\x96p... | s3 | real | 977.316444 | clean | 40.872106 | http://minio-storage.apps.confianceai-public.i... |
| 22751 | data_39356 | OK | 13/07/2022 06:45 | c33 | operator | [960 540] | challenge-welding/datasets/welding-detection-c... | b'\x1bh\xd9\x8aK\x88\xc1\xcb\xea\xff\xe3]{\x0e... | s3 | real | 1653.447154 | clean | 50.429395 | http://minio-storage.apps.confianceai-public.i... |
| 22752 | data_8988 | OK | 09/03/20 12:22 | c33 | expert | [1920 1080] | challenge-welding/datasets/welding-detection-c... | b'\xce\x0e\x87\xd7\x9b\x8d\xc4\x10[\xe9\x8fl\x... | s3 | real | 593.528212 | blur | 39.343171 | http://minio-storage.apps.confianceai-public.i... |
22753 rows × 14 columns
Completeness
See also the file dqm/completeness/main.py.
In [3]:
## Specific libraries: dqm-ml V1
from dqm.completeness.metric import DataCompleteness
## Specific libraries: dqm-ml V2
from dqm_ml_core import CompletenessProcessor, MetricRunner
In [4]:
# Compute completeness on the whole dataset
# V1 API
completeness_evaluator = DataCompleteness()
overall_score = completeness_evaluator.completeness_tabular(data)
print("V1 : ", overall_score)
# V2 API: a runner applies a list of metric processors to the data
runner = MetricRunner()
metric = CompletenessProcessor(config={"include_per_column": False})
metrics_values = runner.run(data, [metric])
print("V2 : ", metrics_values)
V1 : 1.0
V2 : {'completeness_overall': 1.0}
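As a rough illustration of what a score of 1.0 means here, the sketch below computes completeness as the share of non-missing cells, both overall and per column. This definition, the helper names `is_missing` and `completeness`, and the toy table are assumptions made for illustration only; the library's exact rule for what counts as missing may differ. It uses only the standard library so that it runs without the dataset.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness(table):
    """Return (overall, per_column) shares of non-missing cells.

    `table` is a dict mapping column names to equal-length, non-empty lists.
    """
    per_column = {}
    filled_total = 0
    cells_total = 0
    for name, values in table.items():
        filled = sum(not is_missing(v) for v in values)
        per_column[name] = filled / len(values)
        filled_total += filled
        cells_total += len(values)
    return filled_total / cells_total, per_column


# Hypothetical toy table with one missing value in each column.
toy = {
    "blur_level": [1667.4, 868.6, float("nan"), 1240.2],
    "blur_class": ["clean", "blur", "clean", None],
}
overall, per_column = completeness(toy)
print(overall)      # 6 of the 8 cells are filled
print(per_column)
```

Under this definition, a dataset such as the welding table, which has no missing cells, scores 1.0 overall and in every column.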
In [5]:
# Compute completeness on a single column (V1 API)
column_score = completeness_evaluator.data_completion(data['blur_level'])
# Print the results
print(f'Overall Data Completeness Score: {overall_score}')
print(f'Completeness Score for Column: {column_score}')
# With API V2: configure the column; its name becomes part of the metric key in the output dict
metric = CompletenessProcessor(config={"input_columns": ["blur_level"]})
metrics_values = runner.run(data, [metric])
print(f"Overall Data Completeness Score: {metrics_values['completeness_overall']}")
print(f"Completeness Score for Column: {metrics_values['completeness_blur_level']}")
Overall Data Completeness Score: 1.0
Completeness Score for Column: 1
Overall Data Completeness Score: 1.0
Completeness Score for Column: 1.0
Representativeness
In [6]:
## Specific Libraries
from dqm.representativeness.metric import DistributionAnalyzer
[12-17 13:09:10] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/utils/twe_logger.py:103} INFO - Logger: twe_logger, handlers: [<StreamHandler stdout (DEBUG)>]
[12-17 13:09:10] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/utils/twe_logger.py:103} INFO - Logger: twe_logger, handlers: [<StreamHandler stdout (DEBUG)>]
In [7]:
var = data["blur_level"] # you can choose another variable from the data.
mean = np.mean(var)
std = np.std(var)
# data = pd.Series(np.random.normal(0, 1, 1000))
print(mean,std)
1650.9658847770813 946.0829447223126
In [8]:
# Parameters for analysis
bins = 20
distribution = 'normal'
# Instantiation of DistributionAnalyzer
analyzer = DistributionAnalyzer(var, bins, distribution)
[12-17 13:09:11] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/utils/twe_logger.py:103} INFO - Logger: twe_logger, handlers: [<StreamHandler stdout (DEBUG)>]
In [9]:
# Using the method chisquare_test
pvalue, _ = analyzer.chisquare_test()
print(f"Chi-Square Test: p-value = {pvalue}")
[12-17 13:09:14] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/representativeness/metric.py:170} INFO - pvalue = 0.0 < 0.05: Data is not following the normal distribution
Chi-Square Test: p-value = 0.0
In [10]:
# Chi-square test with the V2 API
## Specific libraries: dqm-ml V2
from dqm_ml_core import RepresentativenessProcessor
bins = 20
distribution = 'normal'
metric = RepresentativenessProcessor(config={"metrics": ["chi-square"], "input_columns": ["blur_level"], "bins": bins, "distribution": distribution})
metrics_values = runner.run(data, [metric])
print(metrics_values)
print(f"Chi-Square Test: p-value = {metrics_values['chi-square_blur_level_p_value']}")
# V1 equivalent, kept for reference:
#pvalue, intervals_frequencies = analyzer.chisquare_test()
#print(f"Chi-Square Test: p-value = {pvalue}")
#print(f"Chi-Square Test: discretized intervals are \n = {intervals_frequencies}")
{'chi-square_blur_level_p_value': 0.0, 'chi-square_blur_level_statistic': 7048.087773024402, 'chi-square_blur_level_interpretation': 'does_not_follow_distribution', '_metadata': '{"bins": 20, "distribution": "normal", "metrics_computed": ["chi-square"], "total_samples": 22753, "columns_analyzed": ["blur_level"], "ks_sampling_enabled": false, "note": "KS test uses random sampling approximation for scalability"}'}
Chi-Square Test: p-value = 0.0
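To make the chi-square result easier to read, here is a minimal sketch of a binned goodness-of-fit statistic against a normal distribution fitted to the sample. It assumes equal-width bins between the sample minimum and maximum, with the normal tails folded into the first and last bins; the function name `chi_square_vs_normal` and the synthetic samples are hypothetical, and the library's binning and expected-count rules may differ. Only the statistic is computed; turning it into a p-value requires the chi-square distribution (for example `scipy.stats.chi2.sf`).

```python
import random
from statistics import NormalDist, fmean, pstdev


def chi_square_vs_normal(values, bins=20):
    """Chi-square statistic of an equal-width histogram against a fitted normal."""
    model = NormalDist(fmean(values), pstdev(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins

    # Observed counts per bin; the sample maximum is placed in the last bin.
    observed = [0] * bins
    for v in values:
        observed[min(int((v - lo) / width), bins - 1)] += 1

    # Expected counts from the normal CDF; the outer bins absorb the tails.
    n = len(values)
    statistic = 0.0
    for i in range(bins):
        left = 0.0 if i == 0 else model.cdf(lo + i * width)
        right = 1.0 if i == bins - 1 else model.cdf(lo + (i + 1) * width)
        expected = n * (right - left)
        statistic += (observed[i] - expected) ** 2 / expected
    return statistic


rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(5000)]
uniform_sample = [rng.uniform(0.0, 1.0) for _ in range(5000)]
stat_normal = chi_square_vs_normal(normal_sample)
stat_uniform = chi_square_vs_normal(uniform_sample)
print(stat_normal)   # comparatively small: the sample is normal
print(stat_uniform)  # much larger: a flat histogram does not fit a bell curve
```

A very large statistic, like the one reported for blur_level, drives the p-value to 0.0 at printed precision. A production version would also merge bins whose expected count is very small.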
In [11]:
# Using the method kolmogorov
ks_pvalue = analyzer.kolmogorov(mean,std)
print(f"Kolmogorov-Smirnov Test: p-value = {ks_pvalue}")
[12-17 13:09:14] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/representativeness/metric.py:221} INFO - KstestResult(statistic=np.float64(0.08743767587843615), pvalue=np.float64(8.401273781579098e-152), statistic_location=np.float64(1038.29640591), statistic_sign=np.int8(1))
[12-17 13:09:14] {/home/jovyan/Maturation/env-dqm-ml/lib/python3.13/site-packages/dqm/representativeness/metric.py:224} INFO - p-value = 8.401273781579098e-152 < 0.05 : The data is not followingthe normal distribution
Kolmogorov-Smirnov Test: p-value = 8.401273781579098e-152
In [12]:
# Kolmogorov-Smirnov test with the V2 API
metric = RepresentativenessProcessor(config={"metrics": ["kolmogorov-smirnov"], "input_columns": ["blur_level"], "bins": 20, "distribution": 'normal'})
metrics_values = runner.run(data, [metric])
print(metrics_values)
print(f"Kolmogorov-Smirnov Test: p-value = {metrics_values['kolmogorov-smirnov_blur_level_p_value']}")
{'kolmogorov-smirnov_blur_level_p_value': 0.009753347289739563, 'kolmogorov-smirnov_blur_level_statistic': 0.0726015117572063, 'kolmogorov-smirnov_blur_level_interpretation': 'does_not_follow_distribution', 'kolmogorov-smirnov_blur_level_sample_size': 500, 'kolmogorov-smirnov_blur_level_note': 'approximated_from_random_samples', '_metadata': '{"bins": 20, "distribution": "normal", "metrics_computed": ["kolmogorov-smirnov"], "total_samples": 22753, "columns_analyzed": ["blur_level"], "ks_sampling_enabled": true, "note": "KS test uses random sampling approximation for scalability"}'}
Kolmogorov-Smirnov Test: p-value = 0.009753347289739563
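The V2 output reports a sample size of 500 and the note 'approximated_from_random_samples', which is consistent with its p-value differing from the V1 value computed on all 22753 rows. The sketch below illustrates the effect on synthetic data: it computes a one-sample Kolmogorov-Smirnov statistic against a fitted normal on a full sample and on two random sub-samples of 500 points. The function name `ks_statistic`, the seed, and the synthetic data are assumptions for illustration, not the library's implementation; a p-value would additionally require the Kolmogorov distribution, as provided by `scipy.stats.kstest`.

```python
import random
from statistics import NormalDist, fmean, pstdev


def ks_statistic(values, model):
    """One-sample Kolmogorov-Smirnov statistic D of `values` against `model`."""
    ordered = sorted(values)
    n = len(ordered)
    d = 0.0
    for i, v in enumerate(ordered):
        cdf = model.cdf(v)
        # Compare the model CDF with the empirical CDF just before and at v.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d


rng = random.Random(0)
# Synthetic, clearly non-normal data standing in for a real column.
population = [rng.expovariate(1.0) for _ in range(20000)]
model = NormalDist(fmean(population), pstdev(population))

d_full = ks_statistic(population, model)
d_first = ks_statistic(rng.sample(population, 500), model)
d_second = ks_statistic(rng.sample(population, 500), model)
print(d_full, d_first, d_second)
```

The statistic itself stays in the same range across sub-samples, but for a fixed statistic the p-value shrinks rapidly as the sample grows, so a test on 500 rows yields a far less extreme p-value than one on the full dataset, and it changes from run to run unless the sampling is seeded.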
In [13]:
# Using the method shannon_entropy
entropy = analyzer.shannon_entropy()
print(f"Shannon Entropy: {entropy}")
Shannon Entropy: 2.9953301802692853
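For orientation, the sketch below computes the Shannon entropy of a binned variable. It assumes equal-width bins and natural logarithms; the function name `shannon_entropy` and the synthetic samples are hypothetical, and the library may discretize differently. Under these assumptions the entropy of a 20-bin histogram is bounded above by log(20), about 2.9957, which is reached when all bins are equally populated.

```python
import math
import random


def shannon_entropy(values, bins=20):
    """Entropy (in nats) of the equal-width histogram of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    # Empty bins contribute nothing, since p * log(p) tends to 0 as p tends to 0.
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


rng = random.Random(0)
spread_out = [rng.uniform(0.0, 1.0) for _ in range(10000)]
concentrated = [0.0] * 9999 + [1.0]
h_spread = shannon_entropy(spread_out)
h_concentrated = shannon_entropy(concentrated)
print(h_spread)        # close to the upper bound
print(h_concentrated)  # close to 0: almost everything falls in one bin
print(math.log(20))    # upper bound for 20 bins
```

Values near the bound indicate that the bins are populated almost evenly, while values near zero indicate that nearly all observations fall into a single bin.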
In [14]:
# Shannon entropy with the V2 API (the result is an entropy value, not a p-value)
metric = RepresentativenessProcessor(config={"metrics": ["shannon-entropy"], "input_columns": ["blur_level"], "bins": 20, "distribution": 'normal'})
metrics_values = runner.run(data, [metric])
print(metrics_values)
print(f"Shannon Entropy = {metrics_values['shannon-entropy_blur_level_entropy']}")
{'shannon-entropy_blur_level_entropy': 0.0, 'shannon-entropy_blur_level_interpretation': 'low_diversity', '_metadata': '{"bins": 20, "distribution": "normal", "metrics_computed": ["shannon-entropy"], "total_samples": 22753, "columns_analyzed": ["blur_level"], "ks_sampling_enabled": false, "note": "KS test uses random sampling approximation for scalability"}'}
Shannon Entropy = 0.0
In [15]:
# Using the method grte
grte_result, intervals_discretized = analyzer.grte()
print(f"GRTE: {grte_result}")
#print(f"GRTE discretized intervals are \n = {intervals_discretized}") # uncomment to display data discretization
GRTE: 0.7474311921010316
In [16]:
# GRTE with the V2 API
metric = RepresentativenessProcessor(config={"metrics": ["grte"], "input_columns": ["blur_level"], "bins": 20, "distribution": 'normal'})
metrics_values = runner.run(data, [metric])
print(metrics_values)
print(f"GRTE = {metrics_values['grte_blur_level_grte_value']}")
{'grte_blur_level_grte_value': 0.003347168773970686, 'grte_blur_level_interpretation': 'low_representativeness', '_metadata': '{"bins": 20, "distribution": "normal", "metrics_computed": ["grte"], "total_samples": 22753, "columns_analyzed": ["blur_level"], "ks_sampling_enabled": false, "note": "KS test uses random sampling approximation for scalability"}'}
GRTE = 0.003347168773970686
We can compute all metrics at once
In [17]:
# No "metrics" key in the config: all four metrics are computed, as the output shows
metric = RepresentativenessProcessor(config={"input_columns": ["blur_level"], "bins": 20, "distribution": 'normal'})
metrics_values = runner.run(data, [metric])
print(metrics_values)
{'chi-square_blur_level_p_value': 0.0, 'chi-square_blur_level_statistic': 6238.9516482113795, 'chi-square_blur_level_interpretation': 'does_not_follow_distribution', 'kolmogorov-smirnov_blur_level_p_value': 7.651794192324447e-05, 'kolmogorov-smirnov_blur_level_statistic': 0.10042789800275398, 'kolmogorov-smirnov_blur_level_interpretation': 'does_not_follow_distribution', 'kolmogorov-smirnov_blur_level_sample_size': 500, 'kolmogorov-smirnov_blur_level_note': 'approximated_from_random_samples', 'shannon-entropy_blur_level_entropy': 2.9950324819300826, 'shannon-entropy_blur_level_interpretation': 'high_diversity', 'grte_blur_level_grte_value': 0.7479459737054072, 'grte_blur_level_interpretation': 'high_representativeness', '_metadata': '{"bins": 20, "distribution": "normal", "metrics_computed": ["chi-square", "grte", "kolmogorov-smirnov", "shannon-entropy"], "total_samples": 22753, "columns_analyzed": ["blur_level"], "ks_sampling_enabled": true, "note": "KS test uses random sampling approximation for scalability"}'}
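The combined result is a flat dictionary whose '_metadata' entry is a JSON string, so it can be awkward to read. The sketch below shows one possible way to decode the metadata and regroup the values by column and metric. The key pattern `<metric>_<column>_<field>` is inferred from the printed keys rather than taken from documentation, the helper `group_results` is hypothetical, and the dictionary is a shortened excerpt of the output above.

```python
import json

# Excerpt of the dictionary printed above (values copied from that output).
metrics_values = {
    "chi-square_blur_level_p_value": 0.0,
    "kolmogorov-smirnov_blur_level_p_value": 7.651794192324447e-05,
    "kolmogorov-smirnov_blur_level_sample_size": 500,
    "shannon-entropy_blur_level_entropy": 2.9950324819300826,
    "grte_blur_level_grte_value": 0.7479459737054072,
    "_metadata": '{"bins": 20, "distribution": "normal", '
                 '"metrics_computed": ["chi-square", "grte", '
                 '"kolmogorov-smirnov", "shannon-entropy"], '
                 '"total_samples": 22753, "columns_analyzed": ["blur_level"], '
                 '"ks_sampling_enabled": true}',
}


def group_results(flat):
    """Split flat '<metric>_<column>_<field>' keys into nested dictionaries."""
    metadata = json.loads(flat["_metadata"])
    grouped = {}
    for metric in metadata["metrics_computed"]:
        for column in metadata["columns_analyzed"]:
            prefix = f"{metric}_{column}_"
            fields = {
                key[len(prefix):]: value
                for key, value in flat.items()
                if key.startswith(prefix)
            }
            grouped.setdefault(column, {})[metric] = fields
    return metadata, grouped


metadata, grouped = group_results(metrics_values)
print(metadata["total_samples"])
print(grouped["blur_level"]["kolmogorov-smirnov"])
```

Driving the grouping from `metrics_computed` and `columns_analyzed` avoids guessing where the metric name ends and the column name begins, since both can contain underscores or hyphens.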