Experimental Test Data (Blocking)

   


Experiment 1: Overall Comparison of Adaptation Strategies


 

Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
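Note on reading these figures: the listed cluster size distribution reproduces the stated "Number of Duplicates" exactly if that number is read as the count of duplicate pairs, i.e. every cluster of size n contributes n(n-1)/2 pairs. The page does not say this explicitly, so the interpretation is an inference from the numbers. A minimal Python sketch of the check follows; the variable and function names are illustrative and not part of the datasets.

```python
# Cluster size distribution as listed above: (cluster size, frequency).
distribution = [
    (2, 1560), (3, 232), (4, 72), (5, 31), (6, 15), (7, 11), (8, 9),
    (9, 5), (10, 3), (11, 2), (12, 1), (13, 1), (15, 1),
]


def duplicate_pairs(dist):
    """Count record pairs that fall into the same cluster.

    A cluster of size n contains n * (n - 1) / 2 unordered pairs.
    """
    return sum(freq * size * (size - 1) // 2 for size, freq in dist)


# Number of duplicate clusters and number of x-tuples inside them.
clusters = sum(freq for _, freq in distribution)
records_in_clusters = sum(size * freq for size, freq in distribution)
pairs = duplicate_pairs(distribution)

print(f"clusters: {clusters}")
print(f"x-tuples in clusters: {records_in_clusters}")
print(f"duplicate pairs: {pairs}")  # 4380, matching "Number of Duplicates"
```

Under this reading, the 4,380 duplicates are spread over 1,943 clusters containing 4,635 x-tuples in total.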
 
 

 

 

Experiment 2: Robustness against a varying Key Design


 

Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 

 

 

Experiment 3: Robustness against a varying Setting of Blocking Parameters


 

Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 

 

 

Experiment 4: Robustness against a varying Dirtiness of the Source Data


 

Movie Databases (HSQL)
DSH1 [zip]
DSH2 [zip]
DSH3 [zip]
DSH4 [zip]
DSH5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.72
 
 
Movie Databases (HSQL)
DSF1 [zip]
DSF2 [zip]
DSF3 [zip]
DSF4 [zip]
DSF5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.77
 
 
Movie Databases (HSQL)
DSE1 [zip]
DSE2 [zip]
DSE3 [zip]
DSE4 [zip]
DSE5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.81
 
 
Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSA1 [zip]
DSA2 [zip]
DSA3 [zip]
DSA4 [zip]
DSA5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.9
 
 
Movie Databases (HSQL)
DSI1 [zip]
DSI2 [zip]
DSI3 [zip]
DSI4 [zip]
DSI5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.933
 
 

 

 

Experiment 5: Robustness against a varying Data Uncertainty


 

Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_15A [zip]
DSC2_15A [zip]
DSC3_15A [zip]
DSC4_15A [zip]
DSC5_15A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 812,272
Maximal Number of Alternatives per X-Tuple: 15
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_20A [zip]
DSC2_20A [zip]
DSC3_20A [zip]
DSC4_20A [zip]
DSC5_20A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 1,057,191
Maximal Number of Alternatives per X-Tuple: 20
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_25A [zip]
DSC2_25A [zip]
DSC3_25A [zip]
DSC4_25A [zip]
DSC5_25A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 1,301,207
Maximal Number of Alternatives per X-Tuple: 25
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
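The four dataset groups in this experiment differ only in their data uncertainty, that is, in the cap on alternatives per x-tuple and the resulting total number of alternatives. One way to compare them is the average number of alternatives per x-tuple, which follows directly from the figures listed above. The Python sketch below does that arithmetic; the dictionary layout and names are illustrative assumptions, not part of the datasets.

```python
# Number of x-tuples is identical in all four dataset groups.
x_tuples = 102_692

# Maximal number of alternatives per x-tuple -> total number of alternatives,
# as listed for DSC*, DSC*_15A, DSC*_20A and DSC*_25A.
total_alternatives = {
    10: 561_025,
    15: 812_272,
    20: 1_057_191,
    25: 1_301_207,
}

# Average number of alternatives per x-tuple for each setting.
average_alternatives = {
    cap: total / x_tuples for cap, total in total_alternatives.items()
}

for cap, mean in average_alternatives.items():
    print(f"max {cap:>2} alternatives: {mean:.2f} per x-tuple on average")
```

The average grows from roughly 5.5 alternatives per x-tuple at a cap of 10 to roughly 12.7 at a cap of 25, while the number of x-tuples and duplicates stays fixed.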
 
 

 

 

Experiment 6: Uncertain Keys First


 

Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_15A [zip]
DSC2_15A [zip]
DSC3_15A [zip]
DSC4_15A [zip]
DSC5_15A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 812,272
Maximal Number of Alternatives per X-Tuple: 15
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_20A [zip]
DSC2_20A [zip]
DSC3_20A [zip]
DSC4_20A [zip]
DSC5_20A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 1,057,191
Maximal Number of Alternatives per X-Tuple: 20
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSC1_25A [zip]
DSC2_25A [zip]
DSC3_25A [zip]
DSC4_25A [zip]
DSC5_25A [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 1,301,207
Maximal Number of Alternatives per X-Tuple: 25
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 

 

 

Experiment 7: Overall Comparison using Different Blocking Techniques


 

Movie Databases (HSQL)
DSH1 [zip]
DSH2 [zip]
DSH3 [zip]
DSH4 [zip]
DSH5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.72
 
 
Movie Databases (HSQL)
DSF1 [zip]
DSF2 [zip]
DSF3 [zip]
DSF4 [zip]
DSF5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.77
 
 
Movie Databases (HSQL)
DSE1 [zip]
DSE2 [zip]
DSE3 [zip]
DSE4 [zip]
DSE5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.81
 
 
Movie Databases generated with standard data setting (HSQL)
DSC1 [zip]
DSC2 [zip]
DSC3 [zip]
DSC4 [zip]
DSC5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.856
 
 
Movie Databases (HSQL)
DSA1 [zip]
DSA2 [zip]
DSA3 [zip]
DSA4 [zip]
DSA5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.9
 
 
Movie Databases (HSQL)
DSI1 [zip]
DSI2 [zip]
DSI3 [zip]
DSI4 [zip]
DSI5 [zip]
Characteristics:
Number of X-Tuples: 102,692
Total Number of Alternatives: 561,025
Maximal Number of Alternatives per X-Tuple: 10
Number of Duplicates: 4,380
Distribution of Cluster Sizes (cluster size, frequency): 2,1560; 3,232; 4,72; 5,31; 6,15; 7,11; 8,9; 9,5; 10,3; 11,2; 12,1; 13,1; 15,1
Average Similarity of True Duplicates (scored with Monge-Elkan distance): 0.933