Test data sets for duplicate detection
This site provides several test data sets for duplicate detection, all generated using the data pollution tool DaPo.
If you use one of these data sets in your research, please cite the corresponding paper.
Scenario 1: Integration of five duplicate-free music databases
Identifier | File | Size | Information |
MB-A01-020k | musicbrainz-20000-A01.csv.zip | 0.98 MB | (10,000 real-world entities, 19,375 tuples) |
MB-A01-200k | musicbrainz-200000-A01.csv.zip | 10 MB | (100,000 real-world entities, 193,750 tuples) |
MB-A01-002m | musicbrainz-2000000-A01.csv.zip | 101 MB | (1,000,000 real-world entities, 1,937,500 tuples) |
MB-A01-020m | musicbrainz-20000000-A01.csv.zip | 1 GB | (10,000,000 real-world entities, 19,375,000 tuples) |
In this scenario, the input data originates from the freely available
MusicBrainz database.
We divided every input data set into five logical sources by randomly assigning a SourceID to each tuple.
Since the five data sources are assumed to be duplicate-free, the size of the generated duplicate clusters ranges from 1 to 5.
The cluster sizes are distributed as follows:
Cluster Size | Proportion |
1 | 50% |
2 | 25% |
3 | 12.5% |
4 | 6.25% |
5 | 6.25% |
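The tuple counts listed for each data set follow from these proportions: the mean cluster size is 1·0.5 + 2·0.25 + 3·0.125 + 4·0.0625 + 5·0.0625 = 1.9375 tuples per real-world entity. The following is a minimal sketch of that arithmetic. The function names are illustrative and not part of DaPo, and the expected number of duplicate pairs is a derived figure, not one stated on this page.

```python
from fractions import Fraction
from math import comb

# Cluster-size proportions as listed in the table above.
CLUSTER_SIZE_PROPORTIONS = {
    1: Fraction(1, 2),
    2: Fraction(1, 4),
    3: Fraction(1, 8),
    4: Fraction(1, 16),
    5: Fraction(1, 16),
}


def expected_tuples(num_entities: int) -> Fraction:
    """Expected tuple count if each entity's cluster size follows the table."""
    mean_size = sum(size * p for size, p in CLUSTER_SIZE_PROPORTIONS.items())
    return num_entities * mean_size


def expected_duplicate_pairs(num_entities: int) -> Fraction:
    """Expected number of true duplicate pairs (a cluster of size s has s*(s-1)/2)."""
    mean_pairs = sum(comb(size, 2) * p for size, p in CLUSTER_SIZE_PROPORTIONS.items())
    return num_entities * mean_pairs


# The proportions must cover all clusters.
assert sum(CLUSTER_SIZE_PROPORTIONS.values()) == 1

for entities in (10_000, 100_000, 1_000_000, 10_000_000):
    print(entities, int(expected_tuples(entities)), int(expected_duplicate_pairs(entities)))
```

For 10,000 entities this gives 19,375 tuples, which matches the figure listed for MB-A01-020k.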
Every tuple of these test data sets (i.e., each line of the CSV files) describes one audio recording.
The schema of these data sets consists of the following 12 attributes:
Attribute | Description |
TID | This attribute is a unique tuple identifier. |
CID | This attribute is a unique cluster identifier that indicates which duplicate cluster a tuple belongs to (i.e., two tuples having the same CID are duplicates). This ID describes the gold standard, which is required to evaluate the quality of a duplicate detection result. |
CTID | This attribute is a cluster-internal identifier. It counts from 1 to the size of the corresponding duplicate cluster. |
SourceID | This attribute is a unique source identifier (i.e., two tuples having the same SourceID are assumed to originate from the same data source). |
id | This attribute is the original tuple identifier of the input data set. To prevent conclusions about duplicate relationships from being drawn from it, we modified the value of this attribute for every tuple based on the source it belongs to. |
number | This attribute is the track number of the recording on its corresponding album. |
title | This attribute is the title of the recording. |
length | This attribute is the length of the recording. |
artist | This attribute is the artist of the recording, which is typically a band or musician. |
album | This attribute is the name of the recording's corresponding album. |
year | This attribute is the year of recording. |
language | This attribute is the language of recording. |
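Because CID encodes the gold standard, the true duplicate pairs can be derived by grouping tuples by CID and enumerating all pairs within each group. The sketch below shows one way to do this and to score a detection result by pairwise precision and recall. It assumes that the CSV files are comma-separated and start with a header row using the attribute names above, which this page does not confirm. The sample rows and the predicted pairs are made up for illustration, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations


def gold_pairs(rows):
    """Return the set of true duplicate pairs (sorted TID tuples) implied by CID."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["CID"]].append(row["TID"])
    pairs = set()
    for tids in clusters.values():
        # Every pair of tuples sharing a CID is a duplicate pair.
        pairs.update(combinations(sorted(tids), 2))
    return pairs


def precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted TID pairs against the gold pairs."""
    predicted = {tuple(sorted(pair)) for pair in predicted}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Made-up sample in the assumed layout (only a subset of the 12 attributes).
SAMPLE = """TID,CID,CTID,SourceID,title
1,10,1,1,Song A
2,10,2,3,Song A (live)
3,11,1,2,Song B
4,10,3,5,Sng A
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
gold = gold_pairs(rows)

# Hypothetical output of a duplicate detection run: one correct pair, one wrong.
predicted = [("1", "2"), ("3", "4")]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened CSV file. For the larger data sets, enumerating pairs per cluster stays cheap because no cluster has more than five tuples.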