Test data sets for duplicate detection

This site provides several test data sets for duplicate detection which were generated using the data pollution tool DaPo. If you use one of these data sets in your research, please reference the corresponding paper.

Scenario 1: Integration of five duplicate-free music databases

Identifier | File | Size | Information
MB-A01-020k | musicbrainz-20000-A01.csv.zip | 0.98 MB | 10,000 real-world entities, 19,375 tuples
MB-A01-200k | musicbrainz-200000-A01.csv.zip | 10 MB | 100,000 real-world entities, 193,750 tuples
MB-A01-002m | musicbrainz-2000000-A01.csv.zip | 101 MB | 1,000,000 real-world entities, 1,937,500 tuples
MB-A01-020m | musicbrainz-20000000-A01.csv.zip | 1 GB | 10,000,000 real-world entities, 19,375,000 tuples

In this scenario, the input data originates from the freely available MusicBrainz database. We divided every input data set into five logical sources by randomly assigning a SourceID to each tuple. Since the five data sources are assumed to be duplicate-free, a duplicate cluster can contain at most one tuple per source, so the size of the generated duplicate clusters ranges from 1 to 5. The cluster sizes are distributed as follows:

Cluster Size | Proportion
1 | 50%
2 | 25%
3 | 12.5%
4 | 6.25%
5 | 6.25%
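
The tuple counts listed for each file are consistent with this distribution. As a quick check, the following sketch computes the mean cluster size and multiplies it by the number of real-world entities. It assumes that every real-world entity corresponds to exactly one duplicate cluster; the names `CLUSTER_SIZE_PROPORTIONS` and `expected_tuples` are illustrative only and not part of DaPo.

```python
from fractions import Fraction

# Cluster-size distribution as stated above (size -> proportion of clusters).
CLUSTER_SIZE_PROPORTIONS = {
    1: Fraction(1, 2),    # 50%
    2: Fraction(1, 4),    # 25%
    3: Fraction(1, 8),    # 12.5%
    4: Fraction(1, 16),   # 6.25%
    5: Fraction(1, 16),   # 6.25%
}


def expected_tuples(num_entities: int) -> Fraction:
    """Expected tuple count, assuming one duplicate cluster per entity."""
    mean_cluster_size = sum(
        size * share for size, share in CLUSTER_SIZE_PROPORTIONS.items()
    )
    return num_entities * mean_cluster_size


# The proportions must cover all clusters.
assert sum(CLUSTER_SIZE_PROPORTIONS.values()) == 1

for entities in (10_000, 100_000, 1_000_000, 10_000_000):
    print(entities, expected_tuples(entities))
```

The mean cluster size is 31/16 = 1.9375, so 10,000 entities yield 19,375 tuples, which matches the smallest data set; the larger data sets scale by the same factor.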

Every tuple of the given test data sets (i.e., every line of the CSV files) describes a single audio recording. The schema of these data sets consists of the following 12 attributes:

Attribute | Description
TID | Unique tuple identifier.
CID | Unique cluster identifier which indicates the duplicate cluster a tuple belongs to (i.e., two tuples having the same CID are duplicates). This ID describes the gold standard which is required to evaluate the quality of a duplicate detection result.
CTID | Cluster-internal identifier. It counts from 1 to the size of the corresponding duplicate cluster.
SourceID | Unique source identifier (i.e., two tuples having the same SourceID are assumed to originate from the same data source).
id | Original tuple identifier of the input data set. To prevent duplicate relationships from being inferred from it, we modified the value of this attribute for every tuple based on the source it belongs to.
number | Track number of the recording on its corresponding album.
title | Title of the recording.
length | Length of the recording.
artist | Artist of the recording, typically a band or musician.
album | Name of the recording's corresponding album.
year | Year of the recording.
language | Language of the recording.
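
To illustrate how the CID attribute can serve as a gold standard, the sketch below derives all duplicate pairs from the CID values and compares them with the pairs reported by a detection method. Pairwise precision and recall are used here as one common choice of quality measure; this page does not prescribe a specific measure. The rows are hypothetical in-memory examples (not taken from the data sets), and the function names `gold_pairs` and `precision_recall` are illustrative. Reading the actual CSV files is left out because it depends on the delimiter and quoting used in the files.

```python
from collections import defaultdict
from itertools import combinations


def gold_pairs(rows):
    """Return the set of duplicate TID pairs implied by equal CID values."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["CID"]].append(row["TID"])
    pairs = set()
    for tids in clusters.values():
        # Every unordered pair of tuples within one cluster is a duplicate pair.
        pairs.update(frozenset(pair) for pair in combinations(tids, 2))
    return pairs


def precision_recall(predicted_pairs, rows):
    """Pairwise precision and recall of predicted pairs against the gold standard."""
    gold = gold_pairs(rows)
    predicted = {frozenset(pair) for pair in predicted_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical rows: one cluster of size 3 and one singleton cluster.
rows = [
    {"TID": 1, "CID": 10, "CTID": 1, "SourceID": 1},
    {"TID": 2, "CID": 10, "CTID": 2, "SourceID": 3},
    {"TID": 3, "CID": 10, "CTID": 3, "SourceID": 5},
    {"TID": 4, "CID": 11, "CTID": 1, "SourceID": 2},
]

# Hypothetical detection result: one correct pair and one false pair.
predicted = [(1, 2), (2, 4)]
precision, recall = precision_recall(predicted, rows)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example the gold standard contains three pairs, one of which is found, so the precision is 0.5 and the recall is 1/3.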