Test data sets for duplicate detection

This site provides several test data sets for duplicate detection, all generated with the data pollution tool DaPo. If you use one of these data sets in your research, please cite the corresponding paper.

Scenario 1: Integration of five duplicate-free music databases

Identifier	File	Size	Information
MB-A01-020k	musicbrainz-20000-A01.csv.zip	0.98 MB	(10,000 real-world entities, 19,375 tuples)
MB-A01-200k	musicbrainz-200000-A01.csv.zip	10 MB	(100,000 real-world entities, 193,750 tuples)
MB-A01-002m	musicbrainz-2000000-A01.csv.zip	101 MB	(1,000,000 real-world entities, 1,937,500 tuples)
MB-A01-020m	musicbrainz-20000000-A01.csv.zip	1 GB	(10,000,000 real-world entities, 19,375,000 tuples)

In this scenario, the input data originates from the freely available MusicBrainz database. We divided every input data set into five logical sources by randomly assigning a SourceID to each tuple. Since the five data sources are assumed to be duplicate-free, each source contributes at most one tuple per real-world entity, so the size of the generated duplicate clusters ranges from 1 to 5. The cluster sizes are distributed as follows:

Cluster Size	Proportion
1	50%
2	25%
3	12.5%
4	6.25%
5	6.25%
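
The distribution above can be checked against the tuple counts listed for each file. The following is a minimal sketch, assuming the proportions hold exactly for every data set; the names `CLUSTER_SIZE_PROPORTIONS`, `expected_tuples`, and `expected_duplicate_pairs` are illustrative and not part of DaPo. The number of duplicate pairs is derived from the distribution here and is not stated on this page.

```python
from fractions import Fraction
from math import comb

# Share of real-world entities per duplicate cluster size (from the table above).
CLUSTER_SIZE_PROPORTIONS = {
    1: Fraction(1, 2),    # 50%
    2: Fraction(1, 4),    # 25%
    3: Fraction(1, 8),    # 12.5%
    4: Fraction(1, 16),   # 6.25%
    5: Fraction(1, 16),   # 6.25%
}


def expected_tuples(num_entities):
    """Expected number of tuples: every cluster of size k contributes k tuples."""
    per_entity = sum(size * share for size, share in CLUSTER_SIZE_PROPORTIONS.items())
    return num_entities * per_entity


def expected_duplicate_pairs(num_entities):
    """Expected number of true duplicate pairs: a cluster of size k has C(k, 2) pairs."""
    per_entity = sum(comb(size, 2) * share for size, share in CLUSTER_SIZE_PROPORTIONS.items())
    return num_entities * per_entity


if __name__ == "__main__":
    for entities in (10_000, 100_000, 1_000_000, 10_000_000):
        print(entities, expected_tuples(entities), expected_duplicate_pairs(entities))
```

With these proportions each entity yields 1.9375 tuples on average, which reproduces the tuple counts in the file table (for example, 19,375 tuples for 10,000 entities).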

Every tuple of the given test data sets (i.e., every line of the CSV files) describes one audio recording. The schema of these data sets consists of the following 12 attributes:

Attribute	Description
TID	This attribute is a unique tuple identifier.
CID	This attribute is a unique cluster identifier which indicates the duplicate cluster a tuple belongs to (i.e., two tuples having the same CID are duplicates). This identifier defines the gold standard which is required to evaluate the quality of a duplicate detection result.
CTID	This attribute is a cluster-internal identifier. It counts from 1 to the size of the corresponding duplicate cluster.
SourceID	This attribute is a unique source identifier (i.e. two tuples having the same SourceID are assumed to originate from the same data source).
id	This attribute is the original tuple identifier of the input data set. To prevent duplicate relationships from being inferred from it, we modified the value of this attribute for every tuple based on the source it belongs to.
number	This attribute is the track number of the recording on its corresponding album.
title	This attribute is the title of the recording.
length	This attribute is the length of the recording.
artist	This attribute is the artist of the recording which is typically a band or musician.
album	This attribute is the name of the recording's corresponding album.
year	This attribute is the year of recording.
language	This attribute is the language of the recording.
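
Because tuples with the same CID are duplicates, the CID column can be turned into a set of gold-standard pairs and compared with the pairs reported by a detection method. The following is a minimal sketch of such a pairwise evaluation; the function names `gold_pairs` and `pairwise_scores` are hypothetical, and the input is assumed to be an iterable of (TID, CID) values that has already been read from one of the CSV files.

```python
from collections import defaultdict
from itertools import combinations


def gold_pairs(rows):
    """Build the set of true duplicate pairs from (TID, CID) rows.

    Every unordered pair of tuples sharing a CID is a duplicate pair.
    Pairs are stored as sorted (TID, TID) tuples.
    """
    clusters = defaultdict(list)
    for tid, cid in rows:
        clusters[cid].append(tid)
    pairs = set()
    for tids in clusters.values():
        pairs.update(combinations(sorted(tids), 2))
    return pairs


def pairwise_scores(predicted, gold):
    """Return (precision, recall, F1) of predicted pairs against the gold pairs."""
    predicted = {tuple(sorted(pair)) for pair in predicted}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy input: three clusters of sizes 3, 1 and 2 (values are made up).
    rows = [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "c"), (6, "c")]
    gold = gold_pairs(rows)
    predicted = [(2, 1), (3, 4), (5, 6)]  # output of some hypothetical detector
    print(pairwise_scores(predicted, gold))
```

Holding all gold pairs in memory is convenient for the smaller data sets; for the largest file a streaming or per-cluster comparison may be the better choice.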