ProbDataGen is a Java application for generating duplicate-polluted probabilistic data from duplicate-free certain data.
The certain source data is obtained from the online movie database IMDb
with the Java application JMDb and stored in an HSQL database.
The resultant probabilistic data items are modeled as x-tuples according to the ULDB model.
Errors in multiple x-tuple alternatives and errors in duplicate x-tuples are introduced artificially.
ProbDataGen lets the user choose among several HSQL databases holding certain movie data and generate a probabilistic movie database with
duplicates from the selected one. Several settings can be adjusted, e.g. the number of duplicates or the maximum number of alternatives per tuple.
The generated probabilistic data are then stored in such a way that they can be loaded into our x-tuple data model for further processing:
In addition to the actual movie attribute values, every HSQLDB movie tuple has an x-tuple ID, an alternative ID and a confidence value.
So every HSQLDB tuple represents an x-tuple alternative.
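As an illustration of this storage layout, the following Java sketch models one stored row as a single x-tuple alternative. The class, record and field names are assumptions made for this example and are not taken from ProbDataGen's actual schema; the attribute set (title, production year, director, studio) follows the attributes named in this description. The helper `isMaybe` relies on the ULDB convention that an x-tuple whose alternative confidences sum to less than 1 is a maybe tuple.

```java
import java.util.List;

public class XTupleSketch {

    /** One stored row: a single alternative of an x-tuple (all names are assumed). */
    public record Alternative(int xTupleId, int alternativeId, double confidence,
                              String title, int year, String director, String studio) {}

    /** Sums the confidences of all alternatives that belong to one x-tuple. */
    public static double totalConfidence(List<Alternative> alternatives) {
        return alternatives.stream().mapToDouble(Alternative::confidence).sum();
    }

    /** An x-tuple is a maybe tuple if its confidences sum to less than 1. */
    public static boolean isMaybe(List<Alternative> alternatives) {
        return totalConfidence(alternatives) < 1.0 - 1e-9;
    }

    public static void main(String[] args) {
        // Two alternatives sharing x-tuple ID 1; the pair (xTupleId, alternativeId) is unique.
        List<Alternative> xTuple = List.of(
            new Alternative(1, 1, 0.6, "Movie", 1999, "Director A", "Studio X"),
            new Alternative(1, 2, 0.3, "Novie", 1999, "Director A", "Studio X"));
        System.out.println("maybe tuple: " + isMaybe(xTuple));
    }
}
```

Grouping rows by the x-tuple ID would recover the x-tuples from a flat table of this shape.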
Figure 1 illustrates the probabilistic data that were generated from certain data with ProbDataGen.
Figure 1: Sample of probabilistic data generation
The process of generating a probabilistic database is performed in the following steps:
- Create a new database of a given size: First, a new
database is created and filled with the desired number of randomly
chosen (certain) movie tuples.
- Generate x-tuple IDs: Every x-tuple is assigned a unique
x-tuple ID.
- Generate duplicates: Now, some movie tuples are
duplicated, so that they appear twice in the movie relation. The
duplicates are then assigned x-tuple IDs. The database also records
which pairs are duplicates.
- Generate confidence values: In this step, a
confidence value is assigned to every tuple in the database. The
confidence values are generated randomly to some degree, but they are
also influenced by several parameters. For example, the user can define
the probability of generating a confidence value smaller than 1, i.e.
the probability of turning a tuple into a maybe tuple.
- Add alternatives: At this point, some
alternatives are added to the tuples, so that there actually are
x-tuples with more than one alternative. Adding alternatives to
an x-tuple means duplicating its (only) alternative several
times and then distributing the tuple's confidence among all
alternatives. Since all alternatives of that tuple are identical except
for the confidence value, their alternative IDs are modified in such a
way that the alternatives are enumerated from 1 to the number of the
x-tuple's alternatives. This guarantees that the combination of
x-tuple ID and alternative ID is unique for every x-tuple alternative.
The user can define many parameters for this step as well, e.g. the
minimum alternative confidence or the minimum and maximum number of
alternatives generated for a tuple.
- Add errors: Finally, some errors are added to the
x-tuples. Whether an error is added to a given attribute value, and of
what kind and how serious it is, is decided randomly according to
several user-defined parameters, but always following two rules. The
first rule is that the chance of generating an error is greater for
alternatives with small confidence values. The second rule is that the
alternatives of a tuple have to differ in some way once the errors have
been added, since x-tuple alternatives are mutually exclusive, i.e.
they must not be identical.
Errors in the production year are simply wrong numbers.
Strings, i.e. the title, director and studio, are affected by typo-like
errors, for example missing or transposed neighbouring characters and
misspellings such as 'novie' instead of 'movie'. The director and
studio attribute values can even be exchanged with other values to
simulate data that is not merely erroneous but downright wrong.
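The last two steps can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not ProbDataGen's actual code: all class, method and parameter names are hypothetical, the random weighting used to distribute the confidence is one possible choice, and the formula `baseRate * (1 - confidence)` is only an assumed way of making errors more likely for alternatives with small confidence values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;

public class AlternativeErrorSketch {

    /** Distributes a tuple's confidence among n alternatives, each getting at least minConf. */
    public static double[] distribute(double total, int n, double minConf, Random rnd) {
        if (n < 1 || n * minConf > total) {
            throw new IllegalArgumentException("cannot give every alternative the minimum confidence");
        }
        double[] weights = new double[n];
        double weightSum = 0.0;
        for (int i = 0; i < n; i++) {
            weights[i] = 1.0 - rnd.nextDouble(); // lies in (0, 1], so the sum is never zero
            weightSum += weights[i];
        }
        double rest = total - n * minConf;       // what remains after the guaranteed minimum
        double[] parts = new double[n];
        for (int i = 0; i < n; i++) {
            parts[i] = minConf + rest * weights[i] / weightSum;
        }
        return parts;
    }

    /** Assumed rule: the smaller the confidence, the larger the chance of an error. */
    public static double errorProbability(double baseRate, double confidence) {
        return Math.max(0.0, Math.min(1.0, baseRate * (1.0 - confidence)));
    }

    /** Typo-like error: swaps the characters at positions i and i + 1. */
    public static String transpose(String s, int i) {
        char[] c = s.toCharArray();
        char tmp = c[i];
        c[i] = c[i + 1];
        c[i + 1] = tmp;
        return new String(c);
    }

    /** Typo-like error: removes the character at position i. */
    public static String dropChar(String s, int i) {
        return s.substring(0, i) + s.substring(i + 1);
    }

    /** Second rule: the values of a tuple's alternatives must not be identical. */
    public static boolean allDistinct(List<String> values) {
        return new HashSet<>(values).size() == values.size();
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[] conf = distribute(0.9, 3, 0.1, rnd);
        System.out.println(Arrays.toString(conf));

        // Decide randomly whether the first alternative's title receives a typo.
        String title = "movie";
        String noisy = rnd.nextDouble() < errorProbability(0.8, conf[0])
                ? transpose(title, 0) : title;
        System.out.println(noisy);

        System.out.println(transpose("movie", 0)); // prints "omvie"
        System.out.println(dropChar("movie", 2));  // prints "moie"
        System.out.println(allDistinct(List.of("movie", "omvie", "moie"))); // prints "true"
    }
}
```

A generator along these lines would repeat the error step for a tuple until `allDistinct` holds for its alternatives, so that the second rule is always satisfied.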