ProbDataGen - A tool for generating probabilistic data with duplicates

   


ProbDataGen is a Java application for generating duplicate-polluted probabilistic data from duplicate-free certain data. The certain source data is obtained from the online movie database IMDb with the Java application JMDb and stored in an HSQL database. The resulting probabilistic data items are modeled as x-tuples according to the ULDB model. Errors are introduced artificially, both among the alternatives of a single x-tuple and between duplicate x-tuples.

With ProbDataGen, the user chooses one of several HSQL databases holding certain movie data and generates a probabilistic movie database with duplicates from it. Several settings can be adjusted, e.g. the number of duplicates or the maximal number of alternatives per tuple. The generated probabilistic data are stored in such a way that they can be loaded into our x-tuple data model for further processing: in addition to the actual movie attribute values, every HSQLDB movie tuple has an x-tuple ID, an alternative ID and a confidence value, so every HSQLDB tuple represents one x-tuple alternative. Figure 1 illustrates probabilistic data that were generated from certain data with ProbDataGen.

Figure 1: Sample of probabilistic data generation (some probabilistic movie tuples generated from certain data)
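
The storage layout described above (one stored tuple per x-tuple alternative, identified by x-tuple ID plus alternative ID) can be pictured with a minimal Java sketch. The type and field names (`MovieAlternative`, `XTupleModelSketch`, `totalConfidence`, `isMaybe`) are assumptions made for illustration and are not taken from ProbDataGen; the attribute set (title, production year, director, studio) follows the attributes named in this description. The sketch assumes a Java version with record support (16 or later).

```java
import java.util.List;

/** One stored row: a single alternative of an x-tuple (names are illustrative only). */
record MovieAlternative(int xTupleId, int alternativeId, double confidence,
                        String title, int year, String director, String studio) {}

final class XTupleModelSketch {
    /** Sums the confidences of all alternatives that belong to one x-tuple. */
    static double totalConfidence(List<MovieAlternative> rows, int xTupleId) {
        return rows.stream()
                   .filter(r -> r.xTupleId() == xTupleId)
                   .mapToDouble(MovieAlternative::confidence)
                   .sum();
    }

    /** An x-tuple whose total confidence is below 1 is a maybe x-tuple. */
    static boolean isMaybe(List<MovieAlternative> rows, int xTupleId) {
        return totalConfidence(rows, xTupleId) < 1.0 - 1e-9;
    }

    public static void main(String[] args) {
        List<MovieAlternative> rows = List.of(
            new MovieAlternative(1, 1, 0.5, "An Example Movie", 1999, "A. Director", "Studio A"),
            new MovieAlternative(1, 2, 0.25, "An Exmaple Movie", 1999, "A. Director", "Studio A"),
            new MovieAlternative(2, 1, 1.0, "Another Movie", 2001, "B. Director", "Studio B"));
        System.out.println("x-tuple 1 is a maybe tuple: " + isMaybe(rows, 1));
        System.out.println("x-tuple 2 is a maybe tuple: " + isMaybe(rows, 2));
    }
}
```

Grouping rows by x-tuple ID is all that is needed to rebuild the x-tuples from the flat relation, which is presumably why the pair of x-tuple ID and alternative ID has to be unique.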

The process of generating a probabilistic database is performed in the following steps:

  1. Create a new database of a given size: First, a new database is created and filled with the desired number of randomly chosen (certain) movie tuples.
  2. Generate x-tuple IDs: Every x-tuple is assigned a unique x-tuple ID.
  3. Generate duplicates: Now, some movie tuples are duplicated, so that they appear twice in the movie relation. The duplicates are then assigned x-tuple IDs. The database also records which pairs of tuples are duplicates.
  4. Generate confidence values: In this step, a confidence value is assigned to every tuple in the database. The confidence values are generated randomly to some degree, but they are also influenced by several parameters. For example, the user can define the probability of generating a confidence value smaller than 1, i.e. the probability of turning a tuple into a maybe tuple.
  5. Add alternatives: At this point, alternatives are added to some tuples, so that there actually are x-tuples with more than one alternative. Adding alternatives to an x-tuple means duplicating its (only) alternative several times and then distributing the tuple's confidence among all alternatives. Since all alternatives of that tuple are identical except for the confidence value, their alternative IDs are renumbered from 1 to the number of the x-tuple's alternatives. This guarantees that the combination of x-tuple ID and alternative ID is unique for every x-tuple alternative. The user can define many parameters for this step as well, e.g. the minimal alternative confidence or the minimal and maximal number of alternatives generated for a tuple.
  6. Add errors: Finally, some errors are added to the x-tuples. Whether an error is added to a particular attribute value, and of what kind and how serious it is, is decided randomly according to several user-defined parameters, but always following two rules. The first rule is that the chance of generating an error is greater for alternatives with small confidence values. The second rule is that the alternatives of a tuple have to differ in some way once the errors have been added, since x-tuple alternatives are mutually exclusive, i.e. they must not be identical.
    Errors that can occur in the production year are simply wrong numbers. Strings, i.e. the title, director and studio, are affected by typo-like errors, for example missing or transposed neighbouring characters and misspellings such as 'novie' instead of 'movie'. The director and studio attribute values are sometimes even exchanged with other values to simulate data that is not merely erroneous, but downright wrong.
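
Steps 5 and 6 can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not ProbDataGen's actual code: the class and method names (`AlternativeSketch`, `addAlternatives`, `errorProbability`, `typo`, `addErrors`), the random weighting used to split the confidence, the linear relation between confidence and error probability, and the parameters `minRate` and `maxRate` are all hypothetical. For brevity an alternative carries only a title, and "differ" is checked on the title alone.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class AlternativeSketch {
    /** One x-tuple alternative, reduced to a title for brevity (hypothetical type). */
    record Alt(int xTupleId, int alternativeId, double confidence, String title) {}

    /** Step 5: copy the only alternative n times and split its confidence among the copies. */
    static List<Alt> addAlternatives(Alt only, int n, Random rnd) {
        double[] weights = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            weights[i] = 0.1 + rnd.nextDouble(); // assumed weighting; keeps every share positive
            sum += weights[i];
        }
        List<Alt> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            double share = only.confidence() * weights[i] / sum;
            result.add(new Alt(only.xTupleId(), i + 1, share, only.title())); // IDs run 1..n
        }
        return result;
    }

    /** Step 6, rule 1: lower confidence gives a higher error probability (linear form assumed). */
    static double errorProbability(double confidence, double minRate, double maxRate) {
        return minRate + (maxRate - minRate) * (1.0 - confidence);
    }

    /** Typo-like error: transpose two neighbouring characters or drop one character. */
    static String typo(String s, Random rnd) {
        if (s.length() < 2) {
            return s;
        }
        int i = rnd.nextInt(s.length() - 1);
        if (rnd.nextBoolean()) {
            return s.substring(0, i) + s.charAt(i + 1) + s.charAt(i) + s.substring(i + 2);
        }
        return s.substring(0, i) + s.substring(i + 1);
    }

    /** Step 6: add errors, then enforce rule 2 (no two alternatives stay identical). */
    static List<Alt> addErrors(List<Alt> alts, double minRate, double maxRate, Random rnd) {
        Set<String> seen = new HashSet<>();
        List<Alt> result = new ArrayList<>();
        for (Alt a : alts) {
            String title = a.title();
            if (rnd.nextDouble() < errorProbability(a.confidence(), minRate, maxRate)) {
                title = typo(title, rnd);
            }
            for (int tries = 0; seen.contains(title) && tries < 100; tries++) {
                title = typo(a.title(), rnd); // retry until this alternative is distinct
            }
            if (!seen.add(title)) {
                throw new IllegalStateException("could not make alternatives distinct");
            }
            result.add(new Alt(a.xTupleId(), a.alternativeId(), a.confidence(), title));
        }
        return result;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Alt only = new Alt(7, 1, 0.9, "An Example Movie");
        for (Alt a : addErrors(addAlternatives(only, 3, rnd), 0.1, 0.9, rnd)) {
            System.out.println(a);
        }
    }
}
```

The retry loop is what implements the second rule: a random typo can leave a value unchanged (transposing two equal characters) or reproduce a value that another alternative already has, so distinctness has to be checked explicitly rather than assumed.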