Conditioning probabilistic databases

Christoph Koch, Dan Olteanu
2008 Proceedings of the VLDB Endowment  
Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases, however, often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus to transform a probabilistic database of priors into a posterior probabilistic database which is materialized for subsequent query processing or further refinement. It turns out that the problem is closely related to the problem of computing exact tuple confidence values. It is known that exact confidence computation is an NP-hard problem. This has led researchers to consider approximation techniques for confidence computation. However, neither conditioning nor exact confidence computation can be solved using such techniques. In this paper we present efficient techniques for both problems. We study several problem decomposition methods and heuristics that are based on the most successful search techniques from constraint satisfaction, such as the Davis-Putnam algorithm. We complement this with a thorough experimental evaluation of the algorithms proposed. Our experiments show that our exact algorithms scale well to realistic database sizes and can in some scenarios compete with the most efficient previous approximation algorithms.

R   SSN                       NAME
    { 1 (p=.2) | 7 (p=.8) }   John
    { 4 (p=.3) | 7 (p=.7) }   Bill

represents four possible worlds (shown in Figure 1), modelling that John has either SSN 1 or 7, with probability .2 and .8, respectively (the paper form may contain a hand-written symbol that can be read either as a European "1" or as an American "7"), and Bill has either SSN 4 or 7, with probability .3 and .7, respectively. We assume independence between John's and Bill's alternatives, thus the world in which John has SSN 1 and Bill has SSN 7 has probability .2 · .7 = .14. If A_x denotes the event that Bill has SSN x, then P(A_4) = .3 and P(A_7) = .7. We can compute these probabilities in a probabilistic database by asking for the confidence values of the tuples in the result of the query

    select SSN, conf(SSN) from R where NAME = 'Bill';

which will result in the table

Q   SSN   CONF
    4     .3
    7     .7

Now suppose we want to use the additional knowledge that social security numbers are unique. We can express this
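
The running example is small enough to check by brute force. The following Python sketch enumerates the four possible worlds of R, computes Bill's tuple confidences, and then conditions on the uniqueness of social security numbers by discarding the violating worlds and renormalizing. The names `possible_worlds`, `conf`, and `ssns_unique` are illustrative and do not come from the paper. The way the constraint is encoded is an assumption, since the source text breaks off before stating it. Exhaustive enumeration is exponential in the number of uncertain tuples and is not the decomposition-based technique the abstract refers to.

```python
from itertools import product

# Alternatives per person, copied from the example table R.
R = {
    "John": {1: 0.2, 7: 0.8},
    "Bill": {4: 0.3, 7: 0.7},
}

def possible_worlds(table):
    """Yield (world, probability) pairs, assuming independent tuples."""
    names = list(table)
    for choice in product(*(table[n].items() for n in names)):
        world = {n: ssn for n, (ssn, _) in zip(names, choice)}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

def conf(table, name, condition=lambda world: True):
    """Confidence of each SSN of `name`, restricted to worlds satisfying
    `condition` and renormalized by their total probability."""
    mass = 0.0
    totals = {}
    for world, prob in possible_worlds(table):
        if condition(world):
            mass += prob
            totals[world[name]] = totals.get(world[name], 0.0) + prob
    return {ssn: p / mass for ssn, p in totals.items()}

def ssns_unique(world):
    """Assumed encoding of the constraint: no two people share an SSN."""
    return len(set(world.values())) == len(world)

prior = conf(R, "Bill")                   # no evidence: reproduces table Q
posterior = conf(R, "Bill", ssns_unique)  # after conditioning on uniqueness
```

With no evidence, `prior` gives .3 and .7 as in table Q. Conditioning removes the world in which both John and Bill have SSN 7 (probability .8 · .7 = .56), leaving a total mass of .44, so `posterior` gives roughly .68 for SSN 4 and .32 for SSN 7.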
doi:10.14778/1453856.1453894 fatcat:aprnpguqh5dzxkfvvu4wu2utwu