Indexing Boolean expressions

Steven Euijong Whang, Hector Garcia-Molina, Chad Brower, Jayavel Shanmugasundaram, Sergei Vassilvitskii, Erik Vee, Ramana Yerneni
2009 Proceedings of the VLDB Endowment  
We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a high-dimensional multi-valued attribute space. The goal is to rapidly find the set of Boolean expressions that evaluate to true for a given assignment of values to attributes. A solution to this problem has applications in online advertising (where a Boolean expression represents an advertiser's user targeting requirements, and an assignment of values to
more » ... nt of values to attributes represents the characteristics of a user visiting an online page) and in general any publish/subscribe system (where a Boolean expression represents a subscription, and an assignment of values to attributes represents an event). All existing solutions that we are aware of can only index a specialized sub-set of conjunctive and/or disjunctive expressions, and cannot efficiently handle general DNF and CNF expressions (including NOTs) over multi-valued attributes. In this paper, we present a novel solution based on the inverted list data structure that enables us to index arbitrarily complex DNF and CNF Boolean expressions over multi-valued attributes. An interesting aspect of our solution is that, by virtue of leveraging inverted lists traditionally used for ranked information retrieval, we can efficiently return the top-N matching Boolean expressions. This capability enables emerging applications such as ranked publish/subscribe systems [16] , where only the top subscriptions that match an event are desired. For example, in online advertising there is a limit on the number of advertisements that can be shown on a given page and only the "best" advertisements can be displayed. We have evaluated our proposed technique based on data from an online advertising application, and the results show a dramatic performance improvement over prior techniques.
doi:10.14778/1687627.1687633 fatcat:3pkavmytgrcfhei4oei64gspd4