Private Searching on Streaming Data Based on Keyword Frequency
IEEE Transactions on Dependable and Secure Computing
Private searching on streaming data is the process of dispatching to a public server a program that searches streaming sources of data without revealing the search criteria and then sends back a buffer containing the findings. With an Abelian group homomorphic encryption scheme, the search criteria can be constructed only from simple combinations of keywords, for example, a disjunction of keywords. The recent breakthrough in fully homomorphic encryption has allowed us to construct arbitrary searching
criteria theoretically. In this paper, we consider a new private query, which searches for documents from streaming data on the basis of keyword frequency, such that the frequency of a keyword is required to be higher or lower than a given threshold. This form of query can help us find more relevant documents. Based on state-of-the-art fully homomorphic encryption techniques, we give disjunctive, conjunctive, and complement constructions for private threshold queries based on keyword frequency. Combining the basic constructions, we further present a generic construction for arbitrary private threshold queries based on keyword frequency. Our protocols are semantically secure as long as the underlying fully homomorphic encryption scheme is semantically secure.
Index Terms: Private searching on streaming data, fully homomorphic encryption, binary linear code
IEEE Transactions on Dependable and Secure Computing, vol. 11, no. 2, March/April 2014
X. Yi is with the College
3. Yi et al. proposed a solution to search for documents containing more than $t$ out of $n$ keywords, so-called $(t, n)$ threshold searching, without increasing the dictionary size. The solution is built on the state-of-the-art fully homomorphic encryption (FHE) technique, and the buffer keeps at most $m$ matching documents without collisions. Searching for documents containing one or more classified keywords, as in earlier solutions, can be achieved by $(1, n)$ threshold searching.
The existing solutions for private searching on streaming data have not considered keyword frequency, that is, the number of times a keyword is used in a document. Search engines like Google, Yahoo, and AltaVista display results based on secret algorithms. Although we do not know the equations, we believe that these are based mainly on keyword frequency and link popularity.
Our contributions.
In this paper, we consider a new private query, which searches for documents from streaming data based on keyword frequency, such that the number of times that a keyword appears in a matching document is required to be higher or lower than a given threshold. For example, find documents containing keywords $k_1, k_2, \ldots, k_n$ such that the frequency of the keyword $k_i$ ($i = 1, 2, \ldots, n$) in the document is higher (or lower) than $t_i$. We take the lower case into account because terms that appear too frequently are often not very useful, as they may not allow one to retrieve a small subset of documents from the streaming data. This form of query can help us find more relevant documents, but it cannot be implemented with traditional homomorphic encryption schemes. Based on FHE, we give disjunctive, conjunctive, and complement constructions for private threshold queries based on keyword frequency:
1) Our disjunctive construction allows us to search for documents satisfying a condition such as $(f(k_1) \ge t_1) \vee (f(k_2) \ge t_2) \vee \cdots \vee (f(k_n) \ge t_n)$, where $f(k_i)$ denotes the frequency of the keyword $k_i$ and $t_i$ is a given threshold;
2) Our conjunctive construction allows us to search for documents satisfying a condition such as $(f(k_1) \ge t_1) \wedge (f(k_2) \ge t_2) \wedge \cdots \wedge (f(k_n) \ge t_n)$;
3) We have two complement constructions. Our disjunctive complement construction allows us to search for documents satisfying a condition such as $(f(k_{i_1}) \ge t_{i_1}) \vee \cdots \vee (f(k_{i_{n_1}}) \ge t_{i_{n_1}}) \vee \neg(f(k_{j_1}) \ge t_{j_1}) \vee \cdots \vee \neg(f(k_{j_{n_2}}) \ge t_{j_{n_2}})$, i.e., $(f(k_{i_1}) \ge t_{i_1}) \vee \cdots \vee (f(k_{i_{n_1}}) \ge t_{i_{n_1}}) \vee (f(k_{j_1}) < t_{j_1}) \vee \cdots \vee (f(k_{j_{n_2}}) < t_{j_{n_2}})$, where $\neg$ stands for complement and $n_1 + n_2 = n$. Our conjunctive complement construction allows us to search for documents satisfying a condition such as $(f(k_{i_1}) \ge t_{i_1}) \wedge \cdots \wedge (f(k_{i_{n_1}}) \ge t_{i_{n_1}}) \wedge \neg(f(k_{j_1}) \ge t_{j_1}) \wedge \cdots \wedge \neg(f(k_{j_{n_2}}) \ge t_{j_{n_2}})$, i.e., $(f(k_{i_1}) \ge t_{i_1}) \wedge \cdots \wedge (f(k_{i_{n_1}}) \ge t_{i_{n_1}}) \wedge (f(k_{j_1}) < t_{j_1}) \wedge \cdots \wedge (f(k_{j_{n_2}}) < t_{j_{n_2}})$.
Furthermore, by combining the above basic constructions, we present a generic construction for arbitrary threshold queries based on keyword frequency. Like Yi et al.'s solution for the $(t, n)$ threshold query, our solutions encrypt the thresholds, compare them with the ciphertexts, and store a matching document in the buffer by constructing an encryption of an (L, ') linear code of the document. Unlike the $(t, n)$ threshold query solution, where only one threshold $t$ is encrypted and enclosed in the searching program, our solutions encrypt the frequency threshold for each keyword, because different keywords may have different frequency thresholds.
CONJUNCTIVE THRESHOLD QUERY BASED ON KEYWORD FREQUENCY
Formally, a conjunctive threshold query over keywords $K = \{k_1, k_2, \ldots, k_n\}$ can be expressed as
$$Q_K(M) = (f(k_1) \ge t_1) \wedge (f(k_2) \ge t_2) \wedge \cdots \wedge (f(k_n) \ge t_n),$$
where $f(k_i)$ ($1 \le i \le n$) is the frequency of the keyword $k_i$ in the document $M$ and $t_i$ is the given threshold. It is easy to see that the following lemma holds.
Lemma 2. Given a document $M$, a conjunctive threshold query $Q_K(M) = 1$ if and only if $f(k_i) \ge t_i$ for $1 \le i \le n$.
Construction
Following the model described in Section 4, our protocol for the conjunctive threshold query is composed of four algorithms: KeyGen, FilterGen, FilterExec, and BufferDec. Our conjunctive construction can be formally presented as follows.
Key Generation. KeyGen($k$). Run the key generation algorithm for the underlying fully homomorphic encryption scheme to produce the private key $sk$ and the public key $pk$.
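To make the query semantics concrete, the following is a minimal plaintext sketch of the predicate $Q_K(M)$ from Lemma 2, together with the disjunctive and conjunctive complement variants. It involves no encryption and is not the FHE-based construction of this paper; it only shows which documents a filter is meant to match. The function names, the dictionary representation of thresholds, and the tokenization rule (case-insensitive alphanumeric words) are assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_frequencies(document, keywords):
    """Return f(k) for each keyword k: its number of occurrences in the document.

    Assumption: words are maximal runs of letters/digits, compared case-insensitively.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", document.lower()))
    return {k: counts[k.lower()] for k in keywords}


def conjunctive_query(document, thresholds):
    """Q_K(M) = 1 iff f(k_i) >= t_i for every keyword k_i (Lemma 2)."""
    freq = keyword_frequencies(document, thresholds)
    return int(all(freq[k] >= t for k, t in thresholds.items()))


def disjunctive_query(document, thresholds):
    """Match iff f(k_i) >= t_i for at least one keyword k_i."""
    freq = keyword_frequencies(document, thresholds)
    return int(any(freq[k] >= t for k, t in thresholds.items()))


def conjunctive_complement_query(document, at_least, below):
    """Match iff f(k) >= t for all (k, t) in at_least and f(k) < t for all (k, t) in below."""
    freq = keyword_frequencies(document, list(at_least) + list(below))
    meets_lower_bounds = all(freq[k] >= t for k, t in at_least.items())
    meets_upper_bounds = all(freq[k] < t for k, t in below.items())
    return int(meets_lower_bounds and meets_upper_bounds)


if __name__ == "__main__":
    doc = "alpha beta alpha gamma alpha beta"
    print(conjunctive_query(doc, {"alpha": 3, "beta": 2}))
    print(disjunctive_query(doc, {"alpha": 4, "beta": 2}))
    print(conjunctive_complement_query(doc, {"alpha": 2}, {"gamma": 2}))
```

In the private setting, the thresholds and the comparison results would exist only as ciphertexts under $pk$, so the server evaluating the filter would learn neither the keywords nor whether a given document matched.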