Cumulative Algebraic Signatures for Fast String Search, Protection Against Incidental Viewing and Corruption of Data in an SDDS [chapter]

Witold Litwin, Riad Mokadem, Thomas Schwarz
Databases, Information Systems, and Peer-to-Peer Computing  
We propose to encode the records of a Scalable Distributed Data Structure (SDDS) using pre-computed algebraic signatures. The partly pre-computed algebraic signature of a string encodes each symbol as its contribution to the algebraic signature of the string. The cumulative pre-computed algebraic signature encodes each symbol as the signature of the string prefix ending with that symbol. Encoding and decoding under either scheme occur at the SDDS clients. For both schemes, and each ... ration, the overhead is of linear time complexity O(n). It is, however, slightly higher for the cumulative signature. The schemes protect the SDDS data against incidental viewing by an unauthorized server administrator. One may also use them to detect and localize silent corruption. These features should be of interest for P2P and grid computing. Both schemes also provide fast string search (match) directly on encoded data at the SDDS servers. They appear to be an alternative to known Karp-Rabin type schemes in our context of a search in a file or a database. Both accelerate string search relative to the already fast use of algebraic signatures on the original data. Moreover, both appear typically the fastest in this context among the string search algorithms we are aware of. The cumulative signature provides the fastest searches. For a string of l symbols in a field of n symbols, the complexity is almost O(1) for prefix search and O(n - l) for string search. The string manipulation capabilities of our schemes should by themselves be of interest to applications.
Introduction
At present, a record in a Scalable Distributed Data Structure (SDDS) such as LH* or RP* is implicitly or explicitly assumed to contain the original values of user/application data [LNS96], [LNS94]. This is the case for the free SDDS-2004 system [C04]. The SDDS servers store records in buckets in distributed RAM. This greatly enhances access performance compared to disk buckets. Typically, as in SDDS-2004, SDDS clients and servers are P2P PCs. The speedup of key search or insert reaches a factor of three hundred [DL00], [LMS04], and thus provides experimental support for Jim Gray's old conjecture about the advantages of distributed RAM, presented at UC Berkeley in 1992 and following the calculations in [G88].
The SDDS data in a PC's RAM are at risk of incidental exposure to the server's owner/administrator, even during simple maintenance operations such as a storage dump. Both the SDDS user and the PC administrator might be unhappy with this exposure and would like to limit it, especially with regard to the non-key data. (We recall that an SDDS record consists of a key and a non-key field. The former is often a meaningless object ID.) A PC provided as an SDDS server should typically support other applications concurrently, at the discretion of its owner. A malfunction during the execution of any of them may lead to incidental corruption of SDDS data. The user and the server administrator may both welcome a method that detects such corruption and possibly also indicates the memory area where it occurred with practical precision, e.g., 256 B at most.
Most often, the servers should be part of an organization. There is then usually a business-friendly relationship between the administrators and the user. The protection can be simple and inexpensive because it can assume that no one acts maliciously to break it, or it can rely on social safeguards. In "real life", confidential letters are circulated in a simple envelope stamped "confidential", despite the relative ease of steaming the envelope open with only a small risk of detection. The threat of a jail term suffices to keep even the easily tempted from opening interesting mail in a normal office setting, and thus saves organizations the cost of armored protection for the mail.
Encoding against incidental viewing and corruption should preserve all user capabilities. In SDDS-2004, these include the key search and the non-key scan by string search. The latter uses algebraic signatures [LS04]. The approach has advantages over more traditional, non-signature-based search methods. It may be nearly independent of the length of the searched string.
Indeed, the SDDS-2004 client sends the server only a signature a few bytes long. Presently we use 4-byte signatures for strings up to 64 KB long. The communication load is thus much lower. As an important side effect, a network snooper has a harder time making sense of captured traffic.
Presently, popular encoding schemes, including the standard ones under Windows, do not meet these requirements well. They either limit the capabilities or require systematic encryption at the server. The latter notably slows record access. Attempts to adapt popular string search algorithms to work on encoded data have therefore appeared recently for selected compression schemes, e.g., [NT04]. At the expense of heavier encoding/decoding due to the compression, one may obtain interesting performance for selected types of strings and operations upon them.
We propose two schemes that fit our needs better. Both use pre-computed algebraic signatures [LS04]. The first scheme uses partial pre-computing. It encodes each symbol of a data string to be stored as the value of its contribution to the signature of the string. The other scheme uses cumulative pre-computing. It replaces each symbol by the signature of the prefix ending with it. The SDDS client performs the encoding and the decoding, both of which are very fast. The server receives only the encoded data.
Algebraic signatures allow for fast string search using an algorithm similar to Karp-Rabin [KR87]. One compares the signature of the searched string with that of the currently examined (sub)string in the data field. The SDDS-2004 prototype uses the scheme for parallel/distributed non-key scans. These search for a string in the non-key field of the SDDS record (we recall that an SDDS record basically contains the key and the non-key field). The pre-computed signature preserves this capability over encoded data. In other words, the scan proceeds without decoding any data at any server, as one would wish for the best data protection. Surprisingly, the string search even becomes notably faster, especially with cumulative pre-computing.
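The cumulative scheme and the search over encoded data can be illustrated with a minimal sketch. It rests on several assumptions not stated in this excerpt: the signature of a string s_0 ... s_{n-1} is taken to be the sum of s_i * alpha^i in a Galois field, here GF(2^8) with the polynomial 0x11D and alpha = 2, and the signature is a single byte rather than the 4-byte signatures mentioned above. All function names are hypothetical. With a one-byte signature, the search may report false positives, so the returned offsets are candidates only.

```python
# Sketch only: GF(2^8), polynomial 0x11D and alpha = 2 are assumptions.
EXP = [0] * 510   # EXP[i] = alpha^i, doubled so gf_mul needs no modulo
LOG = [0] * 256   # LOG[x] = i such that alpha^i = x (x != 0)
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]

def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def alpha_pow(i: int) -> int:
    """alpha^i; the powers repeat with period 255."""
    return EXP[i % 255]

def alpha_inv_pow(i: int) -> int:
    """alpha^(-i), the multiplicative inverse of alpha^i."""
    return EXP[(255 - i % 255) % 255]

def signature(s: bytes) -> int:
    """Assumed one-symbol signature: XOR-sum of s[i] * alpha^i."""
    acc = 0
    for i, b in enumerate(s):
        acc ^= gf_mul(b, alpha_pow(i))
    return acc

def encode_cumulative(data: bytes) -> bytes:
    """Client side: replace each symbol by the signature of the prefix ending with it."""
    out = bytearray()
    acc = 0
    for i, s in enumerate(data):
        acc ^= gf_mul(s, alpha_pow(i))
        out.append(acc)
    return bytes(out)

def decode_cumulative(enc: bytes) -> bytes:
    """Client side: s[i] = (c[i] XOR c[i-1]) * alpha^(-i), one pass over the data."""
    out = bytearray()
    prev = 0
    for i, c in enumerate(enc):
        out.append(gf_mul(c ^ prev, alpha_inv_pow(i)))
        prev = c
    return bytes(out)

def prefix_match(enc: bytes, sig: int, length: int) -> bool:
    """Server side: a single comparison, whatever the prefix length."""
    return 0 < length <= len(enc) and enc[length - 1] == sig

def search(enc: bytes, sig: int, length: int) -> list:
    """Server side: one comparison per offset; returns candidate offsets."""
    hits = []
    if length <= 0:
        return hits
    for j in range(len(enc) - length + 1):
        before = enc[j - 1] if j > 0 else 0
        # c[j+length-1] XOR c[j-1] equals alpha^j * sig(substring at j)
        if (enc[j + length - 1] ^ before) == gf_mul(sig, alpha_pow(j)):
            hits.append(j)
    return hits
```

In this sketch the client would send only `signature(pattern)` and `len(pattern)`; the server then scans the encoded field without decoding it. A candidate offset is confirmed with certainty only when the signature is wide enough for the string lengths involved, which is why a real deployment would use longer signatures than the single byte assumed here.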
doi:10.1007/978-3-540-71661-7_14 dblp:conf/dbisp2p/LitwinMS05 fatcat:uudq7zmuqjcebhujaf76fg37ae