Compact Samples for Data Dissemination [chapter]

Tova Milo, Assaf Sagi, Elad Verbin
2006 Lecture Notes in Computer Science  
We consider data dissemination in a peer-to-peer network, where each user wishes to obtain some subset of the available information objects. In most modern algorithms for such data dissemination, the users periodically obtain samples of peer IDs (possibly with some summary of their content). They then use the samples to connect to other peers and download data pieces from them. For a set O of information objects, we call a sample of peers containing at least k possible providers for each object o ∈ O a k-sample. To balance the load, the k-samples should be fair, in the sense that for every object, its providers should appear in the sample with equal probability. Also, since most algorithms send fresh samples frequently, the k-samples should be as small as possible, to minimize communication overhead. We describe in this paper two novel techniques for generating fair and small k-samples in a P2P setting. The first is based on a particular usage of uniform sampling and has the advantage that it can build on standard P2P uniform sampling tools. The second is based on non-uniform sampling and requires more care, but is guaranteed to generate the smallest possible fair k-sample. The two algorithms exploit available dependencies between information objects to reduce the sample size, and are shown, both theoretically and experimentally, to be extremely effective.
doi:10.1007/11965893_6 fatcat:wrhgkhrzt5cytfmkwin7ejah7u
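
The k-sample definition in the abstract above can be made concrete with a small sketch. It only checks the "at least k providers per object" condition and builds a sample by a naive uniform draw; it is not either of the paper's two techniques and makes no claim about fairness or minimal size. The names `holdings`, `is_k_sample`, and `naive_uniform_k_sample` are hypothetical, chosen for illustration.

```python
import random


def is_k_sample(sample, holdings, objects, k):
    """True if every object in `objects` has at least k providers in `sample`.

    `holdings` maps a peer ID to the set of objects that peer can provide
    (an assumed representation, not taken from the paper).
    """
    return all(
        sum(1 for peer in sample if obj in holdings[peer]) >= k
        for obj in objects
    )


def naive_uniform_k_sample(holdings, objects, k, rng):
    """Add peers in a uniformly random order until the sample is a k-sample.

    Returns None if even the full peer set is not a k-sample.
    This is a baseline for illustration only.
    """
    peers = list(holdings)
    rng.shuffle(peers)  # uniformly random order of all peers
    sample = []
    for peer in peers:
        sample.append(peer)
        if is_k_sample(sample, holdings, objects, k):
            return sample
    return None
```

A caller would pass a seeded `random.Random` instance as `rng` to make runs reproducible.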

A simpler analysis of Burrows–Wheeler-based compression

Haim Kaplan, Shir Landau, Elad Verbin
2007 Theoretical Computer Science  
In this paper we present a new technique for worst-case analysis of compression algorithms that are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) obtain the Burrows-Wheeler Transform of the text, 2) convert the transform into a sequence of integers using the move-to-front algorithm, 3) encode the integers using arithmetic coding or any order-0 encoding (possibly with run-length encoding). We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s and any $\mu > 1$, the length of the compressed string is bounded by $\mu \cdot |s| H_k(s) + \log(\zeta(\mu)) \cdot |s| + \mu g_k + O(\log n)$, where $H_k$ is the k-th order empirical entropy, $g_k$ is a constant depending only on k and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \cdots$ is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by $1.7286 \cdot |s| H_k(s) + g_k + O(\log n)$. This bound is better than the bound we give for bw0.
doi:10.1016/j.tcs.2007.07.020 fatcat:ujldiywxmbdbxg7df2jost2cva
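
Steps 1 and 2 of bw0, as listed in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions: the transform is computed by sorting all rotations (quadratic space, fine only for short strings), a sentinel character is appended so the transform is well defined, and the move-to-front list starts as the sorted alphabet of the input. Step 3 (order-0 entropy coding) is omitted. The function names are hypothetical.

```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler Transform via sorted rotations (naive, for short inputs).

    `sentinel` is assumed not to occur in `text` and to sort before
    every character of `text`.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column


def move_to_front(seq):
    """Replace each symbol by its current index in a self-organizing list."""
    alphabet = sorted(set(seq))  # assumed initial list order
    out = []
    for ch in seq:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))  # move the symbol to the front
    return out
```

Runs of equal characters in the transform become runs of zeros after move-to-front, which is what makes the final order-0 coding step effective.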

The limits of buffering

Elad Verbin, Qin Zhang
2010 Proceedings of the 42nd ACM symposium on Theory of computing - STOC '10  
We study the dynamic membership (or dynamic dictionary) problem, which is one of the most fundamental problems in data structures. We study the problem in the external memory model with cell size b bits and cache size m bits. We prove that if the amortized cost of updates is at most 0.9 (or any other constant < 1), then the query cost must be $\Omega(\log_{b \log n}(n/m))$, where n is the number of elements in the dictionary. In contrast, when the update time is allowed to be 1 + o(1), then a bit vector or hash table gives query time O(1). Thus, this is a threshold phenomenon for data structures. This lower bound answers a folklore conjecture of the external memory community. Since almost any data structure task can solve membership, our lower bound implies a dichotomy between two alternatives: (i) make the amortized update time at least 1 (so the data structure does not buffer, and we lose one of the main potential advantages of the cache), or (ii) make the query time at least roughly logarithmic in n. Our result holds even when the updates and queries are chosen uniformly at random and there are no deletions; it holds for randomized data structures, holds when the universe size is O(n), and does not make any restrictive assumptions such as indivisibility. All of the lower bounds we prove hold regardless of the space consumption of the data structure, while the upper bounds only need linear space. The lower bound has some striking implications for external memory data structures. It shows that the query complexities of many problems, such as 1D-range counting, predecessor, rank-select, and many others, are all the same in the regime where the amortized update time is less than 1, as long as the cell size is large enough (b = polylog(n) suffices). The proof of our lower bound is based on a new combinatorial lemma called the Lemma of Surprising Intersections (LOSI), which allows us to use a proof methodology where we first analyze the intersection structure of the positive queries by using encoding arguments, and then use statistical arguments to deduce properties of the intersection structure of all queries, even the negative ones. In most other data structure arguments that we know, it is difficult to argue anything about the negative queries. Therefore we believe that the LOSI and this proof methodology might find future uses for other problems.
doi:10.1145/1806689.1806752 dblp:conf/stoc/VerbinZ10 fatcat:qtxnv2n47jfphfdop3ogltdsza

A Simpler Analysis of Burrows-Wheeler Based Compression [chapter]

Haim Kaplan, Shir Landau, Elad Verbin
2006 Lecture Notes in Computer Science  
In this paper we present a new technique for worst-case analysis of compression algorithms that are based on the Burrows-Wheeler Transform. We deal mainly with the algorithm proposed by Burrows and Wheeler in their first paper on the subject [6], called bw0. This algorithm consists of the following three essential steps: 1) obtain the Burrows-Wheeler Transform of the text, 2) convert the transform into a sequence of integers using the move-to-front algorithm, 3) encode the integers using arithmetic coding or any order-0 encoding (possibly with run-length encoding). We achieve a strong upper bound on the worst-case compression ratio of this algorithm. This bound is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, we show that for any input string s and any $\mu > 1$, the length of the compressed string is bounded by $\mu \cdot |s| H_k(s) + \log(\zeta(\mu)) \cdot |s| + \mu g_k + O(\log n)$, where $H_k$ is the k-th order empirical entropy, $g_k$ is a constant depending only on k and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \cdots$ is the standard zeta function. As part of the analysis we prove a result on the compressibility of integer sequences, which is of independent interest. Finally, we apply our techniques to prove a worst-case bound on the compression ratio of a compression algorithm based on the Burrows-Wheeler Transform followed by distance coding, for which worst-case guarantees have never been given. We prove that the length of the compressed string is bounded by $1.7286 \cdot |s| H_k(s) + g_k + O(\log n)$. This bound is better than the bound we give for bw0.
doi:10.1007/11780441_26 fatcat:hvvhbhj66zbexnahjirvznldpa

Sorting signed permutations by reversals, revisited

Haim Kaplan, Elad Verbin
2005 Journal of computer and system sciences (Print)  
The problem of sorting signed permutations by reversals (SBR) is a fundamental problem in computational molecular biology. The goal is, given a signed permutation, to find a shortest sequence of reversals that transforms it into the positive identity permutation, where a reversal is the operation of taking a segment of the permutation, reversing it, and flipping the signs of its elements. In this paper we describe a randomized algorithm for SBR. The algorithm tries to sort the permutation by repeatedly performing a random oriented reversal. This process is in fact a random walk on the graph whose nodes are permutations and in which an arc from one permutation to another corresponds to an oriented reversal that transforms the first into the second. We show that if this random walk stops at the identity permutation, then we have found a shortest sequence. We give empirical evidence that this process indeed succeeds with high probability on a random permutation. To implement our algorithm we describe a data structure that maintains a permutation and allows us to draw an oriented reversal uniformly at random and perform it in sub-linear time. With this data structure we can implement the random walk in $O(n^{3/2} \log n)$ time, thus obtaining an algorithm for SBR that almost always runs in subquadratic time. The data structures we present may also be of independent interest for developing other algorithms for SBR, and for other problems. Finally, we present the first efficient parallel algorithm for SBR. We obtain this result by developing a fast, parallelizable implementation of the recent algorithm of Bergeron (Proceedings of CPM, 2001, pp. 106-117) for sorting signed permutations by reversals. Our implementation runs in $O(n^2 \log n)$ time on a regular RAM, and in $O(n \log n)$ time on a PRAM using n processors.
doi:10.1016/j.jcss.2004.12.002 fatcat:netdafa5enhr7cmznu6dfad2m4
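
The reversal operation defined in the abstract above (take a segment, reverse it, flip the signs) is easy to state in code. The sketch below shows only that operation on a plain list, in linear time; it is not the paper's sub-linear data structure, and it does not select oriented reversals. The function names and the 0-based inclusive indexing are assumptions made for illustration.

```python
def reversal(perm, i, j):
    """Return a copy of `perm` with the segment perm[i..j] (inclusive)
    reversed and the sign of every element in that segment flipped."""
    if not 0 <= i <= j < len(perm):
        raise ValueError("need 0 <= i <= j < len(perm)")
    flipped = [-x for x in reversed(perm[i:j + 1])]
    return perm[:i] + flipped + perm[j + 1:]


def is_positive_identity(perm):
    """True if perm is +1, +2, ..., +n."""
    return perm == list(range(1, len(perm) + 1))
```

For example, applying `reversal` to `[-2, -1, 3]` with `i = 0` and `j = 1` sorts the permutation in a single step.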

Compact samples for data dissemination

Tova Milo, Assaf Sagi, Elad Verbin
2008 Journal of computer and system sciences (Print)  
We consider data dissemination in a peer-to-peer network, where each user wishes to obtain some subset of the available information objects. In most modern algorithms for such data dissemination, the users periodically obtain samples of peer IDs (possibly with some summary of their content). They then use the samples to connect to other peers and download data pieces from them. For a set O of information objects, we call a sample of peers containing at least k possible providers for each object o ∈ O a k-sample. To balance the load, the k-samples should be fair, in the sense that for every object, its providers should appear in the sample with equal probability. Also, since most algorithms send fresh samples frequently, the k-samples should be as small as possible, to minimize communication overhead. We describe in this paper two novel techniques for generating fair and small k-samples in a P2P setting. The first is based on a particular usage of uniform sampling and has the advantage that it can build on standard P2P uniform sampling tools. The second is based on non-uniform sampling and requires more care, but is guaranteed to generate the smallest possible fair k-sample. The two algorithms exploit available dependencies between information objects to reduce the sample size, and are shown, both theoretically and experimentally, to be extremely effective.
doi:10.1016/j.jcss.2007.07.012 fatcat:6cb5n47jcbf5tlyqx62ojrb2li

Matrix Tightness: A Linear-Algebraic Framework for Sorting by Transpositions [chapter]

Tzvika Hartman, Elad Verbin
2006 Lecture Notes in Computer Science  
Elad Verbin would like to thank Martin C.  ... 
doi:10.1007/11880561_23 fatcat:77aed3ldrfcxdmeham2wqqhxd4

Sorting and Selection in Posets

Constantinos Daskalakis, Richard M. Karp, Elchanan Mossel, Samantha J. Riesenfeld, Elad Verbin
2011 SIAM journal on computing (Print)  
Classical problems of sorting and searching assume an underlying linear ordering of the objects being compared. In this paper, we study these problems in the context of partially ordered sets, in which some pairs of objects are incomparable. This generalization is interesting from a combinatorial perspective, and it has immediate applications in ranking scenarios where there is no underlying linear ordering, e.g., conference submissions. It also has applications in reconstructing certain types of networks, including biological networks. Our results represent significant progress over previous results from two decades ago by Faigle and Turán. In particular, we present the first algorithm that sorts a width-w poset of size n with optimal query complexity $O(n(w + \log n))$. We also describe a variant of Mergesort with query complexity $O(wn \log \frac{n}{w})$ and total complexity $O(w^2 n \log \frac{n}{w})$; an algorithm with the same query complexity was given by Faigle and Turán, but no efficient implementation of that algorithm is known. Both our sorting algorithms can be applied with negligible overhead to the more general problem of reconstructing transitive relations. We also consider two related problems: finding the minimal elements, and its generalization to finding the bottom k "levels", called the k-selection problem. We
doi:10.1137/070697720 fatcat:ejhc3uvpdzayrcwe7w5k4hjise

Distance Oracles for Sparse Graphs

Christian Sommer, Elad Verbin, Wei Yu
2009 2009 50th Annual IEEE Symposium on Foundations of Computer Science  
Thorup and Zwick, in their seminal work, introduced the approximate distance oracle, which is a data structure that answers distance queries in a graph. For any integer k, they showed an efficient algorithm to construct an approximate distance oracle using space $O(kn^{1+1/k})$ that can answer queries in time O(k) with a distance estimate that is at most $\alpha = 2k - 1$ times larger than the actual shortest distance ($\alpha$ is called the stretch). They proved that, under a combinatorial conjecture, their data structure is optimal in terms of space: if a stretch of at most $2k - 1$ is desired, then the space complexity is at least $n^{1+1/k}$. Their proof holds even if infinite query time is allowed: it is essentially an "incompressibility" result. Also, the proof only holds for dense graphs, and the best bound it can prove only implies that the size of the data structure is lower bounded by the number of edges of the graph. Naturally, the following question arises: what happens for sparse graphs? In this paper we give a new lower bound for approximate distance oracles in the cell-probe model. This lower bound holds even for sparse (polylog(n)-degree) graphs, and it is not an "incompressibility" bound: we prove a three-way tradeoff between space, stretch and query time. We show that, when the query time is t and the stretch is $\alpha$, the space S must satisfy $S \ge n^{1+\Omega(1/(t\alpha))} / \lg n$ (Eq. (1)). This lower bound follows by a reduction from lopsided set disjointness to distance oracles, based on and motivated by recent work of Pǎtraşcu. Our results in fact show that for any high-girth regular graph G, an approximate distance oracle that supports efficient queries for all subgraphs of G must obey Eq. (1). We also prove some lemmas that count sets of paths in high-girth regular graphs and high-girth regular expanders, which might be of independent interest.
doi:10.1109/focs.2009.27 dblp:conf/focs/SommerVY09 fatcat:deu5xejmezb7xcnd7qp7j7lshy

Data Structure Lower Bounds on Random Access to Grammar-Compressed Strings [chapter]

Elad Verbin, Wei Yu
2013 Lecture Notes in Computer Science  
In this paper we investigate the problem of building a static data structure that represents a string s using space close to its compressed size, and allows fast access to individual characters of s. This type of structure was investigated by the recent paper of Bille et al. [3]. Let n be the size of a context-free grammar that derives a unique string s of length L. (Note that L might be exponential in n.) Bille et al. showed a data structure that uses space O(n) and allows querying for the i-th character of s in time $O(\log L)$. Their data structure works on a word RAM with a word size of $\log L$ bits. Here we prove that for such data structures, if the space is poly(n), then the query time must be at least $(\log L)^{1-\varepsilon} / \log S$, where S is the space used, for any constant $\varepsilon > 0$. As a function of n, our lower bound is $\Omega(n^{1/2-\varepsilon})$. Our proof holds in the cell-probe model with a word size of $\log L$ bits, so in particular it holds in the word RAM model. We show that no lower bound significantly better than $n^{1/2-\varepsilon}$ can be achieved in the cell-probe model, since there is a data structure in the cell-probe model that uses O(n) space and achieves $O(\sqrt{n \log n})$ query time. The "bad" setting of parameters occurs roughly when $L = 2^{\sqrt{n}}$. We also prove a lower bound for the case of not-as-compressible strings, where, say, $L = n^{1+\varepsilon}$. For this case, we prove that if the space is $n \cdot \mathrm{polylog}(n)$, then the query time must be at least $\Omega(\log n / \log\log n)$. The proof works by reduction to communication complexity, namely to the LSD (Lopsided Set Disjointness) problem, recently employed by Pǎtraşcu and others. We also prove lower bounds for the case of LZ-compression and Burrows-Wheeler (BWT) compression. All of our lower bounds hold even when the strings are over an alphabet of size 2 and hold even for randomized data structures with 2-sided error.
doi:10.1007/978-3-642-38905-4_24 fatcat:kywb33jslzbrrblpblangd5bz4
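
To make the random-access problem in the abstract above concrete, here is a minimal sketch of the textbook approach: store the expansion length of every nonterminal and walk down the grammar from the start symbol. Its query time is proportional to the height of the grammar, so it is neither the O(log L) structure of Bille et al. nor anything from this paper. The grammar encoding (a dict mapping a nonterminal either to a single character or to a pair of nonterminals) and the function names are assumptions made for illustration.

```python
def expansion_lengths(grammar):
    """Map every nonterminal to the length of the string it derives.

    `grammar[sym]` is either a one-character string (terminal rule)
    or a pair (left, right) of nonterminal names (binary rule).
    Uses recursion, so it suits grammars of modest height only.
    """
    memo = {}

    def length(sym):
        if sym not in memo:
            rule = grammar[sym]
            if isinstance(rule, str):
                memo[sym] = 1
            else:
                memo[sym] = length(rule[0]) + length(rule[1])
        return memo[sym]

    for sym in grammar:
        length(sym)
    return memo


def char_at(grammar, lens, start, i):
    """Return the i-th character (0-based) of the string derived from `start`."""
    if not 0 <= i < lens[start]:
        raise IndexError(i)
    sym = start
    while not isinstance(grammar[sym], str):
        left, right = grammar[sym]
        if i < lens[left]:
            sym = left            # target lies in the left half
        else:
            i -= lens[left]       # skip the left half
            sym = right
    return grammar[sym]
```

A doubling grammar with rules D0 → a and Dk → D(k-1) D(k-1) shows why L can be exponential in n: 31 rules derive a string of length 2^30.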

On agnostic boosting and parity learning

Adam Tauman Kalai, Yishay Mansour, Elad Verbin
2008 Proceedings of the fourtieth annual ACM symposium on Theory of computing - STOC 08  
The motivating problem is agnostically learning parity functions, i.e., parity with arbitrary or adversarial noise. Specifically, given random labeled examples from an arbitrary distribution, we would like to produce a hypothesis whose accuracy nearly matches the accuracy of the best parity function. Our algorithm runs in time $2^{O(n/\log n)}$, which matches the best known for the easier cases of learning parities with random classification noise (Blum et al, 2003) and for agnostically learning parities over the uniform distribution on inputs (Feldman et al, 2006). Our approach is as follows. We give an agnostic boosting theorem that is capable of nearly achieving optimal accuracy, improving upon earlier studies (starting with Ben David et al, 2001). To achieve this, we circumvent previous lower bounds by altering the boosting model. We then show that the (random noise) parity learning algorithm of Blum et al (2000) fits our new model of agnostic weak learner. Our agnostic boosting framework is completely general and may be applied to other agnostic learning problems. Hence, it also sheds light on the actual difficulty of agnostic learning by showing that full agnostic boosting is indeed possible.
doi:10.1145/1374376.1374466 dblp:conf/stoc/KalaiMV08 fatcat:kcgl7xxjmfa3tdwae6abzr4kgu

Colored intersection searching via sparse rectangular matrix multiplication

Haim Kaplan, Micha Sharir, Elad Verbin
2006 Proceedings of the twenty-second annual symposium on Computational geometry - SCG '06  
In a Batched Colored Intersection Searching Problem (CI), one is given a set of n geometric objects (of a certain class). Each object is colored by one of c colors, and the goal is to report all pairs of colors $(c_1, c_2)$ such that there are two objects, one colored $c_1$ and one colored $c_2$, that intersect each other. We also consider the bipartite version of the problem, where we are interested in intersections between objects of one class with objects of another class (e.g., points and halfspaces). In a Sparse Rectangular Matrix Multiplication Problem (SRMM), one is given an $n_1 \times n_2$ matrix A and an $n_2 \times n_3$ matrix B, each containing at most m non-zero entries, and the goal is to compute their product AB. In this paper we present a technique for solving CI problems over a wide range of classes of geometric objects. The basic idea is first to use some decomposition method, such as geometric cuttings, to represent the intersection graph of the objects as a union of bicliques. Then, in each of these bicliques, contract all vertices of the same color. Finally, use an algorithm for sparse matrix multiplication (adapted from Yuster and Zwick [20]) to compute the union of the bicliques. We apply the technique to segments in $\mathbb{R}^1$, to segments in $\mathbb{R}^2$, to points and halfplanes in $\mathbb{R}^2$, and, more generally, to points and halfspaces in $\mathbb{R}^d$, for any fixed d. However, the technique extends to colored intersection searching in any class (or pair of classes) of geometric objects of constant descriptive complexity. In particular, using our technique we obtain an algorithm that reports all the pairs of intersecting colors for n points and n halfplanes in $\mathbb{R}^2$ that are colored by c colors, in $O(n^{4/3} c^{0.46})$ time when $n \ge c^{1.44}$, and in $O(n^{1.04} c^{0.9} + c^2)$ time when $n \le c^{1.44}$. The algorithms that we give for CI use the algorithm for SRMM as a black box, which means that any improved algorithm for SRMM immediately leads to an improved algorithm for all colored intersection problems that our method applies to. We also show that the complexity of computing all intersecting colors in a set of segments on the real line is identical, up to a polylogarithmic multiplicative factor, to the complexity of SRMM with the appropriate parameters.
doi:10.1145/1137856.1137866 dblp:conf/compgeom/KaplanSV06 fatcat:b4aihjoo2jcupnljgsuc2zfgly
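
The last two steps of the technique in the abstract above (contract same-colored vertices in each biclique, then combine the bicliques through a matrix product) can be sketched as follows. The sketch assumes the biclique decomposition is already given, which is the geometric part the paper handles with cuttings, and it uses a naive sparse Boolean product in place of the Yuster-Zwick based algorithm. The input format and the name `intersecting_color_pairs` are hypothetical.

```python
def intersecting_color_pairs(bicliques, color):
    """Report all color pairs (c1, c2) joined by some biclique.

    `bicliques` is a list of (left_objects, right_objects) pairs in which
    every left object intersects every right object (assumed precomputed).
    `color` maps each object to its color.
    """
    # Contracting same-colored vertices turns each biclique into a pair of
    # color sets. Store them as two sparse Boolean matrices:
    #   a: color -> indices of bicliques with that color on the left
    #   b: biclique index -> colors on the right
    a = {}
    b = {}
    for idx, (left, right) in enumerate(bicliques):
        for obj in left:
            a.setdefault(color[obj], set()).add(idx)
        b[idx] = {color[obj] for obj in right}

    # Naive sparse Boolean product of a and b.
    pairs = set()
    for c1, indices in a.items():
        for idx in indices:
            for c2 in b[idx]:
                pairs.add((c1, c2))
    return pairs
```

Because the product is used as a black box, swapping in a faster sparse matrix multiplication routine would not change the surrounding logic.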

Data Structure Lower Bounds on Random Access to Grammar-Compressed Strings [article]

Shiteng Chen, Elad Verbin, Wei Yu
2012 arXiv   pre-print
In this paper we investigate the problem of building a static data structure that represents a string s using space close to its compressed size, and allows fast access to individual characters of s. This type of structure was investigated by the recent paper of Bille et al. Let n be the size of a context-free grammar that derives a unique string s of length L. (Note that L might be exponential in n.) Bille et al. showed a data structure that uses space O(n) and allows querying for the i-th character of s in time $O(\log L)$. Their data structure works on a word RAM with a word size of $\log L$ bits. Here we prove that for such data structures, if the space is poly(n), then the query time must be at least $(\log L)^{1-\epsilon} / \log S$, where S is the space used, for any constant $\epsilon > 0$. As a function of n, our lower bound is $\Omega(n^{1/2-\epsilon})$. Our proof holds in the cell-probe model with a word size of $\log L$ bits, so in particular it holds in the word RAM model. We show that no lower bound significantly better than $n^{1/2-\epsilon}$ can be achieved in the cell-probe model, since there is a data structure in the cell-probe model that uses O(n) space and achieves $O(\sqrt{n \log n})$ query time. The "bad" setting of parameters occurs roughly when $L = 2^{\sqrt{n}}$. We also prove a lower bound for the case of not-as-compressible strings, where, say, $L = n^{1+\epsilon}$. For this case, we prove that if the space is $n \cdot \mathrm{polylog}(n)$, then the query time must be at least $\Omega(\log n / \log\log n)$. The proof works by reduction to communication complexity, namely to the LSD problem, recently employed by Patrascu and others. We also prove lower bounds for the case of LZ-compression and Burrows-Wheeler (BWT) compression. All of our lower bounds hold even when the strings are over an alphabet of size 2 and hold even for randomized data structures with 2-sided error.
arXiv:1203.1080v2 fatcat:fjfenk2vzvfh7mev5okt4ce4i4

The Coin Problem and Pseudorandomness for Branching Programs

Joshua Brody, Elad Verbin
2010 2010 IEEE 51st Annual Symposium on Foundations of Computer Science  
The Coin Problem is the following problem: a coin is given, which lands on heads with probability either $1/2 + \beta$ or $1/2 - \beta$. We are given the outcome of n independent tosses of this coin, and the goal is to guess which way the coin is biased, and to answer correctly with probability $\ge 2/3$. When our computational model is unrestricted, the majority function is optimal, and succeeds when $\beta \ge c/\sqrt{n}$ for a large enough constant c. The coin problem is open and interesting in models that cannot compute the majority function. In this paper we study the coin problem in the model of read-once width-w branching programs. We prove that in order to succeed in this model, $\beta$ must be at least $1/(\log n)^{\Theta(w)}$. For constant w this is tight by considering the recursive tribes function, and for other values of w this is nearly tight by considering other read-once AND-OR trees. We generalize this to a Dice Problem, where instead of independent tosses of a coin we are given independent tosses of one of two m-sided dice. We prove that if the distributions are too close and the mass of each side of the dice is not too small, then the dice cannot be distinguished by small-width read-once branching programs. We suggest one application for this kind of theorem: we prove that Nisan's Generator fools width-w read-once regular branching programs, using seed length $O(w^4 \log n \log\log n + \log n \log(1/\varepsilon))$. For $w = \varepsilon = \Theta(1)$, this seed length is $O(\log n \log\log n)$. The coin theorem and its relatives might have other connections to PRGs. This application is related to the independent, but chronologically earlier, work of Braverman, Rao, Raz and Yehudayoff [1].
doi:10.1109/focs.2010.10 dblp:conf/focs/BrodyV10 fatcat:ea4qck6ffvfndcd4cro5yuuloy
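
The unrestricted-model baseline mentioned in the abstract above, guessing the bias by majority vote, can be sketched as below. This illustrates only the problem statement and the majority rule; it says nothing about branching programs. The function names, the 0/1 encoding of tosses, and the tie-breaking toward -1 are assumptions made for illustration.

```python
import random


def toss_coin(n, bias, rng):
    """Return n independent tosses (1 = heads) with P(heads) = 1/2 + bias.

    `bias` is +beta or -beta; `rng` is a random.Random instance.
    """
    p_heads = 0.5 + bias
    return [1 if rng.random() < p_heads else 0 for _ in range(n)]


def majority_guess(tosses):
    """Guess +1 (biased toward heads) if heads form a strict majority, else -1."""
    heads = sum(tosses)
    return 1 if 2 * heads > len(tosses) else -1
```

Repeating `majority_guess(toss_coin(n, beta, rng))` over many trials gives an empirical success rate that can be compared against the 2/3 target for different choices of n and beta.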

On the complexity of cell flipping in permutation diagrams and multiprocessor scheduling problems

Martin Charles Golumbic, Haim Kaplan, Elad Verbin
2005 Discrete Mathematics  
Permutation diagrams have been used in circuit design to model a set of single point nets crossing a channel, where the minimum number of layers needed to realize the diagram equals the clique number ω(G) of its permutation graph, the value of which can be calculated in $O(n \log n)$ time. We consider a generalization of this model motivated by "standard cell" technology in which the numbers on each side of the channel are partitioned into consecutive subsequences, or cells, each of which can be left unchanged or flipped (i.e., reversed). We ask for what choice of flippings the resulting clique number will be minimum or maximum. We show that when one side of the channel is fixed (no flipping), an optimal flipping for the other side can be found in $O(n \log n)$ time for the maximum clique number, and that when both sides are free this can be solved in $O(n^2)$ time. We also prove NP-completeness of finding a flipping that gives a minimum clique number, even when one side of the channel is fixed, and even when the size of the cells is restricted to be less than a small constant. Moreover, since the complement of a permutation graph is also a permutation graph, the same complexity results hold for the stable set (independence) number. In the process of the NP-completeness proof we also prove NP-completeness of a restricted variant of a scheduling problem. This new NP-completeness result may be of independent interest. (A preliminary version of this paper has appeared in the [...])
doi:10.1016/j.disc.2004.08.042 fatcat:fiqmzmygkbcmxfbq7d7k37773a
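
The O(n log n) clique-number computation mentioned in the abstract above can be sketched under a standard convention, assumed here: the top side of the channel is 1, 2, ..., n, the bottom side is a permutation, and two nets cross exactly when they form an inversion, so a clique is a decreasing subsequence. The second function enumerates all flippings of the bottom cells by brute force, in exponential time, only to illustrate the optimization problem; it is not the paper's algorithm. All names are hypothetical.

```python
from bisect import bisect_left
from itertools import product


def clique_number(perm):
    """Length of the longest strictly decreasing subsequence of `perm`,
    computed in O(n log n) by patience sorting on negated values."""
    tails = []  # tails[k] = smallest last key of an increasing run of length k+1
    for x in perm:
        key = -x
        pos = bisect_left(tails, key)
        if pos == len(tails):
            tails.append(key)
        else:
            tails[pos] = key
    return len(tails)


def flip_extremes(cells):
    """Return (min, max) clique number over all ways of flipping the cells.

    `cells` lists the bottom side as consecutive cells; the top side is
    assumed fixed. Tries all 2^len(cells) flippings.
    """
    values = []
    for flips in product([False, True], repeat=len(cells)):
        perm = []
        for cell, flipped in zip(cells, flips):
            perm.extend(reversed(cell) if flipped else cell)
        values.append(clique_number(perm))
    return min(values), max(values)
```

For the cells `[[1, 2], [3, 4]]`, leaving both cells unflipped gives clique number 1, while flipping either cell creates a crossing pair and raises it to 2.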