
DynaMat

Yannis Kotidis, Nick Roussopoulos
1999 SIGMOD record  
Pre-computation and materialization of views with aggregate functions is a common technique in Data Warehouses. Due to the complex structure of the warehouse and the different profiles of the users who submit queries, there is a need for tools that will automate the selection and management of the materialized data. In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel "goodness" measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints. During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. We compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. The comparison is made based on a new metric, the Detailed Cost Savings Ratio, introduced for quantifying the benefits of view materialization against incoming queries. These experiments show that DynaMat's dynamic view selection outperforms the optimal static view selection and thus, any sub-optimal static algorithm that has appeared in the literature.
doi:10.1145/304181.304215 fatcat:wsis7rdubfajdkhgb7yv5hafbm
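
The abstract above describes DynaMat's mechanism only at the level of "admit query results, score them with a goodness measure, evict under a space budget", so the sketch below illustrates just that admission/eviction loop. It is a minimal sketch under stated assumptions: the ViewPool class, the frequency/recency/size goodness formula, and all field names are illustrative inventions, not DynaMat's actual data structures or its published goodness measure.

```python
import itertools

class ViewPool:
    """Toy dynamic pool of materialized views: admit incoming query results,
    evict the least 'good' views whenever the space budget is exceeded."""

    def __init__(self, space_budget):
        self.space_budget = space_budget
        self.used = 0
        self.views = {}                      # view_id -> {"size", "hits", "last_used"}
        self.clock = itertools.count(1)

    def goodness(self, v):
        # Illustrative score only: favour small, frequently and recently used views.
        return v["hits"] * v["last_used"] / v["size"]

    def admit(self, view_id, size):
        if size > self.space_budget:
            return False                     # result too large to materialize at all
        self.views[view_id] = {"size": size, "hits": 1, "last_used": next(self.clock)}
        self.used += size
        while self.used > self.space_budget:
            victim = min((vid for vid in self.views if vid != view_id),
                         key=lambda vid: self.goodness(self.views[vid]))
            self.used -= self.views.pop(victim)["size"]
        return True

    def hit(self, view_id):
        v = self.views[view_id]
        v["hits"] += 1
        v["last_used"] = next(self.clock)
```

In DynaMat the same kind of ranking also drives which views are refreshed inside the update window; the maintenance side is omitted from the sketch.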

Cubetree

Nick Roussopoulos, Yannis Kotidis, Mema Roussopoulos
1997 SIGMOD record  
The data cube is an aggregate operator which has been shown to be very powerful for On-Line Analytical Processing (OLAP) in the context of data warehousing. It is, however, very expensive to compute, access, and maintain. In this paper we define the "cubetree" as a storage abstraction of the cube and realize it using packed R-trees for most efficient cube queries. We then reduce the problem of creation and maintenance of the cube to sorting and bulk incremental merge-packing of cubetrees. This merge-pack has been implemented to use separate storage for writing the updated cubetrees, therefore allowing cube queries to continue even during maintenance. Finally, we characterize the size of the delta increment for achieving good bulk update schedules for the cube. The paper includes experiments with various data sets measuring query and bulk update performance.
doi:10.1145/253262.253276 fatcat:7gz5kcdl3vgpre5yhe32ql7bcy
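
The abstract reduces cube maintenance to "sorting and bulk incremental merge-packing", and that part is simple enough to sketch. Below, a sorted list stands in for a packed cubetree: the delta is pre-aggregated, sorted, and merged into a freshly written copy while the old copy stays readable. This is only a sketch of the merge-pack idea under those assumptions; the actual system packs R-trees, not Python lists.

```python
from collections import defaultdict

def merge_pack(current, delta):
    """current: sorted list of (group_by_key, aggregate) pairs (the 'old' cubetree).
    delta: unsorted (group_by_key, value) pairs from newly arrived fact data.
    Returns a NEW sorted list; 'current' is never modified, so queries can keep
    running against it until the new copy is swapped in."""
    acc = defaultdict(int)                   # pre-aggregate the delta by key
    for key, value in delta:
        acc[key] += value
    delta_sorted = sorted(acc.items())

    merged, i, j = [], 0, 0
    while i < len(current) or j < len(delta_sorted):
        if j == len(delta_sorted) or (i < len(current) and current[i][0] < delta_sorted[j][0]):
            merged.append(current[i]); i += 1
        elif i == len(current) or delta_sorted[j][0] < current[i][0]:
            merged.append(delta_sorted[j]); j += 1
        else:                                # same key: fold the delta into the aggregate
            merged.append((current[i][0], current[i][1] + delta_sorted[j][1]))
            i += 1; j += 1
    return merged

# The old cubetree stays intact while the new one is written:
old = [(("A", 1), 10), (("B", 2), 5)]
new = merge_pack(old, [(("A", 1), 3), (("C", 7), 1)])   # [(("A",1),13), (("B",2),5), (("C",7),1)]
```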

Dwarf

Yannis Sismanis, Antonios Deligiannakis, Nick Roussopoulos, Yannis Kotidis
2002 Proceedings of the 2002 ACM SIGMOD international conference on Management of data - SIGMOD '02  
Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high-dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy yields an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A Petabyte 25-dimensional cube was shrunk this way to a 2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions. This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include comparisons on detailed measurements with real and synthetic datasets against previously published techniques. The comparisons show that Dwarfs by far outperform these techniques on all counts: storage space, creation time, query response time, and updates of cubes.
doi:10.1145/564691.564745 dblp:conf/sigmod/SismanisDRK02 fatcat:ffry3g3idfgp5p6gfblpxcclge
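
The two redundancies named above can be illustrated with a very small model: a trie over group-by keys shares common prefixes by construction, and hash-consing structurally identical subtrees mimics suffix coalescing. Everything here (the dict-based trie, the 'ALL' marker, the coalesce signature) is an assumption made for illustration; it is not the Dwarf node layout or its single-pass construction algorithm.

```python
from itertools import product

def build_cube_trie(rows):
    """rows: iterable of (dimension_tuple, measure), all tuples of equal arity.
    Every group-by of every row is inserted into a trie; 'ALL' marks an
    aggregated-out dimension. Shared key prefixes are stored once (prefix redundancy)."""
    root = {}
    for dims, measure in rows:
        for mask in product((True, False), repeat=len(dims)):
            key = [v if keep else "ALL" for v, keep in zip(dims, mask)]
            node = root
            for v in key[:-1]:
                node = node.setdefault(v, {})
            leaf = node.setdefault(key[-1], {"sum": 0})
            leaf["sum"] += measure
    return root

def coalesce(node, memo=None):
    """Rough stand-in for suffix coalescing: structurally identical subtrees
    (common in the sparse regions of a cube) collapse to a single shared object.
    Assumes no dimension value is literally the string 'sum'."""
    if memo is None:
        memo = {}
    if "sum" in node:                                    # leaf holding an aggregate
        sig = ("leaf", node["sum"])
    else:
        node = {k: coalesce(child, memo) for k, child in node.items()}
        sig = frozenset((k, id(child)) for k, child in node.items())
    return memo.setdefault(sig, node)
```

Coalescing bottom-up like this is just a way to visualize why sparse, high-dimensional cubes shrink so dramatically: once two branches end in the same aggregates, only one copy survives.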

TACO

Nikos Giatrakos, Yannis Kotidis, Antonios Deligiannakis, Vasilis Vassalos, Yannis Theodoridis
2010 Proceedings of the 2010 international conference on Management of data - SIGMOD '10  
Wireless sensor networks are becoming increasingly popular for a variety of applications. Users are frequently faced with the surprising discovery that readings produced by the sensing elements of their motes are often contaminated with outliers. Outlier readings can severely affect applications that rely on timely and reliable sensory data in order to provide the desired functionality. As a consequence, there is a recent trend to explore how techniques that identify outlier values can be applied to sensory data cleaning. Unfortunately, most of these approaches incur an overwhelming communication overhead, which limits their practicality. In this paper we introduce an in-network outlier detection framework, based on locality sensitive hashing, extended with a novel boosting process as well as efficient load balancing and comparison pruning mechanisms. Our method trades off bandwidth for accuracy in a straightforward manner and supports many intuitive similarity metrics.
doi:10.1145/1807167.1807199 dblp:conf/sigmod/GiatrakosKDVT10 fatcat:4rnisys3kjhpte4cbei5g2ehuy

The active MultiSync controller of the cubetree storage organization

Nick Roussopoulos, Yannis Kotidis, Yannis Sismanis
1999 SIGMOD record  
The Cubetree Storage Organization (CSO) logically and physically clusters materialized-view data, multi-dimensional indices on them, and computed aggregate values all in one compact and tight storage structure that uses a fraction of the conventional table-based space. This is a breakthrough technology for storing and accessing multi-dimensional data in terms of storage reduction, query performance and incremental bulk update speed. CSO has been extended with an Active MultiSync controller for synchronizing multiple concurrent access and continuous asynchronous online updates for a non-stop data warehouse.
doi:10.1145/304181.304584 fatcat:w5b24plr6rdsbplsi6rvebemwq

RFID Data Aggregation [chapter]

Dritan Bleco, Yannis Kotidis
2009 Lecture Notes in Computer Science  
Radio frequency identification (RFID) technology is gaining popularity for many IT-related applications. Nevertheless, an immediate adoption of RFID solutions by the existing IT infrastructure is a formidable task because of the volume of data that can be collected in a large-scale deployment of RFIDs. In this paper we present algorithms for temporal and spatial aggregation of RFID data streams, as a means to reduce their volume in an application-controllable manner. We propose algorithms of increased complexity that can aggregate the temporal records indicating the presence of an RFID tag using an application-defined storage upper bound. We further present complementary techniques that exploit the spatial correlations among RFID tags. Our methods detect multiple tags that are moved as a group and replace them with a surrogate group id, in order to further reduce the size of the representation. We provide an experimental study using real RFID traces and demonstrate the effectiveness of our methods. This work has been supported by the Basic Research Funding Program, Athens University of Economics and Business.
The ability to automatically identify objects, without contact, through their RFID tags, allows for a much more efficient tracking in the supply chain, thus eliminating the need for human intervention (which for instance is typically required in the case of bar codes). This removal of latency between the appearance of an object at a certain location and its identification allows us to consider new large- or global-scale monitoring infrastructures, enabling a much more efficient planning and management of resources. Nevertheless, an immediate adoption of RFID technology by existing IT infrastructure, consisting of systems such as enterprise resource planning, manufacturing execution, or supply chain management, is a formidable task. As an example, the typical architecture of a centralized data warehouse, used by decision support applications, assumes a periodic refresh schedule [3] that contradicts the need for currency by a supply chain management solution: when a product arrives at a distribution hub, it needs to be processed as quickly as possible. Moreover, existing systems have not been designed to cope with the voluminous data feeds that can be easily generated through a wide use of RFID technology. A pallet of a few hundred products tagged with RFIDs generates hundreds of readings every time it is located within the sensing radius of a reader. A container with several hundred pallets throws tens of thousands of such readings. Moreover, these readings are continuous: the RFID reader will continuously report all tags that it senses at every time epoch. Obviously, some form of data reduction is required in order to manage these excessive volumes of data. Fortunately, the type of data feeds generated by RFIDs are embedded with lots of redundancy. As an example, successive observations of the same tag by a reader can be easily encoded using a time interval indicating the starting and ending time of the observation. Unfortunately, this straightforward data representation is prone to data collection errors. Existing RFID deployments routinely drop a significant amount of the tag readings; often as much as 30% of the observations are lost [4]. This makes the previous solution practically ineffectual as it cannot limit, in an application-controllable manner, the number of records required in order to represent an existing RFID data stream. In this paper, we investigate data reduction methods that can reduce the size of the RFID data streams into a manageable representation that can then be fed into existing data processing and archiving infrastructures such as a data warehouse. Key to our framework is the decision to move much of the processing near the locations where RFID streams are produced. This reduces network congestion and allows for large-scale deployment of the monitoring infrastructure. 
Our methods exploit the inherent temporal redundancy of RFID data streams. While an RFID tag remains at a certain location, its presence is recorded multiple times by the readers nearby. Based on this observation we propose algorithms of increased complexity that can aggregate the records indicating the presence of this tag using an application-defined storage upper bound. During this process some information might be lost, resulting in false positive or false negative cases of identification. Our techniques minimize the inaccuracy of the reduced representation for a target space constraint. In addition to temporal, RFID data
doi:10.1007/978-3-642-02903-5_9 fatcat:dx7oofyddzeujkcnpbwyesw5ee
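
The temporal half of the approach described above is concrete enough for a small sketch: consecutive epochs in which a tag was read collapse into intervals, and if the intervals still exceed the application-defined record budget, the smallest gaps are bridged first, trading storage for false-positive epochs. The greedy smallest-gap rule and the function names are illustrative assumptions, not the paper's algorithms.

```python
def to_intervals(epochs):
    """Collapse the observation epochs of one RFID tag into (start, end) intervals."""
    intervals = []
    for t in sorted(epochs):
        if intervals and t == intervals[-1][1] + 1:
            intervals[-1] = (intervals[-1][0], t)        # extend the current interval
        else:
            intervals.append((t, t))
    return intervals

def reduce_to_budget(intervals, max_records):
    """Greedily merge the pair of adjacent intervals separated by the smallest gap
    until the record budget is met; the bridged epochs become false positives."""
    intervals = list(intervals)
    while len(intervals) > max(1, max_records):
        gaps = [(intervals[i + 1][0] - intervals[i][1], i) for i in range(len(intervals) - 1)]
        _, i = min(gaps)
        intervals[i:i + 2] = [(intervals[i][0], intervals[i + 1][1])]
    return intervals

# A tag seen at epochs 1-5, 8 and 20-22, with a budget of two records:
spans = to_intervals([1, 2, 3, 4, 5, 8, 20, 21, 22])     # [(1, 5), (8, 8), (20, 22)]
spans = reduce_to_budget(spans, 2)                        # [(1, 8), (20, 22)]
```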

In-network approximate computation of outliers with quality guarantees

Nikos Giatrakos, Yannis Kotidis, Antonios Deligiannakis, Vasilis Vassalos, Yannis Theodoridis
2013 Information Systems  
Wireless sensor networks are becoming increasingly popular for a variety of applications. Users are frequently faced with the surprising discovery that readings produced by the sensing elements of their motes are often contaminated with outliers. Outlier readings can severely affect applications that rely on timely and reliable sensory data in order to provide the desired functionality. As a consequence, there is a recent trend to explore how techniques that identify outlier values based on their similarity to other readings in the network can be applied to sensory data cleaning. Unfortunately, most of these approaches incur an overwhelming communication overhead, which limits their practicality. In this paper we introduce an in-network outlier detection framework, based on locality sensitive hashing, extended with a novel boosting process as well as efficient load balancing and comparison pruning mechanisms. Our method trades off bandwidth for accuracy in a straightforward manner and supports many intuitive similarity metrics. Our experiments demonstrate that our framework can reliably identify outlier readings using a fraction of the bandwidth and energy that would otherwise be required.
Values depend on the distance of the sensor from the source of the event that triggers the measurements. Moreover, in many applications, one cannot reliably infer whether a reading should be classified as an outlier without considering the recent history of values obtained by the nodes. Thus, in our framework we propose a more general method that detects outlier readings taking into account the recent measurements of a node, as well as spatial correlations with measurements of other nodes. Similar to recent proposals for processing declarative queries in wireless sensor networks, our techniques employ an in-network processing paradigm that fuses individual sensor readings as they are transmitted towards a base station. This fusion dramatically reduces the communication cost, often by orders of magnitude, resulting in prolonged network lifetime. While such an in-network paradigm is also used in proposed methods that address the issue of data cleaning of sensor readings by identifying and, possibly, removing outliers [6, 2, 1, 7], none of these existing techniques provides a straightforward mechanism for controlling the burden of the nodes that are assigned to the task of outlier detection. An important observation that we make in this paper is that existing in-network processing techniques cannot reduce the volume of data transmitted in the network to a satisfactory level and lack the ability to tune the resulting overhead according to the application needs and the accuracy levels required for outlier detection. Please note that it is desirable to reduce the amount of transmitted data in order to also significantly reduce the energy drain of sensor nodes. This occurs not only because radio operation is by far the biggest culprit in energy drain [8], but also because fewer data transmissions also result in fewer collisions and, thus, fewer re-transmissions by the sensor nodes. In this paper we present a novel outlier detection scheme termed TACO (TACO stands for Tunable Approximate Computation of Outliers). TACO [9] adopts two levels of hashing mechanisms. The first is based on locality sensitive hashing (LSH) [10], which is a powerful method for dimensionality reduction [10, 11, 12]. We first utilize LSH in order to encode the latest W measurements collected by each sensor node as a bitmap of d ≪ W bits. This encoding is performed locally at each node. The encoding that we utilize trades accuracy (i.e., the probability of correctly determining whether a node is an outlier or not) for bandwidth, by simply varying the desired level of dimensionality reduction, and provides tunable accuracy guarantees based on the d parameter mentioned above. 
Assuming a clustered network organization [13, 14, 15, 16], motes communicate their bitmaps to their clusterhead, which can estimate the similarity amongst the latest values of any pair of sensors within its cluster by comparing their bitmaps, and for a variety of similarity metrics that are useful for the applications we consider. Based on the performed similarity tests, and a desired minimum support specified by the posed query, each clusterhead generates a list of potential outlier nodes within its cluster. At a second (inter-cluster) phase of the algorithm, this list is then communicated among the clusterheads, in order to allow potential outliers to gain support from measurements of nodes that lie within other clusters. This process is sketched in Figure 1. The second level of hashing (omitted in Figure 1) adopted in TACO's framework comes during the intra-cluster communication phase. It is based on the hamming weight of sensor bitmaps and provides a pruning technique (regarding the number of performed bitmap comparisons) and a load balancing mechanism alleviating clusterheads from communication and processing overload. We choose to discuss this load balancing and comparison pruning mechanism separately, for ease of exposition, as well as to better exhibit its benefits. The contributions of this work can be summarized as follows: 1. We present TACO, an outlier detection framework that trades bandwidth for accuracy in a straightforward manner. TACO supports various popular similarity measures used in different application areas. Examples of such measures include, but are not limited to, the cosine similarity, the correlation coefficient and the Jaccard coefficient. 2. We present an extensive theoretical study on the trade-offs occurring between bandwidth and accuracy during TACO's operation. 3. We subsequently devise a boosting process that provably improves TACO's accuracy under no additional communication costs. 4. We devise novel load balancing and comparison pruning mechanisms, which alleviate clusterheads from excessive processing and communication load. These mechanisms result in a more uniform intra-cluster power consumption and prolonged, unhindered network operation, since the more evenly spread power consumption results in an infrequent need for network reorganization. 5. We present a detailed experimental analysis of our techniques for a variety of data sets and parameter settings. Our results demonstrate that our methods can reliably compute outliers, while at the same time significantly reducing the amount of transmitted data, with average recall and precision values exceeding 80% and often reaching 100%. It is important to emphasize that the above results often correspond to bandwidth consumptions that are lower than what is required by a simple continuous aggregate query, using a method like TAG [8]. We also demonstrate that TACO may result in prolonged network lifetime, up to a factor of 3 in our experiments. We further provide comparative results with the recently proposed technique of [2] that uses an equivalent outlier definition and supports common similarity measures. Overall, TACO appears to be up to 10% more accurate in terms of the F-Measure metric while resulting in lower bandwidth consumption. This paper proceeds as follows. Initially, in Section 2 we present related work, while Section 3 introduces our basic framework. In Sections 4 and 5 we analyze TACO's operation in detail. 
Our load balancing and comparison pruning mechanisms are described in Section 6, while Section 7 demonstrates how a variety of similarity measures can be utilized by TACO. In Section 8 we elaborate on interesting extensions to TACO that are capable of further reducing the communication cost. Section 9 presents our experimental evaluation, while Section 10 includes concluding remarks. Related Work: The emergence of sensor networks as a viable and economically practical solution for monitoring and intelligent applications has prompted the research community to devote substantial effort to define and design the necessary primitives for data acquisition based on sensor networks [8, 17]. Different network organizations have been considered, such as using hierarchical routes (i.e., the aggregation tree [18, 19]), cluster formations [13, 14, 15, 16], or even completely ad-hoc formations [20, 21, 22]. Our framework assumes a clustered network organization. Such networks have been shown to be efficient in terms of energy dissipation, thus resulting in prolonged network lifetime [15, 16]. Sensor networks can be rather unreliable, as the commodity hardware used in the development of the motes is prone to environmental interference and failures. As a result, substantial effort has been devoted to the development of efficient outlier detection techniques that manage to pinpoint motes exhibiting extraordinary behavior [23]. The authors of [1, 24] introduce a declarative data cleaning mechanism over data streams produced by the sensors. Similarly, the work of [25] introduces a data cleaning module designed to capture noise in sensor streaming data based on the prior data distribution and a given error model N(0, δ²). In [26], Kalman filters are adopted during data cleaning or outlier detection procedures. Nonetheless, without prior knowledge of the data distribution, the parameters and covariance values used in these filters are difficult to set. The data cleaning technique presented in [27] makes use of a weighted
doi:10.1016/j.is.2011.08.005 fatcat:tazckypci5hf3gt6zipejfxwvy
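
The intra-cluster step described above is concrete enough to sketch end to end: each mote reduces its window of W readings to a d-bit random-hyperplane signature, and the clusterhead estimates pairwise cosine similarity from Hamming distances, flagging any mote that fails to gather the required minimum support. All parameter values and names below are assumptions for illustration, and the boosting, load balancing, and inter-cluster phases are omitted.

```python
import math
import random

def rhp_signature(window, hyperplanes):
    """d-bit LSH signature of a length-W window of readings: one sign bit per hyperplane."""
    return [1 if sum(h_i * x_i for h_i, x_i in zip(h, window)) >= 0 else 0
            for h in hyperplanes]

def estimated_cosine(sig_a, sig_b):
    """Standard RHP estimate: Pr[bits differ] ~ angle/pi, hence cos(pi * hamming / d)."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / len(sig_a))

def clusterhead_outliers(signatures, sim_threshold, min_support):
    """signatures: {node_id: bitmap}. A node is a *potential* outlier if fewer than
    min_support other nodes in the cluster look similar to it."""
    flagged = []
    for nid, sig in signatures.items():
        support = sum(1 for other, sig_2 in signatures.items()
                      if other != nid and estimated_cosine(sig, sig_2) >= sim_threshold)
        if support < min_support:
            flagged.append(nid)
    return flagged

# Shared setup: d random hyperplanes for windows of W readings (illustrative sizes).
W, d = 16, 32
random.seed(0)
planes = [[random.gauss(0.0, 1.0) for _ in range(W)] for _ in range(d)]
```

In the inter-cluster phase the flagged nodes would get a second chance to gain support from other clusters before being reported; that, and TACO's Hamming-weight pruning, are left out of the sketch.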

Dwarf

Yannis Sismanis, Antonios Deligiannakis, Nick Roussopoulos, Yannis Kotidis
2002 Proceedings of the 2002 ACM SIGMOD international conference on Management of data - SIGMOD '02  
Dwarf is a highly compressed structure for computing, storing, and querying data cubes. Dwarf identifies prefix and suffix structural redundancies and factors them out by coalescing their store. Prefix redundancy is high on dense areas of cubes but suffix redundancy is significantly higher for sparse areas. Putting the two together fuses the exponential sizes of high-dimensional full cubes into a dramatically condensed data structure. The elimination of suffix redundancy yields an equally dramatic reduction in the computation of the cube because recomputation of the redundant suffixes is avoided. This effect is multiplied in the presence of correlation amongst attributes in the cube. A Petabyte 25-dimensional cube was shrunk this way to a 2.3GB Dwarf Cube, in less than 20 minutes, a 1:400000 storage reduction ratio. Still, Dwarf provides 100% precision on cube queries and is a self-sufficient structure which requires no access to the fact table. What makes Dwarf practical is the automatic discovery, in a single pass over the fact table, of the prefix and suffix redundancies without user involvement or knowledge of the value distributions. This paper describes the Dwarf structure and the Dwarf cube construction algorithm. Further optimizations are then introduced for improving clustering and query performance. Experiments with the current implementation include comparisons on detailed measurements with real and synthetic datasets against previously published techniques. The comparisons show that Dwarfs by far outperform these techniques on all counts: storage space, creation time, query response time, and updates of cubes.
doi:10.1145/564744.564745 fatcat:rx6nyrk2wfdsxej2saqhhndg6q

Detecting proximity events in sensor networks

Antonios Deligiannakis, Yannis Kotidis
2011 Information Systems  
Sensor networks are often used to perform monitoring tasks, such as animal and vehicle tracking, or the surveillance of enemy forces in military applications. In this paper we introduce the concept of proximity queries, which allow us to report interesting events, observed by nodes in the network that lie within a certain distance from each other. An event is triggered when a user-programmable predicate is satisfied on a sensor node. We study the problem of computing proximity queries in sensor networks and propose several alternative techniques that differ in the number of messages exchanged by the nodes and the quality of the returned answers. Our solutions utilize a distributed routing index, maintained by the nodes in the network, that is dynamically updated as new observations are obtained by the nodes. This distributed index allows us to efficiently process multiple proximity queries involving several different event types within a fraction of the cost that a straightforward evaluation requires. We present an extensive experimental study to show the benefits of our techniques under different scenarios using both synthetic and real data sets. Our results demonstrate that our algorithms scale better and require significantly fewer messages compared to a straightforward execution of the queries.
doi:10.1016/j.is.2011.03.004 fatcat:22ouh7vkargzpiletbvlqxbxpu
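
For reference, the query semantics described above can be stated in a few lines as the straightforward centralized evaluation that the paper's distributed routing index is designed to avoid: collect the nodes whose predicate fired and report those pairs that lie within the query distance. The grid bucketing and every name here are illustrative assumptions, not the paper's in-network technique.

```python
import math
from collections import defaultdict
from itertools import combinations

def proximity_pairs(events, max_dist):
    """events: {node_id: (x, y)} for nodes whose local predicate was satisfied.
    Returns the pairs of event nodes within max_dist of each other, using a coarse
    grid so that only nodes in neighbouring cells are compared."""
    grid = defaultdict(list)
    for nid, (x, y) in events.items():
        grid[(int(x // max_dist), int(y // max_dist))].append(nid)

    def close(a, b):
        (x1, y1), (x2, y2) = events[a], events[b]
        return math.hypot(x1 - x2, y1 - y2) <= max_dist

    pairs = set()
    for (cx, cy) in grid:
        candidates = sorted({n for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                             for n in grid.get((cx + dx, cy + dy), [])})
        pairs.update((a, b) for a, b in combinations(candidates, 2) if close(a, b))
    return pairs

# Two tracked animals reported within 50 m of each other:
proximity_pairs({"n3": (10.0, 20.0), "n7": (40.0, 55.0), "n9": (400.0, 5.0)}, 50.0)
# -> {("n3", "n7")}
```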

DynaMat

Yannis Kotidis, Nick Roussopoulos
1999 Proceedings of the 1999 ACM SIGMOD international conference on Management of data - SIGMOD '99  
Pre-computation and materialization of views with aggregate functions is a common technique in Data Warehouses. Due to the complex structure of the warehouse and the different profiles of the users who submit queries, there is a need for tools that will automate the selection and management of the materialized data. In this paper we present DynaMat, a system that dynamically materializes information at multiple levels of granularity in order to match the demand (workload) but also takes into account the maintenance restrictions for the warehouse, such as down time to update the views and space availability. DynaMat unifies the view selection and the view maintenance problems under a single framework using a novel "goodness" measure for the materialized views. DynaMat constantly monitors incoming queries and materializes the best set of views subject to the space constraints. During updates, DynaMat reconciles the current materialized view selection and refreshes the most beneficial subset of it within a given maintenance window. We compare DynaMat against a system that is given all queries in advance and the pre-computed optimal static view selection. The comparison is made based on a new metric, the Detailed Cost Savings Ratio, introduced for quantifying the benefits of view materialization against incoming queries. These experiments show that DynaMat's dynamic view selection outperforms the optimal static view selection and thus, any sub-optimal static algorithm that has appeared in the literature.
doi:10.1145/304182.304215 dblp:conf/sigmod/KotidisR99 fatcat:vmte2bdifjgoxoxad37juwqasq

Circumventing Data Quality Problems Using Multiple Join Paths

Yannis Kotidis, Amélie Marian, Divesh Srivastava
2006 Clean Database  
We propose the Multiple Join Path (MJP) framework for obtaining high quality information by linking fields across multiple databases, when the underlying databases have poor quality data, which are characterized by violations of integrity constraints like keys and functional dependencies within and across databases. MJP associates quality scores with candidate answers by first scoring individual data paths between a pair of field values taking into account data quality with respect to specified integrity constraints, and then agglomerating scores across multiple data paths that serve as corroborating evidence for a candidate answer. We address the problem of finding the top-few (highest quality) answers in the MJP framework using novel techniques, and demonstrate the utility of our techniques using real data and our Virtual Integration Prototype testbed.
dblp:conf/cleandb/KotidisMS06 fatcat:xpmuzzjhjfg6dlqfsrjb7j27iu
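
The scoring pipeline named in the abstract (score each join path, agglomerate the scores of paths that corroborate the same candidate answer, keep the top few) can be sketched compactly. The noisy-or combination below is an assumption chosen for illustration; the paper's actual path scores, which penalize key and functional-dependency violations along each path, are not reproduced here.

```python
from heapq import nlargest

def agglomerate(path_scores):
    """path_scores: per-path quality scores in [0, 1], all supporting the same
    candidate answer. Noisy-or combination: several weak corroborating paths can
    together outrank a single mediocre path."""
    miss = 1.0
    for s in path_scores:
        miss *= (1.0 - s)
    return 1.0 - miss

def top_k_answers(candidates, k):
    """candidates: {answer_value: [score of each join path leading to it]}."""
    return nlargest(k, ((agglomerate(scores), answer)
                        for answer, scores in candidates.items()))

# Two weakly supported paths beat one moderately supported path:
top_k_answers({"555-1234": [0.6, 0.5], "555-9999": [0.7]}, k=1)   # [(0.8, "555-1234")]
```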

The Opsis Project: Materialized Views for Data Warehouses and the Web [chapter]

Nick Roussopoulos, Yannis Kotidis, Alexandros Labrinidis, Yannis Sismanis
2003 Lecture Notes in Computer Science  
The real world we live in is mostly perceived through an incredibly large collection of views generated by humans, machines, and other systems. This is the view reality. The Opsis project concentrates its efforts in dealing with the multifaceted form and complexity of data views including data projection views, aggregate views, summary views (synopses), point of view views, and finally web views. In particular, Opsis deals with the generation, the storage organization (Cubetrees), and the efficient run-time management (DynaMat) of materialized views for Data Warehouse systems, and for web servers with dynamic content (WebViews).
doi:10.1007/3-540-38076-0_5 fatcat:o4migbm3rfhp3myv5ibbf7xqqa

Mv-Index: An Efficient Index for Graph-Query Containment

Theofilos Mailis, Yannis Kotidis, Vaggelis Nikolopoulos, Evgeny Kharlamov, Ian Horrocks, Yannis E. Ioannidis
2019 International Semantic Web Conference  
Kotidis was financed by the Research Centre of Athens University of Economics and Business, in the framework of the project entitled Original Scientific Publications.  ... 
dblp:conf/semweb/MailisKNKHI19 fatcat:hcrp6v3x5ncuhlepnagn6jwoti

Hierarchical dwarfs for the rollup cube

Yannis Sismanis, Antonios Deligiannakis, Yannis Kotidis, Nick Roussopoulos
2003 Proceedings of the 6th ACM international workshop on Data warehousing and OLAP - DOLAP '03  
The data cube operator exemplifies two of the most important aspects of OLAP queries: aggregation and dimension hierarchies. In earlier work we presented Dwarf, a highly compressed and clustered structure for creating, storing and indexing data cubes. Dwarf is a complete architecture that supports queries and updates, while also including a tunable granularity parameter that controls the amount of materialization performed. However, it does not directly support dimension hierarchies. Rollup and drilldown queries on dimension hierarchies that naturally arise in OLAP need to be handled externally and are, thus, very costly. In this paper we present extensions to the Dwarf architecture for incorporating rollup data cubes, i.e. cubes with hierarchical dimensions. We show that the extended Hierarchical Dwarf retains all its advantages both in terms of creation time and space while being able to directly and efficiently support aggregate queries on every level of a dimension's hierarchy.
doi:10.1145/956060.956064 dblp:conf/dolap/SismanisDKR03 fatcat:cpyxml2i7nfcli4k4yyw7kaq7q

Distributed similarity estimation using derived dimensions

Konstantinos Georgoulas, Yannis Kotidis
2011 The VLDB journal  
Computing the similarity between data objects is a fundamental operation for many distributed applications such as those on the World Wide Web, in Peer-to-Peer networks, or even in Sensor Networks. In our work, we provide a framework based on Random Hyperplane Projection (RHP) that permits continuous computation of similarity estimates (using the cosine similarity or the correlation coefficient as the preferred similarity metric) between data descriptions that are streamed from remote sites. These estimates are computed at a monitoring node, without the need for transmitting the actual data values. The original RHP framework is data agnostic and works for arbitrary data sets. However, data in most applications is not uniform. In our work, we first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution, and then propose a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube tailored to the provided data set using the idea of derived dimensions we first introduce. We further discuss extensions of our framework in order to cope with changes in the data distribution. In such cases, our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation using several real and synthetic data sets demonstrates that our proposed scheme outperforms the existing RHP algorithm and alternative techniques that have been proposed, providing significantly more accurate similarity computations using the same number of bits.
doi:10.1007/s00778-011-0233-y fatcat:ugsu4wtswfdu7plzdo7eyuty2i