Benchmarking Declarative Approximate Selection Predicates [article]

Oktie Hassanzadeh
2009 arXiv   pre-print
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Several similarity predicates have been proposed in the past for common quality primitives (approximate selections, joins, etc.) and have been fully
expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.
arXiv:0907.2471v1 fatcat:jg3fqtqhmnamnb55wqvgzrxeua

Automated Protein Structure Classification: A Survey [article]

Oktie Hassanzadeh
2009 arXiv   pre-print
Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classification methods to generate classification databases or increase the speed and accuracy of
semi-automatic techniques. Recognizing this need, several automated classification methods have been developed. In this survey, we overview recent developments in this area. We classify different methods based on their characteristics and compare their methodology, accuracy and efficiency. We then present a few open problems and explain future directions.
arXiv:0907.1990v1 fatcat:v4o77i5yvbgkfe45nr4y637edu

LinkedCT: A Linked Data Space for Clinical Trials [article]

Oktie Hassanzadeh, Anastasios Kementsietsidis, Lipyeow Lim, Renee J. Miller, Min Wang
2009 arXiv   pre-print
The Linked Clinical Trials (LinkedCT) project aims at publishing the first open semantic web data source for clinical trials data. The database exposed by LinkedCT is generated by (1) transforming existing data sources of clinical trials into RDF, and (2) discovering semantic links between the records in the trials data and several other data sources. In this paper, we discuss several challenges involved in these two steps and present the methodology used in LinkedCT to overcome these
challenges. Our approach for semantic link discovery involves using state-of-the-art approximate string matching techniques combined with ontology-based semantic matching of the records, all performed in a declarative and easy-to-use framework. We present an evaluation of the performance of our proposed techniques in several link discovery scenarios in LinkedCT.
arXiv:0908.0567v1 fatcat:i5if25dtvbfwbnrecypdg2ny7y

Creating probabilistic databases from duplicated data

Oktie Hassanzadeh, Renée J. Miller
2009 The VLDB journal  
The latter measures are taken from Hassanzadeh et al. [32]. ...
doi:10.1007/s00778-009-0161-2 fatcat:zdqo7b5lrjdudmzhd255i2nx3q

Semantic Concept Discovery over Event Databases [chapter]

Oktie Hassanzadeh, Shari Trewin, Alfio Gliozzo
2018 Lecture Notes in Computer Science  
Preparing a comprehensive, accurate, and unbiased report on a given topic or question is a challenging task. The first step is often a daunting discovery task that requires searching through an overwhelming number of information sources without introducing bias from the analyst's current knowledge or limitations of the information sources. A common requirement for many analysis reports is a deep understanding of various kinds of historical and ongoing events that are reported in the media. To
enable better analysis based on events, there exist several event databases containing structured representations of events extracted from news articles. Examples include GDELT [4], ICEWS [1], and EventRegistry [3]. These event databases have been successfully used to perform various kinds of analysis tasks, e.g., forecasting societal events [6]. However, there has been little work on the discovery aspect of the analysis, which results in a gap between the information requirements and the available data, and potentially a biased view of the available information. In this presentation, we describe a framework for concept discovery over event databases using semantic technologies. Unlike existing concept discovery solutions that perform discovery over text documents and in isolation from the remaining data analysis tasks [5, 8], our goal is providing a unified solution that allows deep understanding of the same data that will be used to perform other analysis tasks (e.g., hypothesis generation [7] or building models for forecasting [2]). Figure 1 shows the architecture of our system. The system takes in as input a set of event databases and RDF knowledge bases and provides as output a set of APIs that provide a unified retrieval mechanism over input data and knowledge bases, and an interface to a number of concept discovery algorithms. Figure 2 shows different portions of our system's UI that is built using our concept discovery framework APIs. The analyst can enter a natural language question or a set of concepts, and retrieve collections of relevant concepts identified and ranked using different concept discovery algorithms. A key aspect of our framework is the use of semantic technologies. In particular:
- A unified view over multiple event databases and a background RDF knowledge base is achieved through semantic link discovery and annotation.
- Natural language or keyword query understanding is performed through mapping of input terms to the concepts in the background knowledge base.
- Concept discovery and ranking is performed through neural network based semantic term embeddings.
We will present the results of our detailed evaluation of our proposed concept discovery techniques. We prepared a ground truth from reports on specific topics written by human experts, including reports from the Human Rights Watch or-
doi:10.1007/978-3-319-93417-4_19 fatcat:njxowo3qtzel5njglthwhrjyne

Exploring Big Data with Helix

Jason Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, Michael J. Ward
2015 SIGMOD record  
While much work has focused on efficient processing of Big Data, little work considers how to understand them. In this paper, we describe Helix, a system for guided exploration of Big Data. Helix provides a unified view of sources, ranging from spreadsheets and XML files with no schema, all the way to RDF graphs and relational data with well-defined schemas. Helix users explore these heterogeneous data sources through a combination of keyword searches and navigation of linked web pages that
include information about the schemas, as well as data and semantic links within and across sources. At a technical level, the paper describes the research challenges involved in developing Helix, along with a set of real-world usage scenarios and the lessons learned.
doi:10.1145/2737817.2737829 fatcat:57z6duin7vbdrajo66wmp56m2y


Oktie Hassanzadeh, Songyun Duan, Achille Fokoue, Anastasios Kementsietsidis, Kavitha Srinivas, Michael J. Ward
2011 Proceedings of the 20th international conference companion on World wide web - WWW '11  
The size, heterogeneity and dynamicity of data within an enterprise make indexing, integration and analysis of the data increasingly difficult tasks. On the other hand, there has been a massive increase in the amount of high-quality open data available on the Web that could provide invaluable insights to data analysts and business intelligence specialists within the enterprise. The goal of the Helix project is to provide users within the enterprise with a platform that allows them to perform online analysis of almost any type and amount of internal data using the power of external knowledge bases available on the Web. Such a platform requires a novel, data-format agnostic indexing mechanism, and light-weight data linking techniques that could link semantically related records across internal and external data sources of various characteristics. We present the initial architecture of our system and discuss several research challenges involved in building such a system.
doi:10.1145/1963192.1963295 dblp:conf/www/HassanzadehDFKSW11 fatcat:3dpa7373yvad3gw5p6jyh3n4oq

Automatic Curation of Clinical Trials Data in LinkedCT [chapter]

Oktie Hassanzadeh, Renée J. Miller
2015 Lecture Notes in Computer Science  
The code is available on GitHub at We will also maintain a list of projects contributed by users and application scenarios, and will be open to new proposals.  ... 
doi:10.1007/978-3-319-25010-6_16 fatcat:k25ynxmpqndwhonvgjzch3vrli


Bahar Ghadiri Bashardoost, Christina Christodoulakis, Soheil Hassas Yeganeh, Renée J. Miller, Kelly Lyons, Oktie Hassanzadeh
2015 Proceedings of the 24th International Conference on World Wide Web - WWW '15 Companion  
VizCurator permits the exploration, understanding and curation of open RDF data, its schema, and how it has been linked to other sources. We provide visualizations that enable one to seamlessly navigate through RDFS and RDF layers and quickly understand the open data, how it has been mapped or linked, how it has been structured (and could be restructured), and how deeply it has been related to other open data sources. More importantly, VizCurator provides a rich set of tools for data curation.
It suggests possible improvements to the structure of the data and enables curators to make informed decisions about enhancements to the exploration and exploitation of the data. Moreover, VizCurator facilitates the mining of temporal resources and the definition of temporal constraints through which the curator can identify conflicting facts. Finally, VizCurator can be used to create new binary temporal relations by reifying base facts and linking them to temporal resources. We will demonstrate VizCurator using, a five-star open data set mapped from the XML NIH clinical trials data ( that we have been maintaining and curating for several years.
doi:10.1145/2740908.2742845 dblp:conf/www/BashardoostCYML15 fatcat:zfvzeolzqrh2dhpr6rhvzkeegq

Linkage Query Writer

Oktie Hassanzadeh, Reynold Xin, Renée J. Miller, Anastasios Kementsietsidis, Lipyeow Lim, Min Wang
2009 Proceedings of the VLDB Endowment  
We present Linkage Query Writer (LinQuer), a system for generating SQL queries for semantic link discovery over relational data. The LinQuer framework consists of (a) LinQL, a language for specification of linkage requirements; (b) a web interface and an API for translating LinQL queries to standard SQL queries; (c) an interface that assists users in writing LinQL queries. We discuss the challenges involved in the design and implementation of a declarative and easy to use framework for
discovering links between different data items in a single data source or across different data sources. We demonstrate different steps of the linkage requirements specification and discovery process in several real world scenarios and show how the LinQuer system can be used to create high-quality linked data sources.
doi:10.14778/1687553.1687599 fatcat:emsydbbidrhdnlszbhwwssjdti

Data Management Issues on the Semantic Web

Oktie Hassanzadeh, Anastasios Kementsietsidis, Yannis Velegrakis
2012 2012 IEEE 28th International Conference on Data Engineering  
We provide an overview of the current data management research issues in the context of the Semantic Web. The objective is to introduce the audience to the area of the Semantic Web, and to highlight the fact that the area provides many interesting research opportunities for the data management community. A new model, the Resource Description Framework (RDF), coupled with a new query language, called SPARQL, leads us to revisit some classical data management problems, including efficient storage, query optimization, and data integration. These are problems that the Semantic Web community has only recently started to explore, and therefore the experience and long tradition of the database community can prove valuable. We target both experienced and novice researchers that are looking for a thorough presentation of the area and its key research topics.
doi:10.1109/icde.2012.141 dblp:conf/icde/HassanzadehKV12 fatcat:5zfbdevi5jfylmdf2dldercbou

Predicting Drug-Drug Interactions Through Large-Scale Similarity-Based Link Prediction [chapter]

Achille Fokoue, Mohammad Sadoghi, Oktie Hassanzadeh, Ping Zhang
2016 Lecture Notes in Computer Science  
Drug-Drug Interactions (DDIs) are a major cause of preventable adverse drug reactions (ADRs), causing a significant burden on the patients' health and the healthcare system. It is widely known that clinical studies cannot sufficiently and accurately identify DDIs for new drugs before they are made available on the market. In addition, existing public and proprietary sources of DDI information are known to be incomplete and/or inaccurate and so not reliable. As a result, there is an emerging
body of research on in-silico prediction of drug-drug interactions. We present Tiresias, a framework that takes in various sources of drug-related data and knowledge as inputs, and provides DDI predictions as outputs. The process starts with semantic integration of the input data that results in a knowledge graph describing drug attributes and relationships with various related entities such as enzymes, chemical structures, and pathways. The knowledge graph is then used to compute several similarity measures between all the drugs in a scalable and distributed framework. The resulting similarity metrics are used to build features for a large-scale logistic regression model to predict potential DDIs. We highlight the novelty of our proposed approach and perform thorough evaluation of the quality of the predictions. The results show the effectiveness of Tiresias in both predicting new interactions among existing drugs and among newly developed and existing drugs.
doi:10.1007/978-3-319-34129-3_47 fatcat:qxr7buoflbakla32zeogdeuacy

Benchmarking declarative approximate selection predicates

Amit Chandel, Oktie Hassanzadeh, Nick Koudas, Mohammad Sadoghi, Divesh Srivastava
2007 Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD '07  
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last few years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc) and have been
fully expressed using declarative SQL statements. In this paper we propose new similarity predicates along with their declarative realization, based on notions of probabilistic information retrieval. In particular we show how language models and hidden Markov models can be utilized as similarity predicates for data quality and present their full declarative instantiation. We also show how other scoring methods from information retrieval can be utilized in a similar setting. We then present full declarative specifications of previously proposed similarity predicates in the literature, grouping them into classes according to their primary characteristics. Finally, we present a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations. We quantify both their runtime performance as well as their accuracy for several types of common quality problems encountered in operational databases.
doi:10.1145/1247480.1247521 dblp:conf/sigmod/ChandelHKSS07 fatcat:ijpzqh63hravzflvq3477n5l6a

Schema management for document stores

Lanjun Wang, Shuo Zhang, Juwei Shi, Limei Jiao, Oktie Hassanzadeh, Jia Zou, Chen Wang
2015 Proceedings of the VLDB Endowment  
Document stores that provide the efficiency of a schema-less interface are widely used by developers in mobile and cloud applications. However, the simplicity developers achieved controversially leads to complexity for data management due to lack of a schema. In this paper, we present a schema management framework for document stores. This framework discovers and persists schemas of JSON records in a repository, and also supports queries and schema summarization. The major technical challenge
comes from varied structures of records caused by the schema-less data model and schema evolution. In the discovery phase, we apply a canonical form based method and propose an algorithm based on equivalent sub-trees to group equivalent schemas efficiently. Together with the algorithm, we propose a new data structure, eSiBu-Tree, to store schemas and support queries. In order to present a single summarized representation for heterogeneous schemas in records, we introduce the concept of "skeleton", and propose to use it as a relaxed form of the schema, which captures a small set of core attributes. Finally, extensive experiments based on real data sets demonstrate the efficiency of our proposed schema discovery algorithms, and practical use cases in real-world data exploration and integration scenarios are presented to illustrate the effectiveness of using skeletons in these applications.
doi:10.14778/2777598.2777601 fatcat:dfmt5bnhonghvcknuxd6pthxqu

Framework for evaluating clustering algorithms in duplicate detection

Oktie Hassanzadeh, Fei Chiang, Hyun Chul Lee, Renée J. Miller
2009 Proceedings of the VLDB Endowment  
The presence of duplicate records is a major data quality concern in large databases. To detect duplicates, entity resolution, also known as duplication detection or record linkage, is used as a part of the data cleaning process to identify records that potentially refer to the same real-world entity. We present the Stringer system that provides an evaluation framework for understanding what barriers remain towards the goal of truly scalable and general purpose duplication detection algorithms. In this paper, we use Stringer to evaluate the quality of the clusters (groups of potential duplicates) obtained from several unconstrained clustering algorithms used in concert with approximate join techniques. Our work is motivated by the recent significant advancements that have made approximate join algorithms highly scalable. Our extensive evaluation reveals that some clustering algorithms that have never been considered for duplicate detection perform extremely well in terms of both accuracy and scalability.
doi:10.14778/1687627.1687771 fatcat:4dpzw5qdxralveecjfpyrnxd2u