
Measuring similarity of semi-structured documents with context weights

Christopher C. Yang, Nan Liu
2006 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '06  
In this work, we study similarity measures for text-centric XML documents based on an extended vector space model, which considers both document content and structure.  ...  Experimental results based on a benchmark showed superior performance of the proposed measure over the baseline which ignores structural knowledge of XML documents.  ...  To evaluate a hypothesis, we instantiate the similarity measure with the set of context weights and use it to rank the collection for the set of training documents and compute the mean average precision  ... 
doi:10.1145/1148170.1148334 dblp:conf/sigir/YangL06 fatcat:wq5q6ybhybf7vl5tkdv4564u7i
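
The evaluation step quoted in the snippet above (rank the collection for each training document, then compute mean average precision) can be illustrated with a minimal sketch. It assumes binary relevance judgments and the standard definition of average precision; the function names and the toy rankings are hypothetical and are not taken from the paper.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids.

    Assumes binary relevance; relevant items missing from the ranking
    contribute zero precision.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two query documents with hypothetical rankings.
runs = [
    (["a", "b", "c"], {"a", "c"}),
    (["b", "a"], {"a"}),
]
score = mean_average_precision(runs)
```

In a setting like the one described, each candidate set of context weights would produce its own rankings, and the candidate with the highest score on the training documents would be kept.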

Weighted Naive Bayes Model for Semi-Structured Document Categorization [article]

Pierre-François Marteau, Eugen Popovici
2009 arXiv   pre-print
The aim of this paper is the supervised classification of semi-structured data.  ...  We define what we call the structural context of occurrence for unstructured data, and we derive a recursive formulation in which parameters are used to weight the contribution of structural element relatively  ...  XML CONTEXT MODELING: A semi-structured document d is well represented by a tree structure T_d, containing a set of vertices S_d and a set of edges A_d.  ... 
arXiv:0901.0358v1 fatcat:xglfec6ffnfdxihiut2iz4ip4y

Context-sensitive access to e-document corpus [article]

A. V. Smirnov, T. V. Levashova, M. P. Pashkin, N. G. Shilov, A. A. Krizhanovsky, A. M. Kashevnik, A. S. Komarova
2006 arXiv   pre-print
...  (iii) a method for identification of relevant e-documents based on semantic similarity measures.  ...  Wiki resources, as a modern text format, provide a huge amount of text in a semi-formalized structure.  ... 
arXiv:cs/0610058v1 fatcat:34sgvu6gnzhejedktyw53bxmvq

Semi-metric Behavior in Document Networks and its Application to Recommendation Systems [article]

L.M. Rocha
2003 arXiv   pre-print
Regarding (1), we present the idea of semi-metric distance graphs, and introduce ratios to measure semi-metric behavior.  ...  Thus, we are presented with a problem of combining evidence (about associations between items) from different sources characterized by distance functions.  ...  These web pages/concepts were taken as vertices of a non-directed graph, whose edges are weighted with a value computed by a structural proximity measure very similar to formula (1): P struc = max (P in  ... 
arXiv:cs/0309013v1 fatcat:cbzcl6ljmfaapml7rj2vclnkge

Contextfree Grammar Extraction form Web Document using Probabilities Association

Ramesh Thakur
2015 International Journal on Recent and Innovation Trends in Computing and Communication  
In this paper I proposed a method of learning Context-free grammar rules from HTML documents using probabilities association of HTML tags.  ...  These documents are typically formatted for human viewing (HTML) and vary widely from document to document.  ...  These grammar rules will be used to create structural descriptions of the unstructured and semi-structured documents.  ... 
doi:10.17762/ijritcc2321-8169.1504103 fatcat:clyojvtudbccbc7uhdq7rllaga

Comparative Study on Graph-based Information Retrieval: the Case of XML Document

Imane Belahyane, Mouad Mammass, Hasna Abioui, Assmaa Moutaoukkil, Ali Idarrou
2021 International Journal of Advanced Engineering, Management and Science  
In this paper, we will examine the state of the art of IR in XML documents. We will also discuss some works that have used graphs to represent documents in the context of IR.  ...  The processing of massive amounts of data has become indispensable especially with the potential proliferation of big data.  ...  Querying a collection of XML documents means comparing the query with all the XML documents in the document database. Indeed, the XML document is a semi-structured document by essence.  ... 
doi:10.22161/ijaems.78.1 fatcat:jjjjaif6kvdsnc5ekxwuhh5z7e

Semi-structured Documents Mining: A Review and Comparison

Amina Madani, Omar Boussaid, Djamel Eddine Zegour
2013 Procedia Computer Science  
The number of semi-structured documents that is produced is steadily increasing. Thus, it will be essential for discovering new knowledge from them.  ...  In this survey paper, we review popular semi-structured documents mining approaches (structure alone and both structure and content).  ...  They also propose to integrate a textitude measure to the document description process that basically measures the ratio between the weight of the structural information and the weight of the content information  ... 
doi:10.1016/j.procs.2013.09.110 fatcat:sjh47ru4ofcdtezxs6brnfiqta

A modular approach for exploring the semantic structure of technical document collections

Andreas Becks, Stefan Sklorz, Matthias Jarke
2000 Proceedings of the working conference on Advanced visual interfaces - AVI '00  
In this paper, we therefore present a visualization technique based on a modular approach that allows a variety of techniques from semantic document analysis to be used in the visualization of the structure of technical document collections.  ...  In all cases presented here we used a term vector representation for document indexing (with normalized term frequencies as component weighting) and the cosine measure of similarity (cf. section 4).  ... 
doi:10.1145/345513.345361 dblp:conf/avi/BecksSJ00 fatcat:75yhsptvmzfsnhjrp6l3wzbefi
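
The representation mentioned in the snippet above (term vectors weighted by normalized term frequencies, compared with the cosine measure) can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and normalization by total token count; the function names are hypothetical and the paper's actual indexing pipeline may differ.

```python
import math
from collections import Counter


def tf_vector(text):
    """Sparse term vector with normalized term frequencies.

    Assumption: tokens are lowercased whitespace-separated words and each
    weight is count / total number of tokens.
    """
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Two short documents sharing one of their two terms.
sim = cosine(tf_vector("xml tree"), tf_vector("xml schema"))
```

Because the cosine measure is scale-invariant, the normalization only matters when the vectors are also used elsewhere, for example as component weights in a visualization.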

A novel method for measuring semantic similarity for XML schema matching

Buhwan Jeong, Daewon Lee, Hyunbo Cho, Jaewook Lee
2008 Expert systems with applications  
To this end, we present a supervised approach to measure semantic similarity between XML schema documents, and, more importantly, address a novel approach to augment reliably labeled training data from  ...  The paper deals with an essential activity enabling seamless enterprises integration, that is, a similarity-based schema matching.  ...  The tree structure is a native structure for XML documents; hence, it is most related to our problem context.  ... 
doi:10.1016/j.eswa.2007.01.025 fatcat:bjmh3wj75ncpvpeamydd7vl2qy

Construction of a Corpus for the Evaluation of Textual Case-based Reasoning Architectures

Andreas Korger, Joachim Baumeister
2020 Lernen, Wissen, Daten, Analysen  
We report on the construction of an open corpus of regulatory documents in the domain of nuclear safety regulations.  ...  These documents enumerate situations with conditions, that are often dangerous for human and environment and they give advice, rules, and instructions for prevention or handling.  ...  With an aggregation function a global similarity measure is composed by weighting the previously described local similarity functions of information unit cases with the parameters (ω_P, ω_I, ω_R) and  ... 
dblp:conf/lwa/KorgerB20 fatcat:ihhoocluvnf4vfsyi5xqyu46de

Kinship contextualization

Muhammad A. Norozi, Paavo Arvola
2013 Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval - SIGIR '13  
The textual context of an element, structurally, contains traces of evidences. Utilizing this context in scoring is called contextualization.  ...  In this study we hypothesize that the context of an XML-element originated from its preceding and following elements in the sequential ordering of a document improves the quality of retrieval.  ...  XML documents are used as a sample case of semi-structured documents, these documents have hierarchical structure, which is often represented in a form of tree.  ... 
doi:10.1145/2484028.2484111 dblp:conf/sigir/NoroziA13 fatcat:iwkcznwt2fbdrjqomwlpm77i2a

Building semantic trees from XML documents

Joe Tekli, Nathalie Charbel, Richard Chbeir
2016 Journal of Web Semantics  
In this context, XML was introduced as a data representation standard that simplifies the tasks of interoperation and integration among heterogeneous data sources, allowing to represent data in (semi-)structured documents consisting of hierarchically nested elements and atomic attributes.  ...  as the weighted sum of several semantic similarity measures.  ... 
doi:10.1016/j.websem.2016.03.002 fatcat:uqehvo445bgwpeyawpqxoon4ye

A framework for retrieving conceptual knowledge from Web pages

Nacéra Bennacer, Lobna Karoui
2005 Semantic Web Applications and Perspectives  
Our approach takes advantage from both structural and linguistic HTML document characteristics and is based on an incremental evaluation by the user of the conceptual quality.  ...  Their proliferation relies strongly on the automation of ontology building, integration and deployment processes.  ...  To weight the significance of a given term pair we combine two types of measures: co-occurrence in a structural context and co-occurrence in a syntactic context.  ... 
dblp:conf/swap/BennacerK05 fatcat:i4ehy42pancfvbne4yds4c6lde

Knowledge-Based Biomedical Word Sense Disambiguation: An Evaluation and Application to Clinical Document Classification

Vijay N. Garla, Cynthia Brandt
2012 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology  
Knowledge-based similarity methods use the taxonomic structure of a biomedical terminology to compute similarity; these include path finding measures and intrinsic information content (IC) measures.  ...  Distributional similarity methods use the distribution of concepts within a corpus in conjunction with the taxonomic structure to compute similarity; these include corpus IC-based  ...  Funding: This work was supported in part by NIH grant T15 LM07056 from the National Library of Medicine, CTSA grant number UL1 RR024139 from the NIH National Center for Advancing Translational Sciences  ... 
doi:10.1109/hisb.2012.12 dblp:conf/hisb/GarlaB12 fatcat:rcxw74xrarchvawo62ezccvofa

Contextualization using hyperlinks and internal hierarchical structure of Wikipedia documents

Muhammad Ali Norozi, Paavo Arvola, Arjen P. de Vries
2012 Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12  
semi-structured documents.  ...  Context surrounding hyperlinked semi-structured documents, externally in the form of citations and internally in the form of hierarchical structure, contains a wealth of useful but implicit evidence about  ...  structure of semi-structured documents.  ... 
doi:10.1145/2396761.2396855 dblp:conf/cikm/NoroziAV12 fatcat:gmpjpuxrm5b5hljmsgdp6sf7di