
Template Induction over Unstructured Email Corpora

Julia Proskurnia, Marc-Allen Cartright, Lluis Garcia-Pueyo, Ivo Krka, James B. Wendt, Tobias Kaufmann, Balint Miklos
2017 Proceedings of the 26th International Conference on World Wide Web - WWW '17  
Unsupervised template induction over email data is a central component in applications such as information extraction, document classification, and auto-reply.  ...  However much less work has been done in performing the same task over unstructured email data.  ...  Given this setting, the primary task becomes one of establishing that template induction can be effective for the unstructured content.  ... 
doi:10.1145/3038912.3052631 dblp:conf/www/ProskurniaCPKWK17 fatcat:xzzqz4ruzneulc4cbl3jswvcle

Can We Make Information Extraction More Adaptive? [chapter]

Yorick Wilks, Roberta Catizone
1999 Lecture Notes in Computer Science  
We shall discuss attempts to derive templates directly from corpora; to derive knowledge structures and lexicons directly from corpora, including discussion of the recent LE project ECRAN which attempted  ...  and templates.  ...  Using some partial parser on domain corpora to locate "significant triples" over all instances of the word set selected by (i) in the manner of Grishman and Sterling [31]. 4.  ... 
doi:10.1007/3-540-48089-7_1 fatcat:q33ux5rdg5e6vnpq3mrchi4lpa

Survey of Stages of Developing the Information Extraction Systems from the Web

Asmaa Ahmed Hamed Khalil Elsaeidy
2015 International Journal Of Mechanical Engineering And Information Technology  
The traditional information extraction systems are focused on satisfying precise, narrow and pre-specified requests from small homogenous corpora.  ...  Wrapper induction is a special group of methods based on the fact that most web pages are generated by templates.  ...  Today's enterprises are generating and consuming rapidly increasing quantities of unstructured textual data like emails, web pages, news articles, blog posts, online reviews and comments and call centre  ... 
doi:10.18535/ijmeit/v3i11.01 fatcat:nytac7qcvzelpjblwzltgmkt64

An Overview and Classification of Adaptive Approaches to Information Extraction [chapter]

Christian Siefkes, Peter Siniakov
2005 Lecture Notes in Computer Science  
Preprocessing of Input Texts: Text corpora often consist of unstructured, "raw" natural language texts.  ...  Now it ranges from classical pattern-oriented systems over numerous combinations of different AI and statistical methods to rather new approaches such as wrapper induction.  ... 
doi:10.1007/11603412_6 fatcat:szbm5g25ivd7xnaydroojhbncy

Ontology learning from text

Wilson Wong, Wei Liu, Mohammed Bennamoun
2012 ACM Computing Surveys  
This together with the advanced state in related areas, such as natural language processing, have fueled research into ontology learning over the past decade.  ...  Wong, RMIT University, Melbourne Victoria 3000, Australia; email: wilson.wong@; W. Liu, University of Western Australia, Crawley WA 6009, Australia; email: wei@csse.uwa.; M.  ...  Bennamoun, University of Western Australia, Crawley WA 6009, Australia; email: bennamou@  ... 
doi:10.1145/2333112.2333115 fatcat:4uo5bazvivh3fpbxj7yf4yoe2u

Bayesian Information Extraction Network [article]

Leonid Peshkin, Avi Pfeffer
2003 arXiv   pre-print
The last one uses constraints obtained by rule induction over predicates like token identity, word length and capitalization, and simple semantic features.  ...  At the opposite end lies the task of extracting information from free text which, although unstructured, is assumed to be grammatical.  ... 
arXiv:cs/0306039v1 fatcat:wb335ig34zcdjpgm5uinf4ra2u

Cerno: Light-weight tool support for semantic annotation of textual documents

Nadzeya Kiyavitskaya, Nicola Zeni, James R. Cordy, Luisa Mich, John Mylopoulos
2009 Data & Knowledge Engineering  
However, most web data remain unstructured because of the difficulty and the cost of manually annotating text.  ...  Among these is SemTag [8], which performs automated semantic tagging of large corpora.  ...  Figure 4. Database schema template for accommodation ads.  ... 
doi:10.1016/j.datak.2009.07.012 fatcat:u577vp6jwzf3noir3iqbthcblq

A Computational Inflection for Scientific Discovery [article]

Tom Hope, Doug Downey, Oren Etzioni, Daniel S. Weld, Eric Horvitz
2022 arXiv   pre-print
We propose a research agenda of task-guided knowledge retrieval, in which systems counter humans' bounded capacity by ingesting corpora of scientific knowledge and retrieving inspirations, explanations  ...  In parallel, exponential growth in data processing power has fueled remarkable advances in AI, including self-supervised neural models capable of learning powerful representations from large-scale unstructured  ...  Along with the trend in society, private communication channels among researchers are also digital: emails, online calls and messages.  ... 
arXiv:2205.02007v1 fatcat:d2dsit6msng6xhlpkb7ckgoi6q

Information Extraction

Sunita Sarawagi
2007 Foundations and Trends in Databases  
the abundance of unstructured data.  ...  This review is a survey of information extraction research of over two decades from these diverse communities.  ...  Using this template we can easily calculate the maximum number of features of each type.  ... 
doi:10.1561/1900000003 fatcat:bw42p45zkbgajmroydls2m2giq

Lynx D2.5 Report on Lynx acquired vocabularies

Ilan Kernerman, Patricia Martín Chozas, Andis Lagzdiņš, Jorge Gracia
2019 Zenodo  
part of Work Package 2, which is concerned with the acquisition and management of linguistic data for Lynx, and includes the process followed for (a) the identification of existing language resources (corpora  ...  Terminology resources in semi-structured and unstructured formats, e.g.  ...  It contains over 400 terms in English that have been extracted from 92 resources.  ... 
doi:10.5281/zenodo.3558710 fatcat:sx7zrwlcc5cjjpc4t5euywfqgq

Developing computational infrastructure for the CorCenCC corpus: The National Corpus of Contemporary Welsh

Dawn Knight, Fernando Loizides, Steven Neale, Laurence Anthony, Irena Spasić
2020 Language Resources and Evaluation  
Instead, an inductive approach was used to produce the codes (see Table 6).  ...  Collectively, this resource brings together hundreds of researchers working in over 34 languages.  ... 
doi:10.1007/s10579-020-09501-9 fatcat:h6lxef65ubegvlf7ewbgqm3egq

LearningPinocchio: adaptive information extraction for real world applications

2004 Natural Language Engineering  
Finally we are addressing the extension to template-based IE learning. Fig. 1. An architecture in LearningPinocchio. Fig. 2. Rule induction steps.  ...  Rules are learnt by generalising over a set of examples marked via XML tags in a training corpus.  ... 
doi:10.1017/s135132490400333x fatcat:y4icfby2ivaavpluicnor6rjta

Human language technologies for knowledge management

S. Staab
2001 IEEE Intelligent Systems  
Businesses rely on unstructured information available over the Internet, intranets, email, press releases, online newspapers, digital libraries, and other sources.  ...  Provided corpora might be unbalanced with respect to types or genres (emails could be underrepresented with respect to free texts) or might show peculiar regularities because of wrong selection criteria  ...  He has organized multiple international symposia, given tutorials, and published over 50 technical and tutorial articles in the areas of language generation, multimedia presentation, text summarization  ... 
doi:10.1109/5254.972104 fatcat:4w6vfn2p3rhkbfg6cm53yxtvha

Uncovering the to-dos hidden in your in-box

D. M. Sow, J. S. Davis, M. R. Ebling, A. Misra, L. Bergman
2006 IBM Systems Journal  
Huang, “Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora,” CIIR Technical Report IR-418, University of Massachusetts, Amherst, MA (2004). V.  ...  We then describe the SCOUT architecture and give an overview of the e-mail analysis components.  ... 
doi:10.1147/sj.454.0739 fatcat:wv3d2emgvnew3g6iyp76s3pdlq

Ontology-based information extraction: An introduction and a survey of current approaches

Daya C. Wimalasuriya, Dejing Dou
2010 Journal of Information Science  
However, since no such conferences or standard text corpora currently exist for OBIE, most researchers have compiled their own corpora for OBIE systems.  ...  As such, it can be expected that having standard text corpora and well defined tasks will have a similar positive impact on the development of ontology based information extraction.  ...  They can be either unstructured (e.g., text files) or semi-structured (e.g., web pages using a particular template such as pages from Wikipedia).  ... 
doi:10.1177/0165551509360123 fatcat:glimm55usbdwlmlklkglhkxrly