Keyphrase Extraction using Sequential Labeling
[article]
2016
arXiv
pre-print
Keyphrases efficiently summarize a document's content and are used in various document processing and retrieval tasks. Several unsupervised techniques and classifiers exist for extracting keyphrases from text documents. Most of these methods operate at a phrase-level and rely on part-of-speech (POS) filters for candidate phrase generation. In addition, they do not directly handle keyphrases of varying lengths. We overcome these modeling shortcomings by addressing keyphrase extraction as a
arXiv:1608.00329v2
fatcat:xjxmry4ae5eg7doek277i3dvtm
... tial labeling task in this paper. We explore a basic set of features commonly used in NLP tasks as well as predictions from various unsupervised methods to train our taggers. In addition to a more natural modeling for the keyphrase extraction problem, we show that tagging models yield significant performance benefits over existing state-of-the-art extraction methods.
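To illustrate what casting keyphrase extraction as a sequential labeling task can look like, here is a minimal sketch that turns gold keyphrases into per-token tags and decodes tags back into phrases of any length. The B/I/O tag scheme and the helper names `bio_tags` and `decode` are assumptions made for this example; the abstract does not say which label set or tagger the authors used.

```python
def bio_tags(tokens, keyphrases):
    """Assign one tag per token: B (phrase start), I (inside), O (outside)."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Longer phrases are matched first so they take precedence over substrings.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and span_free:
                tags[i] = "B"
                for j in range(i + 1, i + n):
                    tags[j] = "I"
    return tags


def decode(tokens, tags):
    """Recover phrases from a tag sequence, whatever their length."""
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


if __name__ == "__main__":
    toks = "We study keyphrase extraction with conditional random fields".split()
    gold = ["keyphrase extraction", "conditional random fields"]
    print(list(zip(toks, bio_tags(toks, gold))))
```

Because the tags are assigned token by token, a two-word and a three-word keyphrase are handled by the same mechanism, with no phrase-level candidate filter involved.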
On Zero-Modified Poisson-Sujatha Distribution to Model Overdispersed Count Data
2018
Austrian Journal of Statistics
It will be shown that the zero modification can be incorporated by using the zero-truncated Poisson-Sujatha distribution. ...
A simple reparametrization of the probability function will allow us to represent the zero-modified Poisson-Sujatha distribution as a hurdle model. ...
Recently, the Poisson-Sujatha distribution was obtained by compounding the Poisson with a Sujatha distribution. ...
doi:10.17713/ajs.v47i3.590
fatcat:tskja2ceyvdjddi6nsminx2eue
A Search/Crawl Framework for Automatically Acquiring Scientific Documents
[article]
2016
arXiv
pre-print
Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. Next,
arXiv:1604.05005v1
fatcat:sq4pssjmh5exnm5detl47djwyq
... ch papers and sources of research papers are identified from the search results using accurate classification modules. Our experiments highlight not only the performance of our individual classifiers but also the effectiveness of our overall Search/Crawl framework. Indeed, we were able to obtain approximately 0.665 million research documents through our fully-automated framework using about 0.076 million queries. These prolific results position Web search as an effective alternative to crawl methods for acquiring both the actual documents and seed URLs for future crawls.
Phrase Pair Classification for Identifying Subtopics
[chapter]
2012
Lecture Notes in Computer Science
Automatic identification of subtopics for a given topic is desirable because it eliminates the need for manual construction of domain-specific topic hierarchies. In this paper, we design features based on corpus statistics to design a classifier for identifying the (subtopic, topic) links between phrase pairs. We combine these features along with the commonly-used syntactic patterns to classify phrase pairs from datasets in Computer Science and WordNet. In addition, we show a novel application
doi:10.1007/978-3-642-28997-2_48
fatcat:tyrjnzuofzh7bdcrykxzv7yqti
... f our is-a-subtopic-of classifier for query expansion in Expert Search and compare it with pseudo-relevance feedback.
Performance of colloidal CdS sensitized solar cells with ZnO nanorods/nanoparticles
2017
Beilstein Journal of Nanotechnology
As an alternative photosensitizer in dye-sensitized solar cells, bovine serum albumin (BSA) (a nonhazardous protein) was used in the synthesis of colloidal CdS nanoparticles (NPs). This system has been employed to replace the commonly used N719 dye molecule. Various nanostructured forms of ZnO, namely, nanorod and nanoparticle-based photoanodes, have been sensitized with colloidal CdS NPs to evaluate their effective performance towards quantum dot sensitized solar cells (QDSSCs). A polysulphide
doi:10.3762/bjnano.8.23
pmid:28243559
pmcid:PMC5301656
fatcat:24imh6mpp5gabgbz7j6wd7w6e4
... (S$_x^{2-}$)-based electrolyte and Cu$_x$S counter electrode were used for cell fabrication and testing. An interesting improvement in the performance of the device by imposing nanorods as a scattering layer on a particle layer has been observed. As a consequence, a maximum conversion efficiency of 1.06% with an open-circuit voltage (V$_{OC}$) of 0.67 V was achieved for the ZnO nanorod/nanoparticle assembled structure.
Deep Learning for Character-Based Information Extraction
[chapter]
2014
Lecture Notes in Computer Science
Yanjun Qi, Sujatha Das G, Ronan Collobert, and Jason Weston
... the-art performance on WS and POS and has achieved the state-of-the-art predictive level for NER and SS tasks (detailed discussion in [1]). ...
1 System
Figure 1: The basic deep learning system for character-based IE tagging.
2 Experiments
2.1 Data Sets: Table 1 summarizes some statistics of the ...
doi:10.1007/978-3-319-06028-6_74
fatcat:ubzh4ytyrngapdfnijcizhysmi
Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents
[article]
2008
arXiv
pre-print
Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to
arXiv:0809.1802v1
fatcat:dxeuw7aukbb3bn5xq523v6nfbe
... matically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning to extract information from 2-D plot images and store them in a database. The proposed algorithm identifies a 2-D plot and extracts the axis labels, legend and the data points from the 2-D plot. We also segregate overlapping shapes that correspond to different data points. We demonstrate performance of individual algorithms, using a combination of generated and real-life images.
Improving Researcher Homepage Classification with Unlabeled Data
2015
ACM Transactions on the Web
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on "non-homepages" present on
doi:10.1145/2767135
fatcat:sa5amlvswveqhn3hh5hwvfu5ry
... rrent-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: "How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?" We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on the current-day academic websites. We also propose a novel technique for "learning a conforming pair of classifiers" that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset. Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH is a technique that uses hash functions for efficient encoding of features. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. However, using hashed feature representations, a performance degradation is observed possibly due to feature collisions. Finally, we evaluate other semisupervised algorithms for homepage classification. 
We show that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.
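The abstract above attributes the degradation under feature hashing to possible feature collisions. The following is a minimal sketch of the hashing trick that makes such collisions visible. It is an illustration only: the function names `hash_features` and `count_collisions`, the use of MD5 as the hash, and the example tokens are assumptions for this sketch, not details from the article.

```python
import hashlib


def _bucket(token, n_buckets):
    """Map a token to a bucket index with a stable hash (MD5 is an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_features(tokens, n_buckets):
    """Encode a bag of tokens as a fixed-length count vector."""
    vec = [0] * n_buckets
    for tok in tokens:
        vec[_bucket(tok, n_buckets)] += 1
    return vec


def count_collisions(vocab, n_buckets):
    """Count distinct tokens that share a bucket with an earlier token."""
    buckets = {}
    for tok in set(vocab):
        buckets.setdefault(_bucket(tok, n_buckets), []).append(tok)
    return sum(len(toks) - 1 for toks in buckets.values())


if __name__ == "__main__":
    page_tokens = ["faculty", "publications", "faculty", "cv", "teaching"]
    for size in (2, 8, 64):
        print(size, hash_features(page_tokens, size),
              count_collisions(page_tokens, size))
```

With fewer buckets than distinct tokens, collisions are unavoidable, so unrelated features end up sharing a dimension. That is the trade-off between compact encoding and representational fidelity that the abstract points to.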
Researcher homepage classification using unlabeled data
2013
Proceedings of the 22nd international conference on World Wide Web - WWW '13
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant"
doi:10.1145/2488388.2488430
dblp:conf/www/GollapalliCMG13
fatcat:eihwojyfmrgi5kqa2cyadgiuae
... pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set.
High-Pressure Phase Transitions of Morphologically Distinct Zn2SnO4 Nanostructures
2019
ACS Omega
Many aspects of nanostructured materials at high pressures are still unexplored. We present here, high-pressure structural behavior of two Zn2SnO4 nanomaterials with inverse spinel type, one a particle with size of ∼7 nm [zero dimensional (0-D)] and the other with a chain-like [one dimensional (1-D)] morphology. We performed in situ micro-Raman and synchrotron X-ray diffraction measurements and observed that the cation disordering of the 0-D nanoparticle is preserved up to ∼40 GPa, suppressing
doi:10.1021/acsomega.9b01361
pmid:31460152
pmcid:PMC6649287
fatcat:hoa4i5ltm5br7lnlf3fetcqqvi
... he reported martensitic phase transformation. On the other hand, an irreversible phase transition is observed from the 1-D nanomaterial into a new and dense high-pressure orthorhombic CaFe2O4-type structure at ∼40 GPa. The pressure-treated 0-D and 1-D nanomaterials have distinct diffuse reflectance and emission properties. In particular, a heterojunction between the inverse spinel and quenchable orthorhombic phases allows the use of 1-D Zn2SnO4 nanomaterials as efficient photocatalysts as shown by the degradation of the textile pollutant methylene blue.
MIKE
2017
Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM '17
Traditional supervised keyphrase extraction models depend on the features of labelled keyphrases while prevailing unsupervised models mainly rely on structure of the word graph, with candidate words as nodes and edges capturing the co-occurrence information between words. However, systematically integrating all these multidimensional heterogeneous information into a unified model is relatively unexplored. In this paper, we focus on how to effectively exploit multidimensional information to
doi:10.1145/3132847.3132956
dblp:conf/cikm/ZhangCLG0X17
fatcat:rypdj2lo2bbi3eb7tphw67j5yq
... the keyphrase extraction performance (MIKE). Specifically, we propose a random-walk parametric model, MIKE, that learns the latent representation for a candidate keyphrase that captures the mutual influences among all information, and simultaneously optimizes the parameters and ranking scores of candidates in the word graph. We use the gradient-descent algorithm to optimize our model and show the comprehensive experiments with two publicly-available WWW and KDD datasets in Computer Science. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art graph-based keyphrase extraction approaches.
Extracting Researcher Metadata with Labeled Features
[chapter]
2014
Proceedings of the 2014 SIAM International Conference on Data Mining
Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation) supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We
doi:10.1137/1.9781611973440.85
dblp:conf/sdm/GollapalliQMG14
fatcat:m24zh2re3bbxdadwl4o5n7atby
... feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with few fully-labeled examples. We study two types of labeled features: (1) Dictionary features provide unigram hints related to specific metadata fields, whereas, (2) Proximity features capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
Enhanced stability of Zn2SnO4 with N719, N3 and eosin Y dye molecules for DSSC application
2016
Physical Chemistry, Chemical Physics - PCCP
We have studied the interaction of N3, N719 and eosin Y photosensitizers with Zn2SnO4 and established its better stability compared to ZnO.
doi:10.1039/c5cp04716a
pmid:26498509
fatcat:azi4zdgu35cnnd7y5ebbml3y2e
Document Analysis and Retrieval Tasks in Scientific Digital Libraries
[chapter]
2015
Communications in Computer and Information Science
Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference
doi:10.1007/978-3-319-25485-2_1
fatcat:dfz6nxsn7zaz5p6omsef5rkglu
... ary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeerX, which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.
Similar researcher search in academic environments
2012
Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries - JCDL '12
Entity search is an emerging IR and NLP task that involves the retrieval of entities of a specific type in response to a query. We address the "similar researcher search" or the "researcher recommendation" problem, an instance of "similar entity search" for the academic domain. In response to a 'researcher name' query, the goal of a researcher recommender system is to output the list of researchers that have similar expertise as that of the queried researcher. We propose models for computing
doi:10.1145/2232817.2232849
dblp:conf/jcdl/GollapalliMG12
fatcat:opljsvgag5hgfkwukfirut4ze4
... ilarity between researchers based on expertise profiles extracted from their publications and academic homepages. We provide results of our models for the recommendation task on two publicly-available datasets. To the best of our knowledge, we are the first to address content-based researcher recommendation in an academic setting and demonstrate it for Computer Science via our system, ScholarSearch.