
Keyphrase Extraction using Sequential Labeling [article]

Sujatha Das Gollapalli, Xiao-li Li
2016 arXiv   pre-print
Keyphrases efficiently summarize a document's content and are used in various document processing and retrieval tasks. Several unsupervised techniques and classifiers exist for extracting keyphrases from text documents. Most of these methods operate at a phrase-level and rely on part-of-speech (POS) filters for candidate phrase generation. In addition, they do not directly handle keyphrases of varying lengths. We overcome these modeling shortcomings by addressing keyphrase extraction as a
more » ... tial labeling task in this paper. We explore a basic set of features commonly used in NLP tasks as well as predictions from various unsupervised methods to train our taggers. In addition to a more natural modeling for the keyphrase extraction problem, we show that tagging models yield significant performance benefits over existing state-of-the-art extraction methods.
arXiv:1608.00329v2 fatcat:xjxmry4ae5eg7doek277i3dvtm

On Zero-Modified Poisson-Sujatha Distribution to Model Overdispersed Count Data

Wesley Bertoli Da Silva, Angélica Maria Tortola Ribeiro, Katiane Silva Conceição, Marinho Gomes Andrade, Francisco Louzada Neto
2018 Austrian Journal of Statistics  
It will be shown that the zero modification can be incorporated by using the zero-truncated Poisson-Sujatha distribution.  ...  A simple reparametrization of the probability function will allow us to represent the zero-modified Poisson-Sujatha distribution as a hurdle model.  ...  Recently, the Poisson-Sujatha distribution was obtained by compounding the Poisson with a Sujatha distribution.  ... 
doi:10.17713/ajs.v47i3.590 fatcat:tskja2ceyvdjddi6nsminx2eue

A Search/Crawl Framework for Automatically Acquiring Scientific Documents [article]

Sujatha Das Gollapalli, Krutarth Patel, Cornelia Caragea
2016 arXiv   pre-print
Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. Next,
more » ... ch papers and sources of research papers are identified from the search results using accurate classification modules. Our experiments highlight not only the performance of our individual classifiers but also the effectiveness of our overall Search/Crawl framework. Indeed, we were able to obtain approximately 0.665 million research documents through our fully-automated framework using about 0.076 million queries. These prolific results position Web search as an effective alternative to crawl methods for acquiring both the actual documents and seed URLs for future crawls.
arXiv:1604.05005v1 fatcat:sq4pssjmh5exnm5detl47djwyq

Phrase Pair Classification for Identifying Subtopics [chapter]

Sujatha Das, Prasenjit Mitra, C. Lee Giles
2012 Lecture Notes in Computer Science  
Automatic identification of subtopics for a given topic is desirable because it eliminates the need for manual construction of domain-specific topic hierarchies. In this paper, we design features based on corpus statistics to design a classifier for identifying the (subtopic, topic) links between phrase pairs. We combine these features along with the commonly-used syntactic patterns to classify phrase pairs from datasets in Computer Science and WordNet. In addition, we show a novel application
more » ... f our is-a-subtopic-of classifier for query expansion in Expert Search and compare it with pseudo-relevance feedback.
doi:10.1007/978-3-642-28997-2_48 fatcat:tyrjnzuofzh7bdcrykxzv7yqti

Performance of colloidal CdS sensitized solar cells with ZnO nanorods/nanoparticles

Anurag Roy, Partha Pratim Das, Mukta Tathavadekar, Sumita Das, Parukuttyamma Sujatha Devi
2017 Beilstein Journal of Nanotechnology  
As an alternative photosensitizer in dye-sensitized solar cells, bovine serum albumin (BSA) (a nonhazardous protein) was used in the synthesis of colloidal CdS nanoparticles (NPs). This system has been employed to replace the commonly used N719 dye molecule. Various nanostructured forms of ZnO, namely, nanorod and nanoparticle-based photoanodes, have been sensitized with colloidal CdS NPs to evaluate their effective performance towards quantum dot sensitized solar cells (QDSSCs). A polysulphide
more » ... (Sₓ²⁻)-based electrolyte and CuₓS counter electrode were used for cell fabrication and testing. An interesting improvement in the performance of the device by imposing nanorods as a scattering layer on a particle layer has been observed. As a consequence, a maximum conversion efficiency of 1.06% with an open-circuit voltage (V_OC) of 0.67 V was achieved for the ZnO nanorod/nanoparticle assembled structure.
doi:10.3762/bjnano.8.23 pmid:28243559 pmcid:PMC5301656 fatcat:24imh6mpp5gabgbz7j6wd7w6e4

Deep Learning for Character-Based Information Extraction [chapter]

Yanjun Qi, Sujatha G. Das, Ronan Collobert, Jason Weston
2014 Lecture Notes in Computer Science  
Yanjun Qi, Sujatha Das G, Ronan Collobert, and Jason Weston the-art performance on WS and POS and has achieved the state-of-the-art predictive level for NER and SS tasks (detailed discussion in [1]).  ...  Das G Ronan Collobert Jason Weston 1 System Figure 1: The basic deep learning system for character-based IE tagging. 2 Experiments 2.1 Data Sets: Table 1 summarizes some statistics of the  ... 
doi:10.1007/978-3-319-06028-6_74 fatcat:ubzh4ytyrngapdfnijcizhysmi

Automatic Identification and Data Extraction from 2-Dimensional Plots in Digital Documents [article]

William Brouwer, Saurabh Kataria, Sujatha Das, Prasenjit Mitra, C. L. Giles
2008 arXiv   pre-print
Most search engines index the textual content of documents in digital libraries. However, scholarly articles frequently report important findings in figures for visual impact and the contents of these figures are not indexed. These contents are often invaluable to the researcher in various fields, for the purposes of direct comparison with their own work. Therefore, searching for figures and extracting figure data are important problems. To the best of our knowledge, there exists no tool to
more » ... matically extract data from figures in digital documents. If we can extract data from these images automatically and store them in a database, an end-user can query and combine data from multiple digital documents simultaneously and efficiently. We propose a framework based on image analysis and machine learning to extract information from 2-D plot images and store them in a database. The proposed algorithm identifies a 2-D plot and extracts the axis labels, legend and the data points from the 2-D plot. We also segregate overlapping shapes that correspond to different data points. We demonstrate performance of individual algorithms, using a combination of generated and real-life images.
arXiv:0809.1802v1 fatcat:dxeuw7aukbb3bn5xq523v6nfbe

Improving Researcher Homepage Classification with Unlabeled Data

Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles
2015 ACM Transactions on the Web  
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on "non-homepages" present on
more » ... rrent-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: "How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?" We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on the current-day academic websites. We also propose a novel technique for "learning a conforming pair of classifiers" that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset. Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH is a technique that uses hash functions for efficient encoding of features. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. However, using hashed feature representations, a performance degradation is observed possibly due to feature collisions. Finally, we evaluate other semisupervised algorithms for homepage classification. 
We show that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.
doi:10.1145/2767135 fatcat:sa5amlvswveqhn3hh5hwvfu5ry

Researcher homepage classification using unlabeled data

Sujatha Das Gollapalli, Cornelia Caragea, Prasenjit Mitra, C. Lee Giles
2013 Proceedings of the 22nd international conference on World Wide Web - WWW '13  
A classifier that determines if a webpage is relevant to a specified set of topics comprises a key component for focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changed content on the Web? We investigate this question in the context of researcher homepage crawling. We show experimentally that classifiers trained on existing datasets for homepage identification underperform while classifying "irrelevant"
more » ... pages on current-day academic websites. As an alternative to obtaining datasets to retrain the classifier for the new content, we propose to use effectively unlimited amounts of unlabeled data readily available from these websites in a co-training scenario. To this end, we design novel URL-based features and use them in conjunction with content-based features as complementary views of the data to obtain remarkable improvements in accurately identifying homepages from the current-day university websites. In addition, we propose a novel technique for "learning a conforming pair of classifiers" using mini-batch gradient descent. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We demonstrate that tuning the classifiers so that they make "similar" predictions on unlabeled data strongly corresponds to the effect achieved by co-training algorithms. We argue that this loss formulation provides insight into understanding the co-training process and can be used even in absence of a validation set.
doi:10.1145/2488388.2488430 dblp:conf/www/GollapalliCMG13 fatcat:eihwojyfmrgi5kqa2cyadgiuae

High-Pressure Phase Transitions of Morphologically Distinct Zn2SnO4 Nanostructures

Partha Pratim Das, P. Sujatha Devi, Douglas A. Blom, Thomas Vogt, Yongjae Lee
2019 ACS Omega  
Many aspects of nanostructured materials at high pressures are still unexplored. We present here the high-pressure structural behavior of two Zn2SnO4 nanomaterials of inverse spinel type, one a particle with a size of ∼7 nm [zero dimensional (0-D)] and the other with a chain-like [one dimensional (1-D)] morphology. We performed in situ micro-Raman and synchrotron X-ray diffraction measurements and observed that the cation disordering of the 0-D nanoparticle is preserved up to ∼40 GPa, suppressing
more » ... he reported martensitic phase transformation. On the other hand, an irreversible phase transition is observed from the 1-D nanomaterial into a new and dense high-pressure orthorhombic CaFe2O4-type structure at ∼40 GPa. The pressure-treated 0-D and 1-D nanomaterials have distinct diffuse reflectance and emission properties. In particular, a heterojunction between the inverse spinel and quenchable orthorhombic phases allows the use of 1-D Zn2SnO4 nanomaterials as efficient photocatalysts as shown by the degradation of the textile pollutant methylene blue.
doi:10.1021/acsomega.9b01361 pmid:31460152 pmcid:PMC6649287 fatcat:hoa4i5ltm5br7lnlf3fetcqqvi


Yuxiang Zhang, Yaocheng Chang, Xiaoqing Liu, Sujatha Das Gollapalli, Xiaoli Li, Chunjing Xiao
2017 Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM '17  
Traditional supervised keyphrase extraction models depend on the features of labelled keyphrases while prevailing unsupervised models mainly rely on the structure of the word graph, with candidate words as nodes and edges capturing the co-occurrence information between words. However, systematically integrating all this multidimensional heterogeneous information into a unified model is relatively unexplored. In this paper, we focus on how to effectively exploit multidimensional information to
more » ... the keyphrase extraction performance (MIKE). Specifically, we propose a random-walk parametric model, MIKE, that learns the latent representation for a candidate keyphrase that captures the mutual influences among all information, and simultaneously optimizes the parameters and ranking scores of candidates in the word graph. We use the gradient-descent algorithm to optimize our model and report comprehensive experiments with two publicly-available WWW and KDD datasets in Computer Science. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art graph-based keyphrase extraction approaches.
doi:10.1145/3132847.3132956 dblp:conf/cikm/ZhangCLG0X17 fatcat:rypdj2lo2bbi3eb7tphw67j5yq

Extracting Researcher Metadata with Labeled Features [chapter]

Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles
2014 Proceedings of the 2014 SIAM International Conference on Data Mining  
Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage and expertise search. Due to inherent diversity in values for certain metadata fields (e.g., affiliation) supervised algorithms require a large number of labeled examples for accurately identifying values for these fields. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We
more » ... feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with few fully-labeled examples. We study two types of labeled features: (1) Dictionary features provide unigram hints related to specific metadata fields, whereas, (2) Proximity features capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach along with labeled features provides significant improvements in the tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improves by 9%.
doi:10.1137/1.9781611973440.85 dblp:conf/sdm/GollapalliQMG14 fatcat:m24zh2re3bbxdadwl4o5n7atby

Enhanced stability of Zn2SnO4 with N719, N3 and eosin Y dye molecules for DSSC application

Partha Pratim Das, Anurag Roy, Sumita Das, Parukuttyamma Sujatha Devi
2016 Physical Chemistry, Chemical Physics - PCCP  
We have studied the interaction of N3, N719 and eosin Y photosensitizers with Zn2SnO4 and established its better stability compared to ZnO.
doi:10.1039/c5cp04716a pmid:26498509 fatcat:azi4zdgu35cnnd7y5ebbml3y2e

Document Analysis and Retrieval Tasks in Scientific Digital Libraries [chapter]

Sujatha Das Gollapalli, Cornelia Caragea, Xiaoli Li, C. Lee Giles
2015 Communications in Computer and Information Science  
Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference
more » ... ary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeerX, which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.
doi:10.1007/978-3-319-25485-2_1 fatcat:dfz6nxsn7zaz5p6omsef5rkglu

Similar researcher search in academic environments

Sujatha Das Gollapalli, Prasenjit Mitra, C. Lee Giles
2012 Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries - JCDL '12  
Entity search is an emerging IR and NLP task that involves the retrieval of entities of a specific type in response to a query. We address the "similar researcher search" or the "researcher recommendation" problem, an instance of "similar entity search" for the academic domain. In response to a 'researcher name' query, the goal of a researcher recommender system is to output the list of researchers that have similar expertise as that of the queried researcher. We propose models for computing
more » ... ilarity between researchers based on expertise profiles extracted from their publications and academic homepages. We provide results of our models for the recommendation task on two publicly-available datasets. To the best of our knowledge, we are the first to address content-based researcher recommendation in an academic setting and demonstrate it for Computer Science via our system, ScholarSearch.
doi:10.1145/2232817.2232849 dblp:conf/jcdl/GollapalliMG12 fatcat:opljsvgag5hgfkwukfirut4ze4