Digital Library and Archiving for Qatar

Tarek Kanan, Sagnik Ray Choudhury, C. Lee Giles, Prashant Chandrasekar, Edward A. Fox
2015 Bulletin of IEEE Technical Committee on Digital Libraries  
Crawling and Indexing Qatari Scholarly Content: SeerQ

SeerSuite is a collection management system for digital libraries, developed at Penn State. It includes: 1) a Web crawler for scholarly articles; 2) a machine-learning-based automated system for extracting metadata (title, abstract, author name/affiliation, citations); 3) a module for ingesting the extracted information into a database and Solr; and 4) a JSP-based front end for users. SeerQ reflects our modification of SeerSuite to address Qatari requirements. It uses both Heritrix and an in-house OAI-PMH-based crawler, which accesses digital repositories in Qatar that expose their metadata and content, especially QScience, a publisher in Doha focusing on scholarly content produced in Qatar. Other seeds for crawling were provided by the Qatar National Library and cover websites such as QCRI, Qatar University, and various research establishments. We have crawled around 4000 documents and ingested around 3300 of them. Metadata records with an author name, title, and citations are available through OAI-PMH.
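To make the OAI-PMH harvesting step concrete, the following is a minimal sketch of how a client might build a ListRecords request and read Dublin Core records from the response. It is not the SeerQ crawler: the function names, the base URL, and the sample response are hypothetical and chosen only for illustration. Only the protocol elements (the ListRecords verb, the oai_dc metadata prefix, and the resumptionToken used for paging) come from the OAI-PMH standard.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper).

    When a resumption token is present it must be the only argument
    besides the verb, as the protocol requires.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from a ListRecords response body."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [c.text or "" for c in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An empty or missing token means the list is complete.
    return records, (token.strip() or None)


# Hypothetical response, made up for this sketch (not real repository data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Article</dc:title>
          <dc:creator>Author One</dc:creator>
          <dc:creator>Author Two</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE_RESPONSE)
    print(records, token)
    print(build_list_records_url("https://example.org/oai", token))
```

A real harvester would fetch each URL, call parse_list_records on the body, and repeat with the returned token until it is None. Error responses, deleted records, and rate limiting are left out of this sketch.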
dblp:journals/tcdl/KananCGCF15 fatcat:yunwsr5wkfb2jakn3tctjdwhh4