Automated labeling of bibliographic data extracted from biomedical online journals

Jongwoo Kim, Daniel X. Le, George R. Thoma (proceedings editors: Tapas Kanungo, Elisa H. Barney Smith, Jianying Hu, Paul B. Kantor)
2003 Document Recognition and Retrieval X  
A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine's MEDLINE database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented zones in an article's HTML pages as specific bibliographic data. Results from experiments conducted ... 1,149 medical articles from forty-seven journal issues are presented.
doi:10.1117/12.476047 dblp:conf/drr/KimLT03 fatcat:d63h5lijjfhrrp6nm2o2q2htja