Automatic negation detection in narrative pathology reports

Ying Ou, Jon Patrick
2015 Artificial Intelligence in Medicine  
Pathology reports provide vital information for the clinical management of cancer patients, allowing accurate diagnosis, staging and determination of treatment and prognosis. However, traditional narrative reports raise several issues compared with structured reports. For example, essential elements are occasionally omitted, especially negative results, which are not always reported clearly. In addition, referring doctors often find it difficult to identify the necessary [...] in a free-text pathology report to justify a given diagnosis. Structured pathology reports offer a number of advantages: they can ensure the accuracy and completeness of pathology reporting, and it is easier for referring doctors to glean pertinent information from them, which improves communication between pathologists and clinicians. Furthermore, they facilitate efficient extraction of information for cancer registries, data collection and research purposes. The goal of this thesis is to extract pertinent information from free-text pathology reports and automatically populate structured reports for three cancers, namely melanoma, colorectal cancer and lymphoma, and to identify the commonalities and differences in processing principles to obtain maximum accuracy. Unlike previous works that regard the task as automatic structuring of sentences of interest in narrative medical reports, this study aims to populate certain fields in structured reports based on a global view of the entire document. This is challenging, as it requires either inference from the entities or a combination of various entities. The fields predefined in structured templates were determined mainly by utilizing three structured cancer reporting protocols from the Australia and the Royal College of Pathologists of Australia as well as advice from clinicians and pathologists. A detailed corpus analysis was conducted on a set of pathology notes, with the objectives of identifying lexical and linguistic characteristics in the narratives, and the difficulties or challenges that may be encountered when processing these texts. An assessment of the completeness of the original reports and proposals of appropriate strategies for establishing structured templates were subsequently completed.
Three pathology corpora were annotated with entities and relationships between the entities in this study, namely the melanoma corpus, the colorectal cancer corpus and the lymphoma corpus. Detailed annotation schemas and guidelines were developed in an iterative process to ensure annotation consistency. A supervised machine learning-based approach was developed to recognise medical entities from the corpora. Specifically, the medical entity recognition system used conditional random fields (CRF) learners. The CRF-based models were able to capture a significant portion of the entity boundaries by using contextual information. The application of rich feature sets provided useful clues for the classification of entity types. By feature engineering, the best feature configurations were attained, which boosted the F-scores significantly, by 4.2% to 6.8%, in 10-fold cross-validation experiments on the training sets. Several common effective features across the three corpora were identified, which can be beneficial for other medical entity recognition tasks. Without proper negation and uncertainty detection, final outputs for several fields in the structured templates will be affected, and consequently the quality of the structured reports will be diminished. The negation and uncertainty detection modules were built to handle this problem. The modules obtained very good performance (with over 99% overall F-scores) on the training sets, which dropped on the test sets (where overall F-scores decreased to between 76.6% and 91.0%). A relation extraction system was presented to extract four relations from the lymphoma corpus. A rule-based approach was applied to classify the Spatial Specialization relation, while a supervised machine learning-based approach was adopted to identify the Result-Positive, Result-Negative and Result-Equivocal relations.
Simple heuristic rules were applied in the rule-based module, while several useful features were prepared for the support vector machines (SVM) classifier. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% micro-averaged F-score attained by the SVM classifier. Predefined templates were designed based on a thorough review of the structured reporting protocols and analysis of the training corpora. Rule-based approaches were used to generate the structured outputs and populate the templates with them. The rule-based system attained F-scores of over 97% on the training sets. A pipeline system was implemented by assembling all the components described above. It achieved promising results in the end-to-end evaluations, with micro-averaged F-scores of 86.5%, 84.2% and 78.9% on the melanoma, colorectal cancer and lymphoma test sets respectively. The pipeline system can be applied to cancer registries, clinical audits and epidemiology research. With further improvement, it can also significantly improve the quality of pathology reporting in the clinical setting.
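
The abstract reports several micro-averaged F-scores without stating how they are computed. The sketch below shows the standard definition only: true positives, false positives and false negatives are pooled over all classes (or template fields) before precision, recall and F are computed once. The names `Counts` and `micro_f1` and the example counts are hypothetical and made up for illustration; they are not taken from the study.

```python
from collections import namedtuple

# Hypothetical container for the counts of one class or template field.
Counts = namedtuple("Counts", ["tp", "fp", "fn"])


def micro_f1(per_class_counts):
    """Pool the counts over all classes, then compute precision, recall and F1 once."""
    tp = sum(c.tp for c in per_class_counts)
    fp = sum(c.fp for c in per_class_counts)
    fn = sum(c.fn for c in per_class_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for two hypothetical fields (not results from the study).
example = [Counts(tp=8, fp=2, fn=0), Counts(tp=2, fp=0, fn=2)]
score = micro_f1(example)
```

Because the counts are pooled, frequent classes weigh more heavily in a micro-averaged score than rare ones, which is one reason it can differ from a macro-average that treats every class equally.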
doi:10.1016/j.artmed.2015.03.001 pmid:25990897 fatcat:yrijkncnsvht7lonmaqt7uyya4