C-C1-02: Data Extraction From Text, Step 1: Preparing Text for Machine Processing
D. Carrell
2010
Clinical Medicine & Research
To craft a strategic plan for developing scientific and technical capacity to apply natural language processing (NLP) techniques to clinical text at HMO Research Network (HMORN) sites, we convened a working conference in Seattle on December 17-18, 2008. Expert consultation in the strategic planning process was provided by six NLP experts from three institutions considered national leaders in the development and application of NLP technologies: the Mayo Clinic, the University of Pittsburgh Medical Center, and Vanderbilt University. Conference participants included representatives of five HMORN sites, Northwestern University, and the University of Washington. The specific aims of the conference were: 1) to develop a strategic plan for acquiring NLP capacity, potentially including development of in-house capacity, subcontracting NLP tasks to one or more external collaborators or contractors, some combination of these approaches, or none at all; and 2) to educate investigators and technical staff about the processes of developing, testing, and applying NLP algorithms for extracting data from clinical text in the context of typical HMORN-style research programs. Traditional conference presentations addressed "What is natural language processing," "Illustrations of potential NLP projects," "Considerations when developing and applying NLP algorithms," and "Comprehensive, large-scale information extraction systems." Round-table-style discussion sessions focused on human capital requirements for NLP projects, subcontracting NLP tasks to outside entities, multi-site collaboration on NLP projects, and a strategic plan for developing NLP capacity. This presentation summarizes the strategic planning products of the conference, which are currently in draft form and undergoing edits by a working group of the expert consultants and conference participants. Preliminarily, the strategic plan recommends incremental development of NLP capacity in HMORN-like sites through sequential execution of thoughtfully selected small-scale NLP and NLP-like projects that represent key technical aspects of what can become a mature program of NLP capacity within 3-5 years. Initially, small-scale projects would be conducted in close collaboration with outside informatics experts, who would provide scientific and technical guidance to the immediate project and assure fidelity with a longer-term, locally adapted strategic plan.
Background: Natural language processing (NLP) uses software to assist in the extraction of information from clinical text, a process usually performed entirely by chart abstractors. Before NLP can be applied, the text in question must be prepared for machine processing. In research settings this pre-processing work often involves several successive and related tasks, requiring substantial amounts of time and attention from people representing various types of clinical, scientific, and technical expertise. Appreciating the tasks and participants involved in pre-processing clinical text can make the work more manageable, efficient, and effective. Methods: The information presented here comes from case study analyses of three small-scale projects involving preparation of clinical text (pathology reports, radiology reports, and progress notes) for processing by the Cancer Text Information Extraction System. Supplementing these experiences is information from informal conversations with natural language processing experts. Results: Ten separate pre-processing tasks were identified: 1) obtaining source feeds, 2) assessing completeness, 3) de-duplication, 4) universe description, 5) cleaning and formatting, 6) de-identification, 7) database loading, 8) sampling, 9) preparation of the NLP system input feed, and 10) quality assurance. Nine types of expertise or task participants required for pre-processing were identified: 1) IRB representative, 2) source-system manager, 3) network/database administrator, 4) programmer, 5) statistician, 6) investigator, 7) informaticist, 8) clinical domain expert, and 9) manual chart abstractor.
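Illustration: Three of the pre-processing tasks named above (de-duplication, cleaning and formatting, and sampling) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used in the projects described: the `Report` record, the functions `clean`, `deduplicate`, and `sample`, and the rule that two reports are duplicates when their cleaned text is identical are all hypothetical choices made for illustration. De-identification and loading into the NLP system are deliberately left out.

```python
import hashlib
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical record for one clinical document from a source feed."""
    report_id: str
    text: str


def clean(text: str) -> str:
    """Normalize line endings and collapse redundant whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[^\S\n]+", " ", text)   # runs of spaces/tabs -> one space
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


def deduplicate(reports):
    """Keep the first report for each distinct cleaned text (assumed rule)."""
    seen = set()
    unique = []
    for report in reports:
        digest = hashlib.sha256(clean(report.text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(report)
    return unique


def sample(reports, n, seed=0):
    """Draw a reproducible simple random sample of up to n reports."""
    rng = random.Random(seed)
    return rng.sample(reports, min(n, len(reports)))


if __name__ == "__main__":
    feed = [
        Report("a", "Line one\r\nLine  two"),
        Report("b", "Line one\nLine two"),   # same content after cleaning
        Report("c", "Other report"),
    ]
    unique = deduplicate(feed)
    print([r.report_id for r in unique])
    print([r.report_id for r in sample(unique, 1)])
```

Hashing the cleaned text rather than the raw text means that reports differing only in whitespace or line endings are treated as duplicates; a real project might instead need to compare identifiers or dates, which is a decision for the investigator and source-system manager. Fixing the random seed keeps the sample reproducible for quality assurance.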
doi:10.3121/cmr.8.1.53-a
fatcat:6fk5gzd5hzcdzmfkb2qjlsbwmi