Empowering the data science scientist

Jason H. Moore
2021 BioData Mining
The discipline of data science has emerged, flourished, and evolved rapidly over the last 20 years in lockstep with the rise of big data, artificial intelligence, machine learning, statistics, and inexpensive computing. At its core, data science is about integrating the right methods, tools, and technology from different disciplines for the sole purpose of solving a complex data-driven problem in a particular domain such as economics, engineering, or medicine. All data science challenges start with a question. What is the best investment strategy? When will this bridge need to be replaced? Why do some people have adverse reactions to a drug?

A key question is "where do these questions come from?" Most questions arise from domain experts. This is intuitive given that economists, engineers, and clinicians have deep knowledge of their specific areas. They know the scientific literature and know where the gaps are. Unfortunately, the trend across disciplines has been to specialize. This, coupled with the rapid expansion of the scientific literature, means that experts are increasingly unaware of key literature outside their specific area. For example, a mechanical engineer working on nanotechnology might be unaware of the mechanical engineering literature in biotechnology. Similarly, a clinician specializing in gastroenterology is not likely to be keeping up with the latest developments in neurology. The impact of this specialization is that the questions being asked are not informed by literature in other areas.

As the ones who ask the questions, domain experts are usually the scientists leading the research studies. This of course makes sense. An important challenge comes from how data scientists are engaged. Unfortunately, domain experts sometimes see data scientists as service personnel. That is, the data scientist is brought onto the project to perform the data management and analysis and is then released. There are several issues with this approach. Most obvious is the importance of engaging data scientists early in the development of the research project so that the design of the study is consistent with the analytical approaches to be used. As the great statistician Sir Ronald A. Fisher once said, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a postmortem examination. He can perhaps say what the experiment died of."
doi:10.1186/s13040-021-00246-x pmid:33485343 fatcat:6wxk54bc25gwdi3ysiujyali4q