Artificial Intelligence for diagnosis of fractures on plain radiographs: a scoping review of current literature
Aim: To conduct a scoping review of the literature on the performance of artificial intelligence (AI) systems currently in development for detecting fractures on plain radiographic images. Methods: Papers were identified systematically, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement. Following application of inclusion and exclusion criteria, sixteen studies were
included in the final review. Results: With the exception of one study, all studies report that AI models demonstrated an ability to perform fracture identification tasks on plain skeletal radiographs. The metrics used to report performance varied across the reviewed studies and included area under the receiver operating characteristic curve (AUC), sensitivity and specificity, positive predictive value, negative predictive value, precision, recall, F1 score and accuracy. Reported AUC values ranged from 0.78 for the weakest system to 0.99 for the best-performing system. Conclusion: The review found wide variation in the AI model architectures, in the training and testing methodology, and in the metrics used to report the performance of the networks. Standardising the reporting metrics and methods would permit comparison of proposed models and training methods, which may accelerate the testing of AI systems in the clinical setting. Prevalence-agnostic metrics should be used to reflect the true performance of such systems. Many studies offered no explanation of the algorithmic decision making of the AI models, and few interrogated the potential reasons for misclassification errors. This type of 'failure analysis' would have provided insight into the biases and the aetiology of AI misclassifications.
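The call for prevalence-agnostic metrics can be illustrated with a minimal Python sketch. It assumes a hypothetical classifier with fixed sensitivity and specificity of 0.90, evaluated on two hypothetical test sets of 1000 radiographs with different fracture prevalence. The names `ConfusionCounts`, `metrics` and `counts_at_prevalence` are illustrative only, and none of the numbers are taken from the reviewed studies.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from a binary fracture / no-fracture confusion matrix."""
    tp: int  # fractures correctly flagged
    fp: int  # normal radiographs wrongly flagged
    tn: int  # normal radiographs correctly cleared
    fn: int  # fractures missed


def metrics(c: ConfusionCounts) -> dict:
    """Compute common reporting metrics; assumes every denominator is non-zero."""
    sensitivity = c.tp / (c.tp + c.fn)  # also called recall
    specificity = c.tn / (c.tn + c.fp)
    ppv = c.tp / (c.tp + c.fp)          # also called precision
    npv = c.tn / (c.tn + c.fn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


def counts_at_prevalence(n: int, prevalence: float,
                         sensitivity: float, specificity: float) -> ConfusionCounts:
    """Build the confusion counts a classifier with the given sensitivity and
    specificity would produce on n images at the given fracture prevalence."""
    positives = round(n * prevalence)
    negatives = n - positives
    tp = round(positives * sensitivity)
    tn = round(negatives * specificity)
    return ConfusionCounts(tp=tp, fp=negatives - tn, tn=tn, fn=positives - tp)


if __name__ == "__main__":
    # Same hypothetical classifier, two hypothetical test-set prevalences.
    for prevalence in (0.5, 0.1):
        m = metrics(counts_at_prevalence(1000, prevalence, 0.9, 0.9))
        print(prevalence, {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions, sensitivity and specificity stay at 0.90 on both test sets, while the positive predictive value falls from 0.90 at 50% prevalence to 0.50 at 10% prevalence, and the F1 score falls with it. Prevalence-dependent metrics reported from differently balanced test sets are therefore not directly comparable.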