Evaluating performance of biomedical image retrieval systems—An overview of the medical image retrieval task at ImageCLEF 2004–2013

Jayashree Kalpathy-Cramer, Alba García Seco de Herrera, Dina Demner-Fushman, Sameer Antani, Steven Bedrick, Henning Müller
2015 Computerized Medical Imaging and Graphics  
Medical image retrieval and classification have been extremely active research topics over the past 15 years. With the ImageCLEF benchmark in medical image retrieval and classification, a standard test bed was created that allows researchers to compare their approaches and ideas on increasingly large and varied data sets, including generated ground truth. This article describes the lessons learned in ten evaluation campaigns. A ... of the data also highlights the value of the resources created.

Introduction:

While development of image retrieval approaches and systems began as a research field over two decades ago [5,34,38], progress has been slow for a variety of reasons. One problem is the inability of image processing algorithms to automatically identify the content of images in the manner that information retrieval and extraction systems have been able to do with text [4,38]. A second problem is the lack of robust test collections and, in particular, realistic query tasks with ground truth that allow comparison of system performance [4,18,25,32]. In general, the limits of systematic comparisons have been analyzed in several publications [44], but an important impact could also be shown when evaluating the results of such benchmarks [42,40], particularly economic value but also scholarly impact in terms of citations.

The lack of realistic test collections is one of the motivations for the ImageCLEF initiative, which is a part of the Cross-Language Evaluation Forum (CLEF), a challenge evaluation for information retrieval from diverse languages [24]. The goals of CLEF are to build realistic test collections that simulate real-world retrieval tasks, enable researchers to assess the performance of their systems, and compare their results with others. The goal of test collection construction is to assemble a large collection of content (text, images, structured data, etc.) that resembles collections used in the real world. Builders of test collections also seek a sample of realistic tasks to serve as topics that can be submitted to systems as queries to retrieve content [18,33].
The final component of test collections is relevance judgments that determine which content is considered relevant to each topic.

Biomedical information retrieval systems are complex, comprising many key components. These include image modality classification [29], visual image similarity computation, multimodal image and text information retrieval, and others that may be specific to individual systems. Performance evaluation needs to be conducted on these components to determine the overall system performance. With the exponential increase in available biomedical data repositories, it is important for the evaluation to be close to the real world in its size and scope. The ImageCLEF medical retrieval tasks have provided such an
doi:10.1016/j.compmedimag.2014.03.004 pmid:24746250 pmcid:PMC4177510 fatcat:xuck5guwojexviqczz4keargjm