Configuring And Assembling Information Retrieval Based Solutions For Software Engineering Tasks

Bogdan Dit
2015
Textual or unstructured data generated during the software development process contains a significant amount of useful information that captures design decisions and the rationale of developers. One way to exploit this information in support of various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis) is to use Information Retrieval (IR) techniques (e.g., Vector Space Model, Latent Semantic Indexing, Latent Dirichlet Allocation). Two of the most important steps in a typical process of applying IR techniques to support SE tasks are: (i) preprocessing the corpus (i.e., a set of documents associated with a software system) by removing special characters, splitting identifiers, removing stop words, stemming identifiers, etc., and (ii) configuring the IR technique (i.e., setting its parameters) and applying it to the preprocessed corpus. In our previous work, we observed that the various options available for the preprocessing steps of the corpus (e.g., splitting identifiers), as well as the different parameter values for configuring IR techniques (e.g., the parameters of LDA), can significantly influence the results produced by IR techniques on different datasets for various SE tasks.
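The two steps described above can be illustrated with a minimal sketch, which is not the author's tooling. It assumes a toy stop-word list, a crude suffix-stripping function standing in for a real stemmer such as Porter, and a smoothed TF-IDF weighting with cosine similarity as one possible Vector Space Model configuration. All function names, option names, and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stop-word list (an assumption for illustration, not from the source).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "get", "set"}


def split_identifier(token):
    """Split camelCase and snake_case identifiers into lowercase parts."""
    parts = []
    for chunk in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, split=True, remove_stop=True, stem=True):
    """Step (i): turn raw text into a list of terms, with each option switchable."""
    # Keeping only alphanumeric runs drops special characters.
    tokens = re.findall(r"[A-Za-z0-9_]+", text)
    if split:
        tokens = [part for t in tokens for part in split_identifier(t)]
    else:
        tokens = [t.lower() for t in tokens]
    if remove_stop:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


def rank(query, documents, **options):
    """Step (ii): rank documents against a query with TF-IDF and cosine similarity."""
    docs = {name: Counter(preprocess(text, **options)) for name, text in documents.items()}
    query_counts = Counter(preprocess(query, **options))
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for counts in docs.values() if term in counts)
        return math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed variant)

    def weights(counts):
        return {term: tf * idf(term) for term, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_weights = weights(query_counts)
    scores = {name: cosine(query_weights, weights(c)) for name, c in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-document corpus and query.
documents = {
    "FileOpener.java": "void openFile(String path) { }",
    "UserStore.java": "String getUserName(int id) { }",
}
with_splitting = rank("open a file", documents, split=True)
without_splitting = rank("open a file", documents, split=False)
```

In this toy corpus, the query matches `FileOpener.java` only when identifier splitting is enabled. With `split=False` the term `openfile` never matches `open` or `file`, so every score is zero. This is a small instance of the sensitivity to preprocessing options that the abstract describes.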
doi:10.21220/m2-hsht-dh07 fatcat:vfgnm5cm4nae5jsu7j3kfqcy4e