Augmenting and structuring user queries to support efficient free-form code search
Proceedings of the 40th International Conference on Software Engineering - ICSE '18
Source code terms such as method names and variable types are often different from conceptual words mentioned in a search query. This vocabulary mismatch problem can make code search inefficient. In this paper, we present COde voCABUlary (CoCaBu), an approach to resolving the vocabulary mismatch problem when dealing with free-form code search queries. Our approach leverages common developer questions and the associated expert answers to augment user queries with the relevant, but missing,
... ural code entities in order to improve the performance of matching relevant code examples within large code repositories. To instantiate this approach, we build GitSearch, a code search engine, on top of GitHub and Stack Overflow Q&A data. We evaluate GitSearch in several dimensions to demonstrate that (1) its code search results are correct with respect to user-accepted answers; (2) the results are qualitatively better than those of existing Internet-scale code search engines; (3) our engine is competitive against web search engines, such as Google, in helping users solve programming tasks; and (4) GitSearch provides code examples that are acceptable or interesting to the community as answers for Stack Overflow questions.
Raphael Sirres et al.
... implement most program elements (e.g., classes and methods) based on existing programs already written by other programmers, an effective code search engine is a critical factor for programming productivity. Open source project hosting platforms, such as GitHub, SourceForge, and BitBucket, now offer an opportunity for students, researchers and developers to access real-world software projects for improving their work. It is, however, challenging to locate relevant source code due to the enormous size of existing code repositories. For instance, as of August 2015, GitHub hosted more than 25 million private and public code repositories (see footnote 1). To help developers search for source code, several Internet-scale code search engines, such as OpenHub and Codota, have been proposed. The advantage of these engines is that users can express their queries as a list of keywords (i.e., free-form queries) rather than as specific program elements such as API classes and methods. Unfortunately, these Internet-scale code search engines have an accuracy issue since they treat source code as natural language documents.
Source code, however, is written in a programming language while query terms are typically expressed in natural language. As a result, searching source code with query keywords in natural language often leads to irrelevant and low-quality search results unless the keywords exactly correspond to program elements. According to Hoffmann et al., however, around 64% of programmer web queries for code are merely descriptive and do not contain actual names of APIs, packages, types, etc. As in any search engine, the terms in a code search query must be mapped to an index built from the code. Unfortunately, both the construction of such an index and the mapping process are challenging since "no single word can be chosen to describe a programming concept in the best way". This is known in the literature as the vocabulary mismatch problem: user search queries frequently mismatch a majority of the relevant documents [15, 20, 45, 46]. This problem occurs in various software engineering research tasks, such as retrieving regulatory codes in product requirement specifications, identifying bug files based on bug reports, and searching code examples. The vocabulary mismatch problem is further exacerbated in code search engines, where the source code may be poorly documented or may use non-explicit names for variables and methods. To work around the translation issue between the query terms and the relevant code, one can leverage a developer community. In practice, developers often resort to web-based resources such as blogs, tutorial pages and Q&A sites. Stack Overflow is one such leading discussion platform, which has gained popularity among software developers. In Stack Overflow, an answer to a question is typically a short text accompanied by code snippets that demonstrate a solution to a given development task or the usage of a particular functionality in a library or framework.
Stack Overflow provides social mechanisms to assess and improve the quality of posts, which implicitly leads to high-quality source code snippets. While code snippets found in Q&A sites certainly accelerate the software development process, they fail to exploit the potential of large code repositories. Typically, those code snippets are manually crafted by developers rather than being actual examples from source code repositories. Thus, snippets often omit context information (e.g., variable types and initialization values) that might be necessary to understand interactions with other relevant components. On the other hand, actual examples in source code repositories can provide different views on how a single functionality can be implemented by different APIs. Source code repositories also contain concrete code that demonstrates the interaction between various modules and APIs of interest. Besides, in Q&A sites, an acceptable answer usually exists only when the question, or a very similar one, has been asked before. Otherwise, the questioner must wait for other experienced developers to provide answers.
1 https://github.com/about/press (verified 14.08.2015)
Our work focuses on building an approach to automatically expanding developer code search queries. Specifically, we aim at translating free-form queries to augment them with relevant program elements. To augment a user query, we first find similar queries (in terms of natural language words) for which we have some sketched answers. We then collect important code keywords from these answers. Finally, these code keywords are used to enrich the user's initial free-form terms. This query expansion is effective in retrieving relevant code search results even when the user's query terms omit essential information such as API names.
Contributions
We propose a novel approach to augmenting user queries in a free-form code search scenario.
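As a rough illustration of the augmentation steps described above (find questions similar to the query, collect code keywords from their answers, enrich the query with those keywords), the following Python sketch runs on a tiny hand-written Q&A list. The corpus, the stopword list, the Jaccard word-overlap similarity, and the names QA_POSTS, tokenize, jaccard, and augment_query are all assumptions made for illustration; this is not GitSearch's actual implementation.

```python
import re
from collections import Counter

# Hypothetical miniature Q&A corpus (illustration only): each entry pairs a
# question title with code entities taken from its accepted answer.
QA_POSTS = [
    ("How do I read a text file line by line in Java",
     ["BufferedReader", "FileReader", "readLine"]),
    ("Reading a file into a string",
     ["Files", "readAllBytes", "Paths"]),
    ("How to send an HTTP GET request",
     ["HttpURLConnection", "URL", "openConnection"]),
]

# Small ad hoc stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "how", "do", "i", "to", "in", "by", "into"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def augment_query(query, posts=QA_POSTS, top_posts=2, top_terms=3):
    """Return the free-form query plus code keywords from similar questions."""
    query_tokens = tokenize(query)
    # Step 1: score known questions by textual similarity to the query.
    scored = [(jaccard(query_tokens, tokenize(title)), terms)
              for title, terms in posts]
    similar = sorted((s for s in scored if s[0] > 0),
                     key=lambda s: s[0], reverse=True)[:top_posts]
    # Step 2: collect code keywords from the answers, weighted by similarity.
    weights = Counter()
    for score, terms in similar:
        for term in terms:
            weights[term] += score
    # Step 3: keep the free-form words and add the strongest code keywords.
    code_terms = [term for term, _ in weights.most_common(top_terms)]
    return {"text": query, "code": code_terms}
```

Calling `augment_query("read file line by line")` returns the original words together with code keywords drawn from the most similar question, even though the query itself names no API. Keeping the free-form text and the code keywords in separate fields is a design choice of this sketch, so that a search backend could match each part against a different index.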
This approach aims at improving the quality of code examples returned by Internet-scale code search engines by building a COde voCABUlary (CoCaBu). The originality of CoCaBu is that it addresses the vocabulary mismatch problem by expanding, enriching, and re-targeting a user's free-form query, building on similar questions in Q&A sites, so that a code search engine can find highly relevant code in source code repositories. Overall, this paper makes the following contributions:
- CoCaBu approach to the vocabulary mismatch problem: We propose a technique for finding relevant code with free-form query terms that describe programming tasks, with no a priori knowledge of the API keywords to search for. In this regard, we differ from several state-of-the-art techniques, which search for relevant usage examples of APIs that the user can already list as relevant for the task [10, 25, 31, 35].
- GitSearch free-form search engine for GitHub: We instantiate the CoCaBu approach based on indices of Java files built from GitHub and Q&A posts from Stack Overflow to find the most relevant source code examples for developer queries.
- Empirical user evaluation: We present evaluation results implying that GitSearch accurately extends user queries to produce correct (i.e., relevant) results. Comparison with popular code search engines further shows that GitSearch is more effective in returning acceptable code search results. In addition, comparison against web search engines indicates that GitSearch is a competitive alternative. Finally, via a live study, we show that users on Q&A sites may find GitSearch's real code examples acceptable as answers to developer questions.
The remainder of this paper is organized as follows. Section 2 motivates our work further, listing some limitations in the state of the art and introducing the key ideas behind our approach. Section 3 then overviews the CoCaBu approach.
We provide evaluation results in Section 4 and discuss related work in Section 5. Finally, Section 6 concludes the paper.
Motivation
The literature contains a large body of approaches that attempt to solve the vocabulary mismatch problem. They either 1) use a controlled vocabulary maintained by