Curating GitHub for engineered software projects

Nuthan Munaiah, Steven Kroh, Craig Cabrey, Meiyappan Nagappan
2017 Empirical Software Engineering  
Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such large corpora of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity in querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew a study and may lead researchers to unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges.

We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,994,977 GitHub repositories. We then used the data set to train classifiers capable of predicting if a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub stargazers) and found our classifiers to outperform existing approaches. We found the stargazers-based classifier to exhibit high precision (96%) but low recall (27%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (83%). The stargazers-based criterion offers precision but fails to recall a significant portion of the population.

1 INTRODUCTION

Software repositories contain a wealth of information about the code, people, and processes that go into the development of a software product. Retrospective analysis of these software repositories can yield valuable insights into the evolution and growth of the software products contained within. We can trace such analysis all the way back to the 1970s, when Belady and Lehman (1976) proposed Lehman's Laws of software evolution.
Today, the field is significantly invested in retrospective analysis, with the Boa project (Dyer et al., 2013) receiving more than $1.4 million to support such analysis.1 The insights gained through retrospective analysis can affect the decision-making process in a project and improve the quality of the software system being developed. An example of this can be seen in the recommendations made by Bird et al. (2011) in their study regarding the effects of code ownership on the quality of software systems. The authors suggest that quality assurance efforts should focus on those components with many minor contributors.

1 National Science Foundation (NSF) Grant CNS-1513263

PeerJ Preprints | https://doi.org/10.7287/peerj.preprints.2617v1 | CC BY 4.0 Open Access | rec: 6 Dec 2016, publ:

With forges like Bitbucket, SourceForge, and CodePlex for source code and Bugzilla, Mantis, and Trac for bugs, researchers now have an abundance of data from which to mine and draw interesting conclusions. Every source code commit contains a wealth of information that can be used to gain an understanding of the art of software development. For example, Eick et al. (2001) dived into the rich (fifteen-plus year) commit history of a large telephone switching system in order to explore the idea of code decay. Modern day source code repositories provide features that make managing a software project as seamless as possible. While the integration of features provides improved traceability for developers and project managers, it also provides MSR researchers with a single, self-contained, organized, and, more importantly, publicly-accessible source of information from which to mine. However, anyone may create a repository for any purpose at no cost. Therefore, the quality of information contained within the forges may be diminishing with the addition of many noisy repositories, e.g.
repositories containing homework assignments, text files, images, or, worse, the backup of a desktop computer. Kalliamvakou et al. (2014) identified this noise as one of the nine perils to be aware of when mining GitHub data for software engineering research. The situation is compounded by the sheer volume of repositories contained in these forges. As of June 2016, GitHub alone hosts over 38 million repositories2 and this number is rapidly increasing.

Researchers have used various criteria to slice the mammoth software forges into data sets manageable for their studies. For example, MSR researchers have leveraged simple filters such as popularity to remove noisy repositories. Filters like popularity (measured as the number of watchers or stargazers on GitHub, for example) are merely proxies and may neither be general-purpose nor representative of an engineered software project. Furthermore, MSR researchers should not have to reinvent filters to eliminate unwanted repositories. There are a few examples of research that take the approach of developing their own filters in order to procure a data set to analyze:

• In a study of the relationship between programming languages and code quality, Ray et al. (2014) selected the 50 most popular (measured by the number of stars) repositories in each of the 19 most popular languages.

• Bissyandé et al. (2013) chose the first 100,000 repositories returned by the GitHub API in their study of the popularity, interoperability, and impact of programming languages.

• Allamanis and Sutton (2013) chose 14,807 Java repositories with at least one fork in their study of applying language modeling to mining source code repositories.

The project sites for GHTorrent (GHTorrent, 2016) and Boa (Iowa State University, 2016) list more papers that employ different filtering schemes.
One could assume that the repositories sampled in these studies contain engineered software projects. However, source code forges are rife with repositories that do not contain source code, let alone an engineered software project. Kalliamvakou et al. (2014) manually sampled 434 repositories from GitHub and found that only 63.4% (275) of them were for software development; the remaining 159 repositories were used for experimental, storage, or academic purposes, or were empty or no longer accessible. The inclusion of repositories containing such non-software artifacts in studies targeting software projects could lead to conclusions that may not be applicable to software engineering at large. At the same time, selecting a sample by manual investigation is not feasible given the sheer volume of repositories hosted by these source code forges.

The goal of our work is to identify practices that an engineered software project would typically exhibit, with the intention of developing a generalizable framework with which to identify such projects in the real world.

The contributions of our work are:

• A generalizable evaluation framework defined on a set of dimensions that encapsulate typical software engineering practices;

• A reference implementation of the evaluation framework, called reaper, available as an open-source project (Munaiah et al., 2016c);

• A publicly-accessible data set of dimensions obtained from 1,994,977 GitHub repositories (Munaiah et al., 2016b).

2 https://github.com/about/press

The remainder of this paper is organized as follows: we begin by introducing the notion of an engineered software project in Section 2.
We then propose an evaluation framework in Section 2.1 that aims to operationalize the definition of an engineered software project along a set of dimensions. We describe the various sources of data used in our study in Section 3. In Section 4, we introduce the eight dimensions used to represent a repository in our study. In Section 5, we propose two variations of the definition of an engineered software project, collect a set of repositories that conform to the definitions, and present approaches to building classifiers capable of identifying other repositories that conform to the definition of an engineered software project. The results from validating the classifiers and using them to identify repositories that conform to a particular definition of an engineered software project from a sample of 1,994,977 GitHub repositories are presented in Section 6. We contrast our study with prior literature in Section 7, discuss prior and potential research scenarios in which the data set and the classifier could be used in Section 8, and discuss nuances of certain repositories in Section 9. We address threats to validity in Section 10 and conclude the paper with Section 11.

2 ENGINEERED SOFTWARE PROJECT

Laplante (2007) defines software engineering as "a systematic approach to the analysis, design, assessment, implementation, test, maintenance and reengineering of software". A software project may be regarded as "engineered" if there is discernible evidence of the application of software engineering principles such as design, test, and maintenance. Along similar lines, we define an engineered software project in Definition 2.1.

Definition 2.1. An engineered software project is a software project that leverages sound software engineering practices in each of its dimensions, such as documentation, testing, and project management.
Definition 2.1 is intentionally abstract; the definition may be customized to align with a set of different, yet relevant, concerns. For instance, a study concerned with the extent of testing in software projects could define an engineered software project as a software project that leverages sound software testing practices. In our study, we have customized the definition of an engineered software project in two ways: (a) an engineered software project is similar to the projects contained within repositories owned by popular software engineering organizations such as Amazon, Apache, Microsoft, and Mozilla, and (b) an engineered software project is similar to the projects that have a general-purpose utility to users other than the developers themselves. We elaborate on these two definitions in the Implementation Section (§5).

2.1 Evaluation Framework

In order to operationalize Definition 2.1, we need to (a) identify the essential software engineering practices that are employed in the development and maintenance of a typical software project and (b) propose means of quantifying the evidence of their use in a given software project. The evaluation framework is our attempt at achieving this goal.

The evaluation framework, in its simplest form, is a boolean-valued function defined as the piece-wise function shown in (1).

    f(r) = true,  if repository r contains an engineered software project
           false, otherwise                                               (1)

The evaluation framework makes no assumption about the implementation of the boolean-valued function, f(r). In our implementation of the evaluation framework, we have chosen to realize f(r) in two ways: (a) f(r) as a score-based classifier and (b) f(r) as a Random Forest classifier.
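As an illustration, a score-based realization of f(r) might look like the sketch below. The dimension names, point values, and cutoff are hypothetical placeholders, not the actual dimensions or weights used by reaper:

```python
from dataclasses import dataclass

# Minimal sketch of f(r), assuming a repository is summarized by a few
# quantified dimensions. Field names, points, and the cutoff are illustrative.
@dataclass
class Repository:
    documentation: float  # e.g. ratio of comment lines to source lines
    test_ratio: float     # e.g. test LOC relative to source LOC
    has_license: bool
    uses_ci: bool

def f(r: Repository) -> bool:
    """Score-based classifier: award a point per dimension with evidence
    of the practice, then compare the total against a cutoff."""
    score = sum([
        r.documentation > 0.01,
        r.test_ratio > 0.0,
        r.has_license,
        r.uses_ci,
    ])
    return score >= 3
```

A Random Forest realization would replace the hand-set points and cutoff with a model learned from labeled repositories (e.g. scikit-learn's RandomForestClassifier), but the interface, a boolean-valued f(r), stays the same.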
In both approaches, the implementation of the function, f(r), is achieved by expressing the repository, r, using a set of quantifiable attributes (called dimensions) that we believe are essential in reasoning that a repository contains an engineered software project.

3.1 GitHub Metadata

GitHub metadata contains a wealth of information with which we could describe several phenomena surrounding a source code repository. For example, some of the important pieces of metadata are the primary language of implementation in a repository and the commits made by developers to a repository. GitHub provides a REST API (GitHub, Inc., 2016a) with which GitHub metadata may be obtained over the Internet. There are several services that capture and publish this metadata in bulk, avoiding the latency of the official API. The GitHub Archive project (GitHub, Inc., 2016b) was created for this purpose. It stores public events from the GitHub timeline and publishes them via Google BigQuery. Google BigQuery is a hosted querying engine that supports SQL-like constructs for querying large data sets. However, accessing the GitHub Archive data set via BigQuery incurs a cost per terabyte of data processed.

Fortunately, Gousios (2013) offers a free alternative via the GHTorrent project. The GHTorrent project provides a scalable and queryable offline mirror of all Git and GitHub metadata available through the GitHub REST API. The GHTorrent project is similar to the GitHub Archive project in that both start with GitHub's public events timeline. While the GitHub Archive project simply records the details of a GitHub event, the GHTorrent project exhaustively retrieves the contents of the event and stores them in a relational database. Furthermore, the GHTorrent data sets are available for download, either as incremental MongoDB dumps or a single MySQL dump, allowing offline access to the metadata.
We have chosen to use the MySQL dump, which was downloaded and restored onto a local server. In the remainder of the paper, whenever we use the term database, we are referring to the GHTorrent database. The database dump used in this study was released on April 1, 2015. The database dump contained metadata for 16,331,225 GitHub repositories. In this study, we restrict ourselves to repositories in which the primary language is one of Java, Python, PHP, Ruby, C++, C, or C#. Furthermore, we do not consider repositories that have been marked as deleted and those that are forks of other repositories. Deleted repositories restrict the amount of data available for the analysis, while forked repositories can artificially inflate the results by introducing near duplicates into the sample. With these restrictions applied, the size of our sample is reduced to 2,247,526 repositories.

An inherent limitation of the database is the staleness of data. There may be repositories in the database that no longer exist on GitHub, as they may have been deleted, renamed, made private, or blocked by GitHub.

3.2 Source Code

In addition to the metadata about a repository, the code contained within is an important source of information about the project. Developers typically interact with their repositories using either the git client or the GitHub web interface. Developers may also use the GitHub REST API to programmatically interact with GitHub.

We use GitHub to obtain a copy of the source code for each repository. We cannot use GitHub's REST API to retrieve repository snapshots, as the API internally uses the git archive command to create those snapshots. As a result, the snapshots may not include files the developers may have marked irrelevant to an end user (such as unit test files).
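The repository selection described in Section 3.1 (primary language, deleted flag, fork status) amounts to a simple query over the GHTorrent projects table. The sketch below uses an in-memory SQLite stand-in for the restored MySQL dump, with a deliberately simplified schema:

```python
import sqlite3

# Stand-in for the restored GHTorrent dump: an in-memory SQLite database
# with a simplified 'projects' table (the real schema has many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (id INTEGER, language TEXT, "
             "deleted INTEGER, forked_from INTEGER)")
conn.executemany("INSERT INTO projects VALUES (?, ?, ?, ?)", [
    (1, "Python", 0, None),  # kept
    (2, "Java",   1, None),  # excluded: marked as deleted
    (3, "C",      0, 7),     # excluded: fork of repository 7
    (4, "Perl",   0, None),  # excluded: language not considered
])

LANGUAGES = ("Java", "Python", "PHP", "Ruby", "C++", "C", "C#")
placeholders = ", ".join("?" * len(LANGUAGES))
rows = conn.execute(
    f"SELECT id FROM projects WHERE language IN ({placeholders}) "
    "AND deleted = 0 AND forked_from IS NULL", LANGUAGES).fetchall()
# rows == [(1,)]
```

Only repository 1 survives the three restrictions, mirroring the reduction from 16,331,225 to 2,247,526 repositories described above.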
Since we wanted to examine all development files in our analysis, we used the git clone command instead to ensure all files are downloaded.

As mentioned earlier, the metadata used in this study is current as of April 1, 2015. However, this metadata may not be consistent with a repository cloned after April 1, 2015, as the repository contributors may have made commits after that date. In order to synchronize the repository with the metadata, we reset the state of the repository to a past date. For each evaluated repository in the database, we retrieved the date of the most recent commit to the repository. We then identified the SHA of the last commit made to the repository before the end of the day identified by date using the command git log -1 --before="{date} 11:59:59". For repositories with no commits recorded in the database, we used the date when the GHTorrent metadata dump was released, i.e. 2015-04-01.

We then use the cloc tool to compute sloc from all source files in the repository and slotc from the test files identified. Occasionally, a software project may use multiple unit testing frameworks; e.g. a Django web application project may use Python's unittest framework and Django's extension of unittest, django.test. In order to account for this scenario, we accumulate the test files identified using patterns for multiple language-specific unit testing frameworks before computing slotc.

The multitude of unit testing frameworks available for each of the programming languages considered makes the approach limited in its capabilities. We currently support 20 unit testing frameworks.
The unit testing frameworks currently supported are: Boost, Catch, googletest, and Stout gtest for C++; clar, GLib Testing, and picotest for C; NUnit, Visual Studio Testing, and xUnit for C#; JUnit and TestNG for Java; PHPUnit for PHP; django.test, nose, and unittest for Python; and minitest, RSpec, and Ruby Unit Testing for Ruby.

In scenarios where we are unable to identify a unit testing framework, we resort to considering all files in directories named test, tests, or spec as test files.

2. False Negative - EsotericSoftware/dnsmadeeasy is a repository that contains a simple Java tool that periodically updates the IP addresses in the DNS servers maintained by DNS Made Easy. Clearly, the tool has a general-purpose utility for any customer using the DNS Made Easy service. However, the repository received a score of 10. Analyzing the source code contained in the repository and its dimensions, we found that the project is too simple. The architecture dimension could not be computed because the repository contains only a single source file, which has no source code comments. The repository does have a license but does not use continuous integration, unit testing, or issues. However, the ground truth classification of this repository was that it contained an engineered software project because of the utility of the Java tool.

In the case where the ground truth classification is "not project", 82% of the time both approaches correctly classified repositories as "not project". In addition, the stargazers-based classifier correctly classified repositories as "not project" 99% of the time, where the random forest classifier did so 82% of the time.
Consider the repository liorkesos/drupalcamp, which has sufficient documentation, commit history, and community to be classified as a "project" by the random forest classifier; however, the repository is essentially a collection of static PHP files for a Drupal Camp website, not particularly useful in a general software engineering study. The stargazers-based classifier predicts liorkesos/drupalcamp as "not project" only for its lack of stars.

Summary

We can make three observations about the suitability of the score-based or random forest classifiers to help researchers generate useful data sets. First, the strict stargazers-based classifier ignores many valid projects but enjoys an almost 0% false positive rate. Second, the random forest classifier trained with the utility data set is able to correctly classify many "unpopular" projects, helping extend the population from which sample data sets may be drawn. Third, the score-based and random forest classifiers have imperfections of their own: they are likely to introduce false positives into research data sets. Perhaps our classifiers could be used as an initial selection criterion, augmented by the stargazers-based classifier. Nevertheless, we have shown that more work can be done to improve the data collection methods in software engineering research.

Prediction

In this section, we present the results from applying the score-based and random forest classifiers to identify engineered software projects in a sample of 1,994,977 GitHub repositories. Shown in Table 6 are the number of repositories classified as containing an engineered software project by the score-based and random forest classifiers.
With the exception of the score-based classifier trained using the utility data set, the number of repositories classified as containing an engineered software project is, on average, 12.45% of the total number of repositories analyzed. We can also see from Table 6 that there are far fewer

As mentioned in Section 4.1 (Architecture), the computational complexity may prevent the collection of the monolithicity metric for certain large repositories. There were 4,451 such repositories in our data set (a mere 0.22% of the total number of repositories). On average, 1,770 of the 4,451 repositories (~39.77%) were classified as containing an engineered software project with the architecture dimension defaulted to zero.

The entire data set may be viewed and downloaded as a CSV file from https://reporeapers.github.io. The data set includes the metric values collected from each repository. The data set available online contains information pertaining to 2,247,526 GitHub repositories, but 252,105 of those repositories were inactive at the time reaper was run and, as a result, their metric values will all be NULL.
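Researchers consuming the published CSV would therefore filter out rows whose metric values are NULL before sampling. A minimal sketch, with hypothetical column names and a made-up two-row sample standing in for the real file:

```python
import csv
import io

# Illustrative two-row sample of the published CSV; the real file has one
# row per repository with its metric (dimension) values. Column names here
# are hypothetical, but inactive repositories carry NULL metric values.
sample = io.StringIO(
    "repository,unit_test,license\n"
    "octocat/hello,0.12,1\n"
    "someone/backup,NULL,NULL\n"
)
rows = list(csv.DictReader(sample))

# Keep only rows whose metrics were actually collected.
active = [r for r in rows if r["unit_test"] != "NULL"]
```

After filtering, only repositories that were active when reaper ran remain, matching the 1,994,977-repository subset analyzed in the paper.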
doi:10.1007/s10664-017-9512-6