Setting Crunchbase for Data Science: Preprocessing, Data Integration and Feature Engineering
CARMA 2020 - 3rd International Conference on Advanced Research Methods and Analytics
In order to support equity investors in their decision-making process, researchers are exploring the potential of machine learning algorithms to predict the financial success of startup ventures. In this context, a key role is played by the significance of the data used, which should reflect most of the variables considered by investors in their screening and evaluation activity. This paper provides a detailed description of the data management process that can be followed to obtain such a
... obtain such a dataset. Using Crunchbase as the main data source, other databases have been integrated to enrich the information content and support the feature engineering process. Specifically, the following sources has been considered: USPTO PatentsView, Kauffman Indicators of Entrepreneurship, Academic Ranking of World Universities, CB Insights ranking of top-investors. The final dataset contains the profiles of 138,637 US-based ventures founded between 2000 and 2019. For each company the elements assessed by equity investors have been analyzed. Among others, the following specific areas were considered for each company: location, industry, founding team, intellectual property and funding round history. Data related to each area have been formalized in a series of features ready to be used in a machine learning context.