Connecting family trees to construct a population-scale and longitudinal geo-social network for the U.S.

Caglar Koylu, Diansheng Guo, Yuan Huang, Alice Kasakoff, Jack Grieve
We collected 92,832 user-contributed and publicly available family trees, including 250 million individuals who were born in North America and Europe between 1630 and 1930. We cleaned and connected the family trees to create a population-scale and longitudinal family tree dataset, using a workflow of data collection and cleaning, geocoding, fuzzy record linkage, and a relation-based iterative search for connecting trees and deduplicating records. With a largest connected component of nearly 40 million individuals, and a total of 80 million individuals, we generated what is, to date, the largest population-scale and longitudinal geo-social network spanning centuries. We evaluated the representativeness of the family tree dataset for historical population demography and mobility by comparing the data to the 1880 Census. Our results showed that the family trees were biased towards males, the elderly, farmers, and native-born white segments of the population. Individuals were highly mobile: in our 1880 sample of parent-child pairs where both were born in the U.S., 47% were born in different states. Our findings agreed with prior studies that people migrated from East to West in horizontal bands, and that this trend was reflected in the dialects and regional structure of the U.S.
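To illustrate what a "largest connected component" means once trees have been linked, here is a minimal sketch using a union-find structure. It is not the authors' workflow: the record identifiers, the `links` list, and the function names are hypothetical. It assumes that record linkage has already produced pairs of records judged to be the same person, and it treats those pairs as edges alongside within-tree family relations.

```python
from collections import Counter


def find(parent, x):
    """Return the representative of x's component (with path halving)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the components containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def largest_component(people, links):
    """Return (size of largest component, number of components)."""
    parent = {p: p for p in people}
    for a, b in links:
        union(parent, a, b)
    sizes = Counter(find(parent, p) for p in people)
    _, size = sizes.most_common(1)[0]
    return size, len(sizes)


# Hypothetical toy records, written as "tree:person".
people = ["t1:ann", "t1:bob", "t2:ann", "t2:cat", "t3:dan"]
links = [
    ("t1:ann", "t1:bob"),  # family relation within tree 1
    ("t2:ann", "t2:cat"),  # family relation within tree 2
    ("t1:ann", "t2:ann"),  # assumed record-linkage match across trees
]
size, n_components = largest_component(people, links)
print(size, n_components)
```

In this toy input, the cross-tree match joins trees 1 and 2 into one component, while tree 3 stays isolated. A real pipeline would also merge the matched duplicate records into one individual, which this sketch does not do.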
doi:10.6084/m9.figshare.13026504.v1 fatcat:caeb7hu2s5cypap3hz5eolxk4y