Geocoding genomic databases using GBIF [article]

Roderic D. M. Page
2018 bioRxiv   pre-print
Abstract: Many nucleotide sequences in the publicly available genomics databases lack spatial information, such as the latitude and longitude coordinates for the locality where the sample for sequencing was taken. In this note I discuss several approaches to geocoding sequence records. The first method uses the Global Biodiversity Information Facility (GBIF) as a gazetteer. The availability of a simple full text search across GBIF data makes it possible to rapidly geocode ... information simply by searching for matching records within GBIF. Hence if a sequence lacks coordinates but has some locality information it could be rapidly geocoded. The second method matches voucher specimen codes for sequences with the corresponding specimen records in GBIF, which may be geocoded even if the sequence obtained from that specimen is not. Lastly, there will be cases where sequence records lack either locality or specimen information, but that information is available elsewhere, such as in the published literature or in supplementary data files. The possibility of publishing geocoded sequence records using GitHub is discussed.
doi:10.1101/469650 fatcat:o4jjs7ni3nadpm4x5wdge7gjee
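
The first method in the abstract, using GBIF as a gazetteer, could be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the public GBIF occurrence search endpoint with its `q` (full text) and `hasCoordinate` parameters, and the helper names `build_query`, `candidate_point` and `geocode` are hypothetical. Taking the median of the matching coordinates is one simple way to pick a candidate point and is an assumption made here, not something the abstract specifies.

```python
import json
from statistics import median
from urllib.parse import urlencode
from urllib.request import urlopen

# Public GBIF occurrence search endpoint (assumed for this sketch).
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_query(locality, limit=20):
    """Build a full-text search URL restricted to records that have coordinates."""
    params = {"q": locality, "hasCoordinate": "true", "limit": limit}
    return GBIF_SEARCH + "?" + urlencode(params)


def candidate_point(results):
    """Return the median (latitude, longitude) of the records, or None if
    no record carries both coordinates.

    The median is an illustrative choice; it is not reliable for localities
    that straddle the antimeridian.
    """
    points = [
        (r["decimalLatitude"], r["decimalLongitude"])
        for r in results
        if r.get("decimalLatitude") is not None
        and r.get("decimalLongitude") is not None
    ]
    if not points:
        return None
    return (median(p[0] for p in points), median(p[1] for p in points))


def geocode(locality):
    """Query GBIF for a locality string and return a candidate point (needs network)."""
    with urlopen(build_query(locality), timeout=30) as resp:
        payload = json.load(resp)
    return candidate_point(payload.get("results", []))
```

A call such as `geocode("Mt Kinabalu")` would return a coordinate pair if GBIF holds georeferenced records matching that text, and `None` otherwise. Any result would still need checking, since a text match does not guarantee that the records refer to the same place as the sequence.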