40 million open chemical structures from patents: treasure trove? junk yard? or both?

Christopher Southan
2022 Zenodo  
Compared to the literature, the patent corpus has both pros and cons for chemistry data mining. The cons include a) being a "Cinderella" source that is difficult to get to grips with, b) a massively redundant document corpus arising from patent families and kind codes and c) various degrees of deliberate obfuscation to impede data mining. Pros include a) paradoxically, compared to restricted access to the literature, patents are completely open for text mining and entity extraction, b) they contain 3x to ~5x more medicinal chemistry SAR than published papers, c) they include disclosures of new drug targets and chemotypes years ahead of papers, d) they constitute a rich source of executed synthesis protocols and experimental chemistry property data and e) within the last few years open automated chemical named entity recognition (CNER) has broken the monopoly of commercial chemistry curation. Because Medicines Discovery Catapult needs to keep up with developments in both commercial and open sources, this work was undertaken to update our overview of patent extractions in general and the expanding integration within PubChem in particular. The four largest PubChem sources, SureChEMBL, Google Patents, WIPO, and IBM, use similar CNER pipelines that include name look-ups, IUPAC conversions and image-to-structure extractions. Their compound (CID) counts are 21.5, 17.9, 17.7 and 10.7 million, respectively, and together with small sources such as NextMove Software synthetic pathway extractions at 1.8 million, the CNER sources add up to just under 40 million from the PubChem March 2022 total of 111 million. The "treasure trove" aspects that will be presented include a) expert curation of SAR from patents by BindingDB with 400K compounds from 5.4K US patents and data points covering 2,197 target proteins, b) extensive coverage of the ~5 million exemplified compounds from all filings classified under C07 and A61 relevant to medicinal chemistry and c) the ability to track back to exact example numbers in documents via SureChEMBL and WIPO. However, t [...]
doi:10.5281/zenodo.6656398 fatcat:dkya6umx6zbt5ex3xujrsv2y74
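
The per-source counts quoted in the abstract sum to well over the stated combined total, which is what one would expect if the same structures are extracted by several sources. The following is a minimal arithmetic sketch of that reading, not the author's method. It assumes that "just under 40 million" refers to distinct structures across the CNER sources and uses 40.0 as an upper bound; only the five sources named in the abstract are included, and the variable names are illustrative.

```python
# Per-source PubChem compound (CID) counts in millions, as quoted in the abstract.
source_counts_millions = {
    "SureChEMBL": 21.5,
    "Google Patents": 17.9,
    "WIPO": 17.7,
    "IBM": 10.7,
    "NextMove Software": 1.8,
}

# Simple sum: a structure is counted once for every source that reports it.
summed = round(sum(source_counts_millions.values()), 1)

# Assumption: the abstract's "just under 40 million" is the count of distinct
# structures across these sources, so 40.0 serves as an upper bound.
combined_upper_bound = 40.0

# Lower bound on source-level records that duplicate a structure
# already contributed by another source.
redundant_at_least = round(summed - combined_upper_bound, 1)

# Upper bound on the CNER share of the March 2022 PubChem total (111 million).
share_upper_bound = combined_upper_bound / 111.0

print(f"sum of named sources: {summed} million")
print(f"overlapping records:  at least {redundant_at_least} million")
print(f"share of PubChem:     under {share_upper_bound:.0%}")
```

Under these assumptions, the named sources overlap heavily with one another, and patent-extracted structures make up roughly a third of PubChem.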