Effective web scraping with OXPath

Giovanni Grasso, Tim Furche, Christian Schallhart
2013 Proceedings of the 22nd International Conference on World Wide Web - WWW '13 Companion  
Even in the third decade of the Web, scraping web sites remains a challenging task: most scraping programs are still developed as ad-hoc solutions using a complex stack of languages and tools. Where comprehensive extraction solutions exist, they are expensive, heavyweight, and proprietary. OXPath is a minimalistic wrapping language that is nevertheless expressive and versatile enough for a wide range of scraping tasks. In this presentation, we want to introduce you to a new paradigm of declarative navigation: instead of complex scripting or heavyweight, limited visual tools, OXPath turns scraping into a simple two-step process: pick the relevant nodes through an XPath expression and then specify which action to apply to those nodes. OXPath takes care of browser synchronisation as well as page and state management, making scraping as easy as node selection with XPath. To achieve this, OXPath does not require a complex or heavyweight infrastructure. OXPath is an open source project and has seen first adoption in a wide variety of scraping tasks.
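
The two-step idea in the abstract (select nodes with an XPath expression, then apply an action to them) can be illustrated with a minimal Python sketch. This is not OXPath syntax and does not drive a browser; it only uses the limited XPath subset in Python's standard `xml.etree.ElementTree` on a static document. The sample page, the helper `select_and_apply`, and the extraction action are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption, not from the paper).
PAGE = """
<html><body>
  <div class="paper"><span class="title">Example A</span></div>
  <div class="paper"><span class="title">Example B</span></div>
  <div class="ad"><span class="title">Ignore me</span></div>
</body></html>
""".strip()


def select_and_apply(root, path, action):
    """Step 1: select nodes with an XPath expression.
    Step 2: apply the given action to every selected node."""
    return [action(node) for node in root.findall(path)]


root = ET.fromstring(PAGE)

# Select the title spans inside "paper" blocks, then extract their text.
titles = select_and_apply(
    root,
    ".//div[@class='paper']/span[@class='title']",
    lambda node: node.text,
)
print(titles)
```

In this sketch the action is a plain Python function that reads text. According to the abstract, OXPath's actions are applied to nodes in a live page, with browser synchronisation and page and state management handled by the language itself, none of which this static example attempts.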
doi:10.1145/2487788.2487796 dblp:conf/www/GrassoFS13 fatcat:vu2oa2c7ibe3naj27poffssvze