WebSelF: A Web Scraping Framework

Jakob Thomsen, Erik Ernst, Claus Brabrand, Michael Schwartzbach

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated the framework through a fully parameterized implementation that is flexible enough to capture previous work on web scraping. We have evaluated the framework and its implementation experimentally, testing several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages drawn from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that composing previous techniques with our new ones achieves a higher degree of accuracy, precision, and specificity than existing techniques alone.
Original language: English
Title of host publication: ICWE'12 Proceedings of the 12th international conference on Web Engineering
Number of pages: 15
Volume: 7387
Publisher: Springer
Publication date: 2012
Pages: 347-361
ISBN (Print): 978-3-642-31752-1
Publication status: Published - 2012
Event: International Conference on Web Engineering - Berlin, Germany
Duration: 23 Jul 2012 to 27 Jul 2012
Conference number: 12
http://www.informatik.uni-trier.de/~ley/db/conf/icwe/

Conference

Conference: International Conference on Web Engineering
Number: 12
Country/Territory: Germany
City: Berlin
Period: 23/07/2012 to 27/07/2012

Keywords

  • Web scraping framework
  • Parameterized implementation
  • Experimental evaluation
  • HTML data extraction
  • Accuracy and precision in data collection