WebSelF: A Web Scraping Framework

Jakob Thomsen, Erik Ernst, Claus Brabrand, Michael Schwartzbach

    Publication: Conference article in proceedings or book/report chapter; Conference contribution in proceedings; Research; Peer-reviewed

    Abstract

    We present WebSelF, a framework for web scraping that models the scraping process and decomposes it into four conceptually independent, reusable, and composable constituents. We have validated the framework through a full parameterized implementation that is flexible enough to capture previous work on web scraping. We evaluated the framework and its implementation in an experiment covering several qualitatively different web scraping constituents (including previous work and combinations thereof) on about 11,000 HTML pages drawn from daily versions of 17 web sites over a period of more than one year. Our framework solves three concrete problems with current web scraping, and our experimental results indicate that composing previous techniques with our new ones achieves a higher degree of accuracy, precision, and specificity than existing techniques alone.
    Original language: English
    Title: ICWE'12 Proceedings of the 12th International Conference on Web Engineering
    Number of pages: 15
    Volume: 7387
    Publisher: Springer
    Publication date: 2012
    Pages: 347-361
    ISBN (Print): 978-3-642-31752-1
    Status: Published - 2012
    Event: International Conference on Web Engineering - Berlin, Germany
    Duration: 23 Jul 2012 to 27 Jul 2012
    Conference number: 12
    http://www.informatik.uni-trier.de/~ley/db/conf/icwe/

    Conference

    Conference: International Conference on Web Engineering
    Number: 12
    Country/Territory: Germany
    City: Berlin
    Period: 23/07/2012 to 27/07/2012

    Keywords

    • Web scraping framework
    • Parameterized implementation
    • Experimental evaluation
    • HTML data extraction
    • Accuracy and precision in data collection
