Automated corpus data collection

View the Project on GitHub UUDigitalHumanitieslab/corpus-scraper


Welcome. CorpusScraper is a small script that you can run from your bookmarks bar (a “bookmarklet”). It extracts data from an online text corpus and lets you save the data to a CSV file, in just a few clicks.

To install CorpusScraper, drag one of the boxes below to your bookmarks bar.

CorpusScraper can easily be extended to support other corpora. For copyright and citation information, see the bottom of this page.

Since version 1.0, the following corpora have been supported:

The following corpora from the Fundación Rafael Lapesa were added in version 2.0:

To use CorpusScraper, enter a word-in-context query in your corpus of choice and load the first page of search results. When the page has completely loaded, click the bookmarklet in your bookmarks bar. A progress bar is shown during extraction, which takes about a second per page. When extraction finishes, the page turns white and a text box appears containing the CSV text, which has already been selected for you to copy. Paste the text into an empty plain text file (using e.g. Notepad, TextEdit or gedit), choose UTF-8 as the file encoding, and save the file with a .csv extension. The resulting file can be opened with Excel or SPSS. It uses semicolons to separate the values and double quotes to mark text values.
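If you would rather process the saved file with a script than with Excel or SPSS, the format described above (UTF-8, semicolon-separated, text values in double quotes) can be read with Python's standard `csv` module. The following is a minimal sketch, not part of CorpusScraper itself: the function names, the file name and the sample columns are hypothetical, and the actual columns depend on the corpus you scraped.

```python
import csv
import io


def parse_corpus_csv(text):
    """Parse semicolon-separated, double-quoted CSV text into a list of rows."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";", quotechar='"')
    return [row for row in reader]


def load_corpus_csv(path):
    """Read a saved .csv file, assuming it was stored with UTF-8 encoding."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_corpus_csv(handle.read())


# Hypothetical sample data; a real export may have different columns.
sample = (
    '"left context";"keyword";"right context"\n'
    '"en la";"casa";"de mi padre; ayer"\n'
)
rows = parse_corpus_csv(sample)
for row in rows:
    print(row)
```

Because text values are wrapped in double quotes, a semicolon inside a value (as in the last field of the sample) is kept as part of that value rather than being treated as a separator. To read a file you saved yourself, call `load_corpus_csv("results.csv")`, replacing the hypothetical file name with your own.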

Latest version

If you install the bookmarklet through the box below, clicking on the bookmarklet will always give you the latest version of CorpusScraper, fully automagically. In other words, you never need to update your bookmarklet manually.


Version 2.0

If you wish to stick with version 2.0 and not receive automatic updates, install the bookmarklet from the box below instead.

ScrapeCorpus 2.0

Version 1.0

This was the first public version of CorpusScraper. If you find that version 2.0 does not work with your corpus, you can check whether this version works for you.

ScrapeCorpus 1.0

Copyright and citing

© 2014, 2016, 2017 Digital Humanities Lab, Faculty of Humanities, Utrecht University
Author: Julian Gonggrijp

In (academic) papers, you can credit CorpusScraper using one of the following BibTeX entries. The first is for version 2.0, the second is for version 1.0.

    title        = {{CorpusScraper 2.0}},
    author       = {Gonggrijp, Julian},
    month        = {may},
    year         = {2017},
    note         = {Computer program by the Digital Humanities Lab, Utrecht University. \url{}}

    title        = {{CorpusScraper 1.0}},
    author       = {Gonggrijp, Julian},
    month        = {nov},
    year         = {2014},
    note         = {Computer program by the Digital Humanities Lab, Utrecht University. \url{}}