Automated corpus data collection

Welcome. CorpusScraper is a small script that you can run from your bookmarks bar (a “bookmarklet”). It extracts data from an online text corpus and lets you save the data to a CSV file, in just a few clicks.

CorpusScraper can easily be extended to support other corpora. For copyright and citation information, see the bottom of this page.

Since version 1.0, the following corpora have been supported:

The following corpora from the Fundación Rafael Lapesa were added in version 2.0:

To use CorpusScraper, enter a word-in-context query in your corpus of choice and load the first page of search results. When the page has completely loaded, click the bookmarklet in your bookmarks bar. A progress bar is shown during extraction, which takes about a second per page. When extraction finishes, the page turns white and a text box appears containing the CSV data, already selected so that you can copy it right away.

Paste the text into an empty plain text file (using e.g. Notepad, TextEdit or gedit), choose UTF-8 as the file encoding and save the file with a .csv extension. The resulting file can be opened with Excel or SPSS. It uses semicolons to separate the values and double quotes to mark text values.
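If you would rather process the saved file with a script than with Excel or SPSS, the format described above (UTF-8, semicolon-separated, double-quoted text values) can be read with Python's standard csv module. The following is only a minimal sketch: the function names, the sample rows and the file name results.csv are illustrative assumptions, and the actual columns depend on the corpus you queried.

```python
import csv
import io


def read_corpus_csv(source):
    """Parse semicolon-separated, double-quoted CSV text into a list of rows.

    `source` is any iterable of text lines, such as an open file or
    an io.StringIO object.
    """
    reader = csv.reader(source, delimiter=";", quotechar='"')
    return [row for row in reader]


def read_corpus_csv_file(path):
    """Read a saved CorpusScraper file from disk (hypothetical helper).

    "utf-8-sig" also accepts plain UTF-8 and drops the byte order mark
    that some editors (such as Notepad) may add when saving as UTF-8.
    """
    with open(path, encoding="utf-8-sig", newline="") as handle:
        return read_corpus_csv(handle)


# Made-up sample data; real column names and contents will differ.
sample = (
    '"left context";"match";"right context"\n'
    '"en un lugar";"de";"la Mancha"\n'
)
rows = read_corpus_csv(io.StringIO(sample))
for row in rows:
    print(row)
```

For a file saved as described above, you would call something like read_corpus_csv_file("results.csv"). Because text values are quoted, a semicolon inside a value does not split the value into two columns.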

Latest version

Version 2.0

Version 1.0

This was the first public version of CorpusScraper. If version 2.0 does not work with your corpus, you can check whether this version does.

Copyright and citing

© 2014, 2016, 2017 Digital Humanities Lab, Faculty of Humanities, Utrecht University
Author: Julian Gonggrijp

