Open-source Corpora: Using the net to fish for linguistic data

Source: International Journal of Corpus Linguistics, Volume 11, Number 4, 2006, pp. 435-462(28)

Publisher: John Benjamins Publishing Company

DOI: https://doi.org/10.1075/ijcl.11.4.05sha

The paper proposes a methodology for collecting “open-source” corpora, i.e., corpora that are automatically collected from the Internet and distributed as a list of links together with open-source software for recreating their full text. The result is a random snapshot of Internet pages that contain stretches of connected text in a given language. The paper discusses a methodology for acquiring such corpora, two ways of documenting them (by using a set of metatextual categories and by comparing them with frequency lists from existing corpora) and their function as benchmarks for comparing results of linguistic inquiry. Experiments with a variety of languages show that Internet-derived corpora can be used successfully in the absence of large representative corpora, which are rare and expensive to build.
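As an illustration of the second documentation method mentioned above (comparison to frequency lists from existing corpora), the following is a minimal sketch and not the procedure used in the paper. The tokeniser, the per-million normalisation, the additive smoothing constant and the two toy texts are all assumptions introduced here for demonstration.

```python
from collections import Counter
import re


def frequency_list(text):
    """Count lowercase alphabetic word tokens (a deliberately naive tokeniser)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def per_million(counts):
    """Convert raw counts to frequencies per million tokens."""
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def compare(web_counts, ref_counts, smoothing=1.0):
    """Rank words by the ratio of smoothed per-million frequencies (web / reference).

    The smoothing constant is an assumption; it avoids division by zero for
    words that are missing from one of the two lists.
    """
    web = per_million(web_counts)
    ref = per_million(ref_counts)
    vocab = set(web) | set(ref)
    ratios = {
        word: (web.get(word, 0.0) + smoothing) / (ref.get(word, 0.0) + smoothing)
        for word in vocab
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for an Internet-derived corpus and a reference corpus.
web_text = "click here to read the blog and click the link"
ref_text = "the minister said the report on the economy was read"

ranking = compare(frequency_list(web_text), frequency_list(ref_text))
for word, ratio in ranking[:3]:
    print(word, round(ratio, 2))
```

Words at the top of the ranking are relatively more frequent in the Internet-derived sample, and words at the bottom are more frequent in the reference list. With real corpora, a significance measure such as log-likelihood would likely be a better choice than a plain ratio.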

Keywords: Internet; corpus composition; frequency lists; representative corpora

Document Type: Research Article

Affiliations: University of Leeds

Publication date: 01 January 2006
