Issues with the capture-recapture measure of vocabulary size

Notice

The full text article is available externally.

Source: The Mental Lexicon, Volume 10, Number 1, 2015, pp. 152-163(12)

Publisher: John Benjamins Publishing Company

DOI: https://doi.org/10.1075/ml.10.1.06nel

This short paper discusses shortcomings of the capture-recapture (CR) method of estimating vocabulary size (Meara & Olmos Alcoy, 2010; Williams, Segalowitz & Leclair, 2014). When sampling from a population generated by a power-law process (e.g., a Zipf distribution), the probability that any given member is selected depends on its frequency rank: high-ranked members (e.g., 1st, 2nd, 3rd) are much more likely to be selected than low-ranked members (e.g., 100th, 1000th). As a result, repeated samples tend to draw from the same limited group of words. The CR measure, however, assumes that every member is equally likely to be selected (a uniform distribution), and so drastically underestimates vocabulary size when applied to power-law data. Work with simulated data shows ways in which the degree of underestimation may be lessened. Applying these methods to real data shows effects parallel to those in the simulations.
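The underestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's own procedure: it assumes the CR estimate takes the standard Lincoln-Petersen form (types in sample A times types in sample B, divided by the types shared by both), and the vocabulary size, sample size, Zipf exponent, and function names are all hypothetical choices made for illustration.

```python
import random
from itertools import accumulate


def petersen_estimate(sample_a, sample_b):
    """Lincoln-Petersen estimate of population size from two samples.

    Assumption: word types (not tokens) are the 'captured individuals'.
    """
    types_a, types_b = set(sample_a), set(sample_b)
    overlap = len(types_a & types_b)
    if overlap == 0:
        raise ValueError("no shared types; the estimate is undefined")
    return len(types_a) * len(types_b) / overlap


def draw_tokens(vocab_size, n_tokens, rng, zipf_exponent=None):
    """Draw word tokens, identified by frequency rank 1..vocab_size.

    With zipf_exponent=None every word is equally likely (uniform);
    otherwise P(rank r) is proportional to r ** -zipf_exponent.
    """
    ranks = range(1, vocab_size + 1)
    if zipf_exponent is None:
        return rng.choices(ranks, k=n_tokens)
    cum_weights = list(accumulate(r ** -zipf_exponent for r in ranks))
    return rng.choices(ranks, cum_weights=cum_weights, k=n_tokens)


def simulate(vocab_size=5000, n_tokens=1000, seed=0):
    """Compare CR estimates under uniform and Zipf-like sampling."""
    rng = random.Random(seed)
    results = {}
    for label, exponent in (("uniform", None), ("zipf", 1.0)):
        sample_a = draw_tokens(vocab_size, n_tokens, rng, exponent)
        sample_b = draw_tokens(vocab_size, n_tokens, rng, exponent)
        results[label] = petersen_estimate(sample_a, sample_b)
    return results


if __name__ == "__main__":
    for label, estimate in simulate().items():
        print(f"{label:8s} estimated vocabulary size: {estimate:,.0f}")
```

Under these assumptions the uniform estimate should land near the true size of 5000, while the Zipf estimate should fall well below it, because both samples are dominated by the same high-frequency words and the overlap is inflated.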

Keywords: Zipf’s law; corpora; measurement; vocabulary

Document Type: Research Article

Publication date: 01 January 2015
