Part-of-speech ratios in English corpora

Author: Hardie, Andrew

Source: International Journal of Corpus Linguistics, Volume 12, Number 1, 2007, pp. 55-81(27)

Publisher: John Benjamins Publishing Company

Abstract:

Using part-of-speech (POS) tagged corpora, Hudson (1994) reports that approximately 37% of English tokens are nouns, where 'noun' is a superordinate category including nouns, pronouns and other word-classes. It is argued here that difficulties relating to the boundaries of Hudson's 'noun' category demonstrate that there is no uncontroversial way to derive such a superordinate category from POS tagging. Decisions regarding the boundary of the 'noun' category have small but statistically significant effects on the ratio that emerges for 'nouns' as a whole. Tokenisation and categorisation differences between tagging schemes make it problematic to compare the ratio of 'nouns' across different tagsets. The precise figures for POS ratios are therefore effectively artefacts of the tagset. However, these objections to the use of POS ratios do not apply to their use as a metric of variation for comparing data-sets tagged with the same tagging scheme.
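The boundary effect described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the twelve tagged tokens are invented, the tag labels are simplified hypothetical ones rather than those of any real tagset, and the two category definitions (`narrow` and `broad`) are not Hudson's or Hardie's actual groupings. It shows only that the same tagged text yields a different 'noun' ratio depending on which tags are counted as members of the superordinate category.

```python
from collections import Counter

# Toy (word, tag) pairs, invented for illustration; not drawn from
# LOB, Brown, or the BNC Sampler, and not using any real tagset.
tagged = [
    ("she", "PRON"), ("read", "VERB"), ("the", "DET"), ("book", "NOUN"),
    ("in", "PREP"), ("London", "PROPN"), ("and", "CONJ"), ("he", "PRON"),
    ("wrote", "VERB"), ("two", "NUM"), ("letters", "NOUN"), ("yesterday", "ADV"),
]

def pos_ratio(tokens, category):
    """Return the share of tokens whose tag belongs to `category`."""
    counts = Counter(tag for _, tag in tokens)
    total = sum(counts.values())
    return sum(counts[tag] for tag in category) / total

# Two hypothetical boundaries for a superordinate 'noun' category.
narrow = {"NOUN", "PROPN"}               # common and proper nouns only
broad = {"NOUN", "PROPN", "PRON", "NUM"}  # also pronouns and numerals

narrow_ratio = pos_ratio(tagged, narrow)
broad_ratio = pos_ratio(tagged, broad)

print(f"narrow 'noun' ratio: {narrow_ratio:.2f}")
print(f"broad 'noun' ratio:  {broad_ratio:.2f}")
```

Because both ratios are computed over the same tokens with the same tags, the difference between them comes entirely from the category boundary. Comparing ratios across corpora tagged with different schemes adds a further source of variation, since tokenisation can change the total token count as well.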

Keywords: PART-OF-SPEECH TAGGING; WORD-CLASS FREQUENCY; TEXT TYPE; TAGSET; TAGGING SCHEME; LOB; BROWN; BNC SAMPLER

Document Type: Research article

DOI: http://dx.doi.org/10.1075/ijcl.12.1.05har

Publication date: 2007-03-01
