Some Statistical Properties and Zipf's Law in Korean Text Corpus
Author: Choi S.-W.
Source: Journal of Quantitative Linguistics, Volume 7, Number 1, April 2000, pp. 19-30(12)
Abstract:
Some statistical characteristics of Korean texts are analyzed by experiments on large corpora. We obtain the number of occurrences of syllables and of words in Korean texts. The entropy of syllables is estimated using a finite-context model; digram and trigram entropies of syllables are also estimated. The entropy of words is estimated using the same model. We examine how closely Korean text obeys the well-known Zipf's law. Two mathematical models are constructed by modifying the Mandelbrot distribution and are simulated for Korean texts. The coefficient B in the Mandelbrot distribution is determined for our models by experiment. We compare Zipf's law in Korean text with that in English and in French. According to Mandelbrot, B > 1 in all the usual cases; however, we obtain B < 1 in some range of the rank-frequency distribution of Korean text. We also checked that the coefficient B depends not on the kind or size of the corpus but on the language.
Document Type: Research article
DOI: 10.1076/0929-6174(200004)07:01;1-3;FT019
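
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the abstract does not specify the finite-context model or the two modified Mandelbrot models, so this sketch assumes the simplest cases, namely an order-0 (unigram) entropy estimate and a plain log-log least-squares fit of frequency against rank, which corresponds to the commonly cited Mandelbrot form f(r) = C / (r + V)^B with the shift V set to zero. The function names, the symbols C and V, and the rank-range parameters are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def unigram_entropy(symbols):
    """Order-0 entropy estimate in bits per symbol (assumed simplest case)."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_frequency(symbols):
    """Frequencies sorted in descending order; index 0 holds rank 1."""
    return sorted(Counter(symbols).values(), reverse=True)


def fit_exponent(freqs, lo=1, hi=None):
    """Estimate B over ranks lo..hi by least squares on log f versus log r.

    Assumes f(r) = C / r**B, i.e. the Mandelbrot form with zero shift.
    Returns B, the negated slope of the fitted line.
    """
    if hi is None:
        hi = len(freqs)
    if hi - lo + 1 < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in range(lo, hi + 1)]
    ys = [math.log(freqs[r - 1]) for r in range(lo, hi + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


if __name__ == "__main__":
    # Toy input only; a real experiment would use syllables or words
    # drawn from a large corpus.
    tokens = "a b a c a b d a b c".split()
    print(unigram_entropy(tokens))
    print(fit_exponent(rank_frequency(tokens)))
```

Fitting over a restricted rank range (the `lo` and `hi` arguments) is one way to observe different values of B in different parts of the rank-frequency curve, which is the kind of range-dependent behaviour the abstract reports.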