On the Relative Influence of Corpus and Dictionary Size in a Study Using Non-Parallel Corpora
Abstract: We conducted an experiment on Japanese-to-German translation of two-part compound nouns via their components, using a small dictionary and a large target-language (TL) corpus. As TL translation variants, we considered expressions containing adjectives or genitive adjuncts, as well as various forms of the first component of a German compound. Verification in a TL corpus is a good means of deciding at least among these forms. Obtaining significant statistics from corpora requires large quantities of data. Because parallel data are still quite scarce, monolingual corpora are an alternative, but using them requires a dictionary. In our study, insufficient dictionary size was a much bigger obstacle than corpus size. We tried to quantify the relative influence of the two resources in order to assess system balance. We predict that a medium-sized dictionary of about 100,000 entries would give good coverage of compound noun components.
Document Type: Research Article
Publication date: August 1, 2001
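
Illustrative sketch: the corpus-verification step mentioned in the abstract (choosing among forms of the first component of a German compound) might look roughly like the following. This is not the authors' implementation. The list of linking elements, the function names, and the toy corpus are all assumptions made for illustration. The dictionary lookup of the Japanese components and the adjective and genitive variants are left out.

```python
from collections import Counter

# Common German linking elements (Fugenelemente); "" means plain concatenation.
# Assumption: the study's actual inventory of forms is not given in the abstract.
LINKING_ELEMENTS = ["", "s", "es", "n", "en", "er", "e"]


def compound_candidates(first, second):
    """Build single-word compound candidates from two German nouns."""
    return [first + link + second.lower() for link in LINKING_ELEMENTS]


def best_candidate(first, second, corpus_counts):
    """Return the candidate attested most often in the corpus, or None."""
    scored = [(corpus_counts[c], c) for c in compound_candidates(first, second)]
    count, form = max(scored, key=lambda pair: pair[0])
    return form if count > 0 else None


# Toy target-language corpus (hypothetical; a real run would need a large corpus).
tokens = "die Arbeitszeit ist lang und die Arbeitszeit bleibt lang".split()
corpus_counts = Counter(tokens)

print(best_candidate("Arbeit", "Zeit", corpus_counts))
```

A candidate that never occurs in the corpus is rejected rather than guessed. The same principle explains why coverage depends on both resources: without a dictionary entry for a component there is no candidate to verify, however large the corpus is.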