The advantage of using relational databases for large corpora: Speed, advanced queries, and unlimited annotation
Relational databases can be used to create large corpora that provide both very good search performance and a wide range of queries. This paper outlines how this approach has been used to create the Corpus del Español, which contains 100 million words of Spanish texts from the 1200s-1900s. The main databases are composed of n-gram tables (all unique 1-, 2-, 3-, and 4-word sequences) and the associated frequency of each n-gram in each century (historical Spanish) and register (Modern Spanish). These tables are then joined to other tables containing part of speech, lemma, synonyms, and user-defined lists of words and lemmas. There is essentially no limit to the amount of annotation that can be added in additional tables (with little or no impact on performance), and the SQL-based queries allow a wide range of searches that are not available with traditional corpora.
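As a rough illustration of the table design described above, the following sketch builds a tiny in-memory database with Python's standard sqlite3 module and joins a 2-gram frequency table to a word-level annotation table. The table names (`lexicon`, `bigrams`), the column names, the function `lemma_adj_by_century`, and all frequencies are hypothetical and invented for the example; they are not taken from the Corpus del Español itself.

```python
import sqlite3

# Hypothetical schema and toy data: names and numbers are assumptions made
# for illustration, not the actual Corpus del Español tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (word TEXT PRIMARY KEY, lemma TEXT, pos TEXT);
CREATE TABLE bigrams (w1 TEXT, w2 TEXT, century INTEGER, freq INTEGER);
""")
conn.executemany("INSERT INTO lexicon VALUES (?, ?, ?)", [
    ("casa", "casa", "noun"),
    ("casas", "casa", "noun"),
    ("grande", "grande", "adj"),
    ("grandes", "grande", "adj"),
    ("la", "el", "det"),
])
conn.executemany("INSERT INTO bigrams VALUES (?, ?, ?, ?)", [
    ("casa", "grande", 1500, 12),
    ("casas", "grandes", 1500, 7),
    ("casa", "grande", 1800, 30),
    ("la", "casa", 1800, 95),
])

def lemma_adj_by_century(conn, lemma):
    """Sum, per century, the frequency of 2-grams made of any form of
    `lemma` followed by a word tagged as an adjective."""
    return conn.execute("""
        SELECT b.century, SUM(b.freq)
        FROM bigrams AS b
        JOIN lexicon AS l1 ON l1.word = b.w1   -- annotation for first word
        JOIN lexicon AS l2 ON l2.word = b.w2   -- annotation for second word
        WHERE l1.lemma = ? AND l2.pos = 'adj'
        GROUP BY b.century
        ORDER BY b.century
    """, (lemma,)).fetchall()

result = lemma_adj_by_century(conn, "casa")
print(result)
```

In this sketch, a new layer of annotation (for example, synonyms) would be one more table joined on `word`, leaving the n-gram tables unchanged.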
Keywords: SQL; Spanish; historical; n-grams; relational databases
Document Type: Research Article
Publication date: 01 January 2005