Extracting geographic features from the Internet to automatically build detailed regional gazetteers
Abstract: The utility of any application that incorporates a gazetteer hinges on a simple fact: the resulting system will only be as useful, complete, or accurate as the underlying gazetteer itself. A major issue confronting gazetteers used in systems today is that they are not complete and measures of their accuracy are largely unknown. In this paper we describe a methodology which addresses this problem by automatically generating highly complete and detailed regional gazetteers from Internet sources. We utilize information extraction and integration techniques to automatically obtain geographic features, with their associated footprints and feature types, from freely and widely available online data; the same techniques could be applied to create a gazetteer for nearly any area. We discuss the distinguishing characteristics of the generated gazetteer and extend previous work to define measures which can be used to assess the completeness and accuracy of gazetteers. Using these measures, the generated gazetteer is evaluated against the Alexandria Digital Library Gazetteer and the Los Angeles Comprehensive Bibliographic Database. Our results indicate that a gazetteer created by our methods will be at least as complete as any gazetteer currently available for certain feature classes, while falling short in others. We conclude by offering suggestions to address these shortcomings.
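The abstract does not state how the completeness measures are defined. As an illustration only, the following minimal Python sketch shows one plausible way a per-feature-type completeness score could be computed when comparing a generated gazetteer against a reference gazetteer. The function names, the (name, feature type) record layout, and the name-normalization rule are all assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match.
    (Hypothetical matching rule; real gazetteer matching is usually fuzzier.)"""
    return " ".join(name.lower().split())


def completeness_by_type(candidate, reference):
    """For each feature type present in `reference`, return the fraction of
    its distinct names that also appear in `candidate` under the same type.

    Both arguments are iterables of (name, feature_type) pairs; this record
    layout is an assumption made for the sketch.
    """
    cand = defaultdict(set)
    for name, ftype in candidate:
        cand[ftype].add(normalize(name))

    ref = defaultdict(set)
    for name, ftype in reference:
        ref[ftype].add(normalize(name))

    return {
        ftype: len(names & cand.get(ftype, set())) / len(names)
        for ftype, names in ref.items()
    }


# Toy records, invented purely for illustration.
generated = [("Griffith Park", "park"), ("Echo Park", "park"), ("USC", "school")]
reference = [
    ("griffith  park", "park"),
    ("Elysian Park", "park"),
    ("Hollywood Sign", "landmark"),
]
scores = completeness_by_type(generated, reference)
```

Reporting the score per feature type rather than as a single number mirrors the abstract's observation that completeness differs across feature classes. Footprint accuracy would need a separate, spatial comparison that this sketch does not attempt.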
Document Type: Research Article
Affiliations: 1: Department of Computer Science, University of Southern California, Los Angeles, CA 90089-0255, USA 2: Department of Geography, University of Southern California, Los Angeles, CA 90089-0255, USA 3: Department of Computer Science, University of Southern California, Marina del Rey, CA 90292, USA
Publication date: January 1, 2009