Implementation of a high performance architecture for managing and storing web-harvested collections
Our architecture consists of two components. The first is the WarcManager, a database application for organizing WARC-based web data. The WarcManager was designed to track URL location and to allow easy extraction of crawl statistics across collections of WARC-stored data. It provides both a REST-based API to harvested data and a portal for viewing statistics across the collection. The second component is a high-performance, HTTP-based storage service called the Simple Web-Accessible Preservation (SWAP) system. The SWAP system is a novel, distributed file placement and retrieval service. It has been designed to be minimally intrusive and to allow complete data recovery even in the absence of any SWAP software.
These two components have been used successfully to support research into high-performance indexing of web-based content. We will describe the implementation and performance characteristics of each component, as well as possible real-world uses for the system.
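The URL-location tracking attributed to the WarcManager can be illustrated with a minimal sketch. It is not the authors' implementation, which the abstract does not describe. It assumes an uncompressed WARC file (real collections are often gzip-compressed per record), and the function name `index_warc` is hypothetical. The sketch relies only on the standard WARC record layout: a `WARC/` version line, named header fields including `WARC-Target-URI` and `Content-Length`, a blank line, and then the content block.

```python
import io


def index_warc(stream):
    """Map each WARC-Target-URI to the (offset, length) of its record.

    Hypothetical helper: assumes `stream` is a seekable binary stream
    over an UNCOMPRESSED WARC file. `length` covers the record header
    plus content block, not the trailing blank-line separator.
    """
    index = {}
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records

        # Read "Name: value" header lines until the blank line.
        headers = {}
        for raw in iter(stream.readline, b"\r\n"):
            if not raw:
                break  # truncated file
            name, _, value = raw.partition(b":")
            headers[name.strip().lower()] = value.strip()

        # Skip over the content block without reading it.
        stream.seek(int(headers[b"content-length"]), io.SEEK_CUR)

        uri = headers.get(b"warc-target-uri")
        if uri:
            index[uri.decode()] = (offset, stream.tell() - offset)
    return index


if __name__ == "__main__":
    body = b"hello"
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.org/\r\n"
        b"Content-Length: " + str(len(body)).encode() + b"\r\n"
        b"\r\n" + body + b"\r\n\r\n"
    )
    print(index_warc(io.BytesIO(record)))
```

An index of this shape lets a service answer "where is this URL stored?" with a single seek into the file, which is one plausible reading of what tracking URL location involves.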
Document Type: Research Article
Publication date: January 1, 2011
The IS&T (digital) Archiving Conference offers a unique opportunity for imaging scientists and those working in the cultural heritage community (curators, archivists, librarians, photographers, etc.) from around the world to come together to discuss the most pressing issues related to the digital preservation and stewardship of hardcopy and other cultural heritage documents and objects. Authors come from museums, archives, libraries, government institutions, industry, and academia. Cutting-edge topics related to multispectral and 3D imaging, as well as best practices for workflow, sharing, standards, and asset/collection management and dissemination, are explored in papers presented at this annual, international event.
Please note: For purposes of its Digital Library content, IS&T defines Open Access as papers that will be downloadable in their entirety for free in perpetuity. Copyright restrictions on papers vary; see the individual paper for details.