Implementation of a high performance architecture for managing and storing web-harvested collections

As institutions continue to grow their collections of web-harvested content, there is an ever-increasing need for tools that organize, index, and share this data. Even a modest web crawl covering a few web sites may generate millions of harvested documents. Repeating these crawls over time greatly increases the complexity of the stored data. Identifying the scope of a crawl, the location of a page within a crawl, and the differences between crawls over time becomes a challenging task. In this paper we describe a software architecture in use at the University of Maryland, designed to support research on quickly extracting information about crawls, including statistical information, and on indexing web content. While designed to support research, the software addresses many challenges that exist at any site that has to manage large sets of time-spanning data.

Our architecture consists of two components. The first is a database application for organizing WARC-based web data, called the WarcManager. The WarcManager was designed to track URL location and to allow easy extraction of crawl statistics across collections of WARC-stored data. It provides both a REST-based API to harvested data and a portal for viewing statistics across the collection. The second component is a high performance, HTTP-based storage service called the Simple Web-Accessible Preservation (SWAP) system. The SWAP system is a distributed, novel file placement and retrieval service. It has been designed to be minimally intrusive and to allow complete data recovery even in the absence of any SWAP software.
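
To make the idea of tracking URL location and extracting crawl statistics concrete, here is a minimal Python sketch. It is not the WarcManager itself, which the abstract describes only as a database application. The names `RecordLocation` and `UrlIndex`, their fields, and the in-memory dictionary are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordLocation:
    """Where one harvested document is stored (hypothetical field names)."""
    crawl_id: str   # identifier of the crawl that captured the page
    warc_file: str  # name of the WARC file holding the record
    offset: int     # byte offset of the record inside that file


class UrlIndex:
    """Toy in-memory stand-in for a URL-location database (an assumption)."""

    def __init__(self):
        self._by_url = defaultdict(list)

    def add(self, url, location):
        """Register one captured copy of a URL."""
        self._by_url[url].append(location)

    def locate(self, url, crawl_id=None):
        """Return stored locations for a URL, optionally limited to one crawl."""
        hits = self._by_url.get(url, [])
        return [h for h in hits if crawl_id is None or h.crawl_id == crawl_id]

    def crawl_stats(self):
        """Count stored records per crawl."""
        counts = defaultdict(int)
        for locations in self._by_url.values():
            for loc in locations:
                counts[loc.crawl_id] += 1
        return dict(counts)

    def new_urls(self, old_crawl, new_crawl):
        """URLs captured in new_crawl that were absent from old_crawl."""
        def urls_in(crawl):
            return {u for u, locs in self._by_url.items()
                    if any(l.crawl_id == crawl for l in locs)}
        return urls_in(new_crawl) - urls_in(old_crawl)


# Example usage with made-up crawl names and file names.
index = UrlIndex()
index.add("http://example.org/", RecordLocation("crawl-a", "a-0001.warc", 0))
index.add("http://example.org/", RecordLocation("crawl-b", "b-0001.warc", 0))
index.add("http://example.org/news", RecordLocation("crawl-b", "b-0001.warc", 4096))
```

Keying the index by URL and storing one location per capture keeps the three questions raised earlier (scope of a crawl, location of a page, differences between crawls) answerable from a single structure. A production system would back this with a database rather than a dictionary.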

These two components have been used to successfully support research into high performance indexing of web-based content. We will describe the implementation and performance characteristics of each component as well as possible real-world uses for the system.

Document Type: Research Article

Publication date: January 1, 2011

More about this publication?
  • The IS&T (digital) Archiving Conference offers a unique opportunity for imaging scientists and those working in the cultural heritage community (curators, archivists, librarians, photographers, etc.) from around the world to come together to discuss the most pressing issues related to the digital preservation and stewardship of hardcopy and other cultural heritage documents and objects. Authors come from museums, archives, libraries, government institutions, industry, and academia. Cutting-edge topics related to multispectral and 3D imaging, as well as best practices for workflow, sharing, standards, and asset/collection management and dissemination, are explored in papers presented at this annual, international event.

    Please note: For purposes of its Digital Library content, IS&T defines Open Access as papers that will be downloadable in their entirety for free in perpetuity. Copyright restrictions on papers vary; see individual paper for details.
