Improvement of Crawling Time of Nutch by Performance-Based Data Distribution
In a previous paper, we proposed a system for detecting new HTML5 security vulnerabilities, built on Apache Nutch, a distributed web crawler. However, because Nutch partitions target URLs by domain, performance in the fetch phase degrades when the number of documents per domain is unbalanced. To improve Nutch's crawling time, we propose a method that partitions and distributes target URLs based on the performance of the slave nodes. With the proposed performance-based distribution, we reduced crawling time by about 62.2% compared with Nutch's domain-based distribution.
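The contrast between the two distribution strategies can be illustrated with a minimal sketch. The abstract does not give the paper's actual algorithm, so everything below is an assumption for illustration: the function names, the toy host hash, and the idea of a single numeric performance score per slave node (for example, pages fetched per second in an earlier run) are all hypothetical, and the proportional split shown here is only one plausible way to realise performance-based distribution.

```python
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_domain(urls, num_nodes):
    """Domain-based baseline: every URL of a host lands on the same node."""
    parts = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        # Deterministic toy hash (hypothetical, not Nutch's own partitioner).
        node = sum(host.encode()) % num_nodes
        parts[node].append(url)
    return dict(parts)


def partition_by_performance(urls, node_scores):
    """Give each node a share of URLs proportional to its assumed score."""
    total = sum(node_scores.values())
    nodes = list(node_scores)
    # Ideal (fractional) quota per node.
    quotas = {n: len(urls) * node_scores[n] / total for n in nodes}
    counts = {n: int(quotas[n]) for n in nodes}
    # Hand leftover URLs to the nodes with the largest fractional remainder.
    leftover = len(urls) - sum(counts.values())
    by_remainder = sorted(nodes, key=lambda n: quotas[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    # Slice the URL list into consecutive chunks of the computed sizes.
    parts, start = {}, 0
    for n in nodes:
        parts[n] = urls[start:start + counts[n]]
        start += counts[n]
    return parts


if __name__ == "__main__":
    # One large domain and one small one: the unbalanced case from the abstract.
    urls = [f"http://big.example/page{i}" for i in range(9)]
    urls.append("http://small.example/index")
    print({k: len(v) for k, v in partition_by_domain(urls, 2).items()})
    scores = {"fast": 3.0, "slow-1": 1.0, "slow-2": 1.0}
    print({k: len(v) for k, v in partition_by_performance(urls, scores).items()})
```

In the domain-based baseline, the node that receives the large domain carries all of its pages regardless of how many nodes are available, whereas the proportional split sizes each node's share to its score. Note that this sketch ignores per-host politeness limits, which a real crawler would still need to enforce when one host's URLs are spread over several nodes.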
Document Type: Research Article
Affiliations: Department of Computer Engineering, Chungnam National University, Korea
Publication date: November 1, 2016
Publication: ADVANCED SCIENCE LETTERS, an international peer-reviewed journal with wide-ranging coverage that consolidates research activities in all areas of (1) Physical Sciences, (2) Biological Sciences, (3) Mathematical Sciences, (4) Engineering, (5) Computer and Information Sciences, and (6) Geosciences. It publishes original short communications, full research papers, and timely brief (mini) reviews, with authors' photos and biographies, covering basic and applied research and current developments in the educational aspects of these scientific areas.