Improvement of Crawling Time of Nutch by Performance-Based Data Distribution
In a previous paper, we proposed a system that detects new HTML5 security vulnerabilities, built on Apache Nutch, a distributed web crawler. However, because Nutch partitions target URLs by domain, performance in the fetch phase degrades when the number of documents per domain is unbalanced. To improve the crawling time of Nutch, we propose a method that partitions and distributes target URLs based on the performance of the slave nodes. With the proposed performance-based distribution, we reduced crawling time by about 62.2% compared to Nutch’s domain-based distribution.
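The contrast between the two partitioning strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use Nutch's API: the function names, the CRC32 host hash, and the integer `scores` list (a stand-in for whatever performance measure the paper uses for each slave node) are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse
import zlib


def partition_by_domain(urls, num_nodes):
    """Baseline: every URL of one host lands on the same node (hash of the host)."""
    parts = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        parts[zlib.crc32(host.encode()) % num_nodes].append(url)
    return [parts[i] for i in range(num_nodes)]


def partition_by_performance(urls, scores):
    """Give each node a share of the URLs proportional to its score.

    `scores` is a hypothetical list of positive integers, one per node;
    a higher score means a faster node.
    """
    total = sum(scores)
    quotas = [len(urls) * s // total for s in scores]
    # Integer division can leave a few URLs unassigned; give them to the fastest nodes.
    leftover = len(urls) - sum(quotas)
    fastest_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in fastest_first[:leftover]:
        quotas[i] += 1
    # Cut the URL list into consecutive slices of the computed sizes.
    parts, start = [], 0
    for quota in quotas:
        parts.append(urls[start:start + quota])
        start += quota
    return parts


if __name__ == "__main__":
    # One large domain and one small one: the unbalanced case described above.
    urls = [f"http://a.example/{i}" for i in range(10)] + [
        "http://b.example/0",
        "http://b.example/1",
    ]
    print([len(p) for p in partition_by_domain(urls, 3)])
    print([len(p) for p in partition_by_performance(urls, [3, 2, 1])])
```

With a skewed input such as this one, the domain-based split can leave a node idle while another fetches an entire large domain, whereas the proportional split sizes each node's share to its score. Note that this sketch ignores per-host politeness limits, which a real crawler would still need to respect when URLs of one host are spread across nodes.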
Keywords: Apache Nutch; Data Distribution; HTML5; Web Crawling
Document Type: Research Article
Affiliations: Department of Computer Engineering, Chungnam National University, Korea
Publication date: 01 November 2016
- ADVANCED SCIENCE LETTERS is an international peer-reviewed journal with very wide-ranging coverage. It consolidates research activities in all areas of (1) Physical Sciences, (2) Biological Sciences, (3) Mathematical Sciences, (4) Engineering, (5) Computer and Information Sciences, and (6) Geosciences, and publishes original short communications, full research papers, and timely brief (mini) reviews, with authors' photos and biographies, encompassing basic and applied research and current developments in educational aspects of these scientific areas.