Approximate k-Closest-Pairs in Large High-Dimensional Data Sets

Authors: Angiulli, Fabrizio (1); Pizzuti, Clara (2)

Source: Journal of Mathematical Modelling and Algorithms, Volume 4, Number 2, June 2005, pp. 149-179(31)

Publisher: Springer

Abstract:

An approximate algorithm to efficiently solve the k-Closest-Pairs problem on large high-dimensional data sets is presented. The algorithm runs, for a suitable choice of the input parameters, in $\mathcal{O}(d^{2}nk)$ time, where d is the dimensionality and n is the number of points of the input data set, and requires linear space in the input size. It performs at most d+1 iterations. At each iteration a shifted version of the data set is sequentially scanned according to the order induced on it by the Hilbert space filling curve, and points whose contribution to the solution has already been analyzed are detected and eliminated. The pruning is lossless: the remaining points, along with the approximate solution found, can be used to compute the exact solution. If the data set is entirely pruned, then the algorithm returns the exact solution. We prove that the pruning ability of the algorithm is related to the nearest neighbor distance distribution of the data set and show that there exists a class of data sets for which the method, augmented with a final step that applies an exact method to the reduced data set, calculates the exact solution with the same time requirements.
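The scan-in-curve-order idea described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it substitutes the simpler Z-order (Morton) curve for the Hilbert curve, omits the pruning step entirely, and assumes that iteration s shifts every coordinate by s/(d+1), which the abstract does not state. The names `morton_key`, `approx_k_closest_pairs`, `window`, and `bits` are hypothetical, and points are assumed to lie in the unit hypercube [0, 1)^d.

```python
import heapq
import math


def morton_key(cell, bits):
    """Interleave the bits of non-negative integer coordinates (Z-order key).

    Stand-in for the Hilbert curve key used in the paper.
    """
    d = len(cell)
    key = 0
    for b in range(bits):
        for i, c in enumerate(cell):
            key |= ((c >> b) & 1) << (b * d + i)
    return key


def approx_k_closest_pairs(points, k, window=3, bits=10):
    """Return up to k (distance, (i, j)) tuples, smallest distance first.

    Assumes every point is a tuple of floats in [0, 1)^d. Each of the d+1
    iterations sorts a shifted copy of the data by curve key and compares
    each point only with its next `window` successors in that order.
    """
    d = len(points[0])
    scale = (1 << bits) - 1
    seen = {}  # (i, j) -> Euclidean distance; no pruning in this sketch

    for s in range(d + 1):
        shift = s / (d + 1)  # assumed shift amount, same on every axis

        def key(i):
            # Shifted coordinates lie in [0, 2); halve them, then quantize.
            cell = tuple(int((c + shift) / 2 * scale) for c in points[i])
            return morton_key(cell, bits)

        order = sorted(range(len(points)), key=key)
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + 1 + window]:
                pair = (min(i, j), max(i, j))
                if pair not in seen:
                    seen[pair] = math.dist(points[i], points[j])

    return heapq.nsmallest(k, ((dist, pair) for pair, dist in seen.items()))
```

With `window` set to at least n-1 every pair is compared and the result is exact; smaller windows trade accuracy for speed. Unlike the method in the abstract, this sketch stores all candidate pairs and therefore does not achieve linear space.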

Although we are able to guarantee a $\mathcal{O}(d^{1+1/t})$ approximation to the solution, where $t \in \{1, 2, \ldots, \infty\}$ identifies the Minkowski ($L_t$) metric of interest, experimental results give the exact k closest pairs for all the large high-dimensional synthetic and real data sets considered and show that the pruning of the search space is effective. We present a thorough scaling analysis of the algorithm for in-memory and disk-resident data sets showing that the algorithm scales well in both cases.
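For reference, the Minkowski $L_t$ distance and a brute-force exact baseline can be sketched in Python as follows. This is an illustration only, not code from the paper: the function names `minkowski` and `exact_k_closest_pairs` are hypothetical, and the quadratic all-pairs scan merely stands in for whatever exact method one would apply to the reduced data set.

```python
import heapq
import math
from itertools import combinations


def minkowski(p, q, t):
    """L_t distance between two points; t may be math.inf (maximum norm)."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(t):
        return max(diffs)
    return sum(x ** t for x in diffs) ** (1.0 / t)


def exact_k_closest_pairs(points, k, t=2):
    """Exact k closest pairs by comparing all n(n-1)/2 pairs."""
    pairs = (
        (minkowski(points[i], points[j], t), (i, j))
        for i, j in combinations(range(len(points)), 2)
    )
    return heapq.nsmallest(k, pairs)
```

For example, `minkowski((0, 0), (3, 4), t)` gives 7 for t = 1, 5 for t = 2, and 4 for t = infinity. The guaranteed approximation factor $\mathcal{O}(d^{1+1/t})$ tightens as t grows: up to constants it is $d^{2}$ for $L_1$, $d^{1.5}$ for $L_2$, and $d$ for $L_\infty$.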

Keywords: k-Closest-Pairs problem; Space Filling Curves; approximate algorithms

Document Type: Research article

DOI: 10.1007/s10852-004-4080-3

Affiliations:
1: ICAR-CNR, Università della Calabria, Via Pietro Bucci 41C, 87036, Rende (CS), Italy, Email: angiulli@icar.cnr.it
2: ICAR-CNR, Università della Calabria, Via Pietro Bucci 41C, 87036, Rende (CS), Italy, Email: icar@icar.cnr.it
