The PDF file you selected should load here if your Web browser has a PDF reader plug-in installed (for example, a recent version of Adobe Acrobat Reader).

If you would like more information about how to print, save, and work with PDFs, Highwire Press provides a helpful Frequently Asked Questions about PDFs.

Alternatively, you can download the PDF file directly to your computer, from where it can be opened using a PDF reader. To download the PDF, click the Download link above.

Fullscreen Fullscreen Off


Background/Objectives: Similarity search is a fundamental task in many domains. For large volumes of data, it is sometimes preferable to obtain approximate results instead of pursuing exact results that require long computation times. Methods/Statistical Analysis: The proposed method expresses a query using a fuzzy ball for the numerical data space and a conjunction of ground or general terms for categorical domains. It exploits a dual locality-sensitive hashing technique that constructs separate locality-sensitive hashing tables for the numerical and categorical data spaces. It determines buckets for a query by considering the relative distances of the numerical part of the query to the subspace boundaries and using concept hierarchies for the categorical attributes. It may also use a Bloom filter to select the candidate data set to be examined from among the determined buckets. Findings: The proposed approximate search technique was applied to two data sets. The portions of data to be examined were 0.02712% and 0.00084% for the first and the second data set. The average numbers of numerical buckets examined were 1.42 and 2.53 for the data sets, respectively. The figures mean that the proposed method significantly reduces the number of candidates to be considered and hence saves greatly computation cost. In addition, the proposed method is unique that it can be applied to data set with both numerical and categorical attributes. Improvements/Applications: The proposed locality-sensitive hashing method can be applied to approximate search tasks for large volume of data containing both numerical and categorical attributes.

Keywords

Locality-Sensitive Hashing, Nearest Neighbor Search, Similarity Search, Space Partitioning.
User