# Scalable Similarity Search Meetings, Spring 2016

Theme: dimensionality and similarity search “beyond worst case”.

## List of possible papers to look at

### On dimensionality measures

Edgar Chávez and Gonzalo Navarro. Measuring the dimensionality of general metric spaces. Technical Report TR/DCC-00-1, Department of Computer Science, University of Chile, 2000. Submitted. Online ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/metricmodel.ps.gz. This is the original paper on “intrinsic dimensionality”

Sahinalp, S.C.; Tasan, M.; Macker, J.; Ozsoyoglu, Z.M., "Distance based indexing for string proximity search" in Data Engineering, 2003. Proceedings. 19th International Conference on , vol., no., pp.125-136, 5-8 March 2003 Does the “distribution of pairwise distances” experiment for a linguistics data set, considers theoretical implications of distance functions that are not even metrics, let alone nice spaces.

A. V. Arkhangel’skiˇı and L. S. Pontryagin, editors. General Topology I, volume 17 of Encyclopaedia of Mathematical Sciences. Springer, 1990. This book (especially Part II of it) covers topological dimension measures in good detail

Measuring the strangeness of strange attractors Physica D: Nonlinear Phenomena, Volume 9, Issue 1, Pages 189-208 Peter Grassberger, Itamar Procaccia Measuring dimensionality via correlation dimension, in a physics experiment

L.-S. Young. Dimension, entropy, and Lyapunov exponents. Ergodic Theory & Dynamical Systems, 2:109–124, 1982. Theoretical paper on the link between entropy and other dimensionality measures

Chávez, E., Navarro, G., Baeza-Yates, R., & Marroquín, J. L. (2001). Searching in metric spaces. ACM computing surveys (CSUR), 33(3), 273-321. They prove a bound on search time for some data structures in terms of “intrinsic dimensionality”

Christos Faloutsos and Ibrahim Kamel. Beyond uniformity and independence: analysis of R-trees using the concept of fractal dimension. PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

### On data structures working well “beyond worst case”:

Muja and Lowe: Scalable Nearest Neighbor Algorithms for High Dimensional Data. Pattern Analysis and Machine Intelligence (PAMI), Vol. 36, 2014.

Bilegsaikhan and Boytsov: Permutation Search Methods are Efficient, Yet Faster Search is Possible. PVLDB, 8(12):1618–1629, 2015. ✔

Dong, Charikar, and Li. Efficient k-nearest neighbor graph construction for generic similarity measures. Proceedings of the 20th international conference on World Wide Web, 2011.

Chavez, Figueroa, and Navarro. Effective proximity retrieval by ordering permutations. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30.9 (2008): 1647-1658.

Malkov, Ponomarenko, Logvinov, and Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45, 61-68, 2014.

Amato and Savino. Approximate similarity search in metric spaces using inverted files. Proceedings of the 3rd international conference on Scalable information systems, 2008.

Novak, David, and Pavel Zezula. M-Chord: a scalable distributed similarity search structure. Proceedings of the 1st international conference on Scalable information systems, 2006.

Beygelzimer, Kakade, and Langford, Cover trees for nearest neighbor. in Proc. International Conference on Machine Learning (ICML), 2006, p. 97–104. ✔

Lv, Josephson, Wang, Charikar, and Li: Multi-probe LSH: efficient indexing for high-dimensional similarity search. Proc. Int. Conference on Very Large Data Bases (VLDB), p. 950–961, 2007.

Tao, Yi, Sheng, and Kalnis: Quality and Efficiency in High Dimensional Nearest Neighbor Search. Proc. ACM Int. Conference on Management of Data (SIGMOD), p. 563–576, 2009.

Low and Zheng: Fast Top-K Similarity Queries Via Matrix Compression. Proc. ACM Int. Conference on Information and Knowledge Management (CIKM), pp. 2070- 2074, 2012. ✔

Ram and Gray. Maximum Inner-Product Search Using Cone Trees. Proc. Int. Conference on Knowledge Discovery and Data Mining (KDD), p. 931-939, 2012. ✔

Karger and Ruhl, Finding nearest neighbors in growth-restricted metrics, Proc. ACM Symposium on Theory of Computing (STOC), p. 741-750, 2002.

Cole and Gottlieb, Searching dynamic point sets in spaces with bounded doubling dimension, in Proc. Symposium on Theory of Computing (STOC), p. 574– 583, 2006.

Har-Peled and Kumar, Approximate nearest neighbor search for low dimensional queries. Proc. ACM-SIAM Symposium on Discrete Algorithms (SODA), p. 854–867, 2011.

Krauthgamer and Lee. Navigating nets: simple algorithms for proximity search. Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, 2004.

Bawa, Condie, and Ganesan, LSH forest: self-tuning indexes for similarity search, WWW '05 Proceedings of the 14th international conference on World Wide Web, 2005. ✔

Park, Cafarella, and Mozafari, Neighbor-sensitive hashing, VLDB 2015. ✔

Lejsek, Asmundsson, Jonsson, Amsaleg. NV-Tree: An Efficient Disk-Based Index for Approximate Search in Very Large High-Dimensional Collections

Curtin, Ryan. Faster Dual-Tree Traversal for Nearest Neighbors Search, SISAP 2015. ✔

Yingfan Liu, Jiangtao Cui, Zi Huang, Hui Li, Heng Tao Shen. SK-LSH : An Efficient Index Structure for Approximate Nearest Neighbor Search, VLDB 2014. ✔

Dong, Wei and Wang, Zhe and Josephson, William and Charikar, Moses and Li, Kai. Modeling LSH for Performance Tuning, Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM), 2008.

Ram et al. Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions, NIPS 2009.

Har-Peled and Mahabadi. Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search ✔

Hinneburg and Keim. Optimal Grid Clustering Towards Breaking the Curse of Dimensionality in High Dimensional Clustering

… Please add more papers, or put a comment if you think some paper is obsolete. The list is especially lacking papers from the last few years.

## Past seminars

### Wednesday February 3, 2016

Matthew: Measures of dimensionality slides

Francesco: Introducing the SSS cluster slides

## Wednesday February 17, 2016

Rasmus: Cover trees for nearest neighbor

Ninh: Permutation Search Methods are Efficient, Yet Faster Search is Possible

## March 2, 2016

Martin: LSH forest: Self-Tuning Indexes for Similarity Search, Faster Dual-Tree Traversal for Nearest Neighbor Search

Matthew: analysis of R-trees using fractal dimension (Faloutsos and Kamel)

## March 17, 2016

Ninh: Neighbour-sensitive hashing

Thomas: Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search (especially section 5)

## March 30, 2016

Round of presentations: What have we been working on recently?

## April 13, 2016

Matteo: Willi Mann, Nikolaus Augsten and Panagiotis Bouros, An Empirical Evaluation of Set Similarity Join Techniques, Proceedings of VLDB Endowment (PVLDB), Vol 9, No 9, May 2016. (to appear) Online http://ssjoin.dbresearch.uni-salzburg.at/downloads/projects/ssjoin/ssjoin_technical_report.pdf

Francesco: Yu. A. Malkov, D. A. Yashunin Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs http://arxiv.org/pdf/1603.09320v1.pdf

# April 26, 2016

Tobias: will talk about his FOCS submission

Matthew: will talk about his SISAP submission

Thomas: will talk about his ESA submission (with Martin and Rasmus)

# May 11, 2016

Rasmus talks about:

Low and Zheng: Fast Top-K Similarity Queries Via Matrix Compression.

Ram and Gray. Maximum Inner-Product Search Using Cone Trees.

# May 25, 2016

Matteo:

Martin: