Nearest neighbor search is the problem of computing the most similar objects to a given query object. Typically objects are represented as vectors in a high-dimensional space. You want to be able to compute similar objects fast even when the database contains millions of objects.

Varieties of the problem

  • Sparse Data vs. Dense Data
  • Systems (standalone systems vs. keyword search systems)
  • Online queries vs. batch
  • Queries vs. all nearest neighbors problem
  • Types of similarity

Applications

  • Monitoring computer clusters (http://www.msr-waypoint.com/en-us/people/sumann/correlation-sigmod10.pdf)
  • Anomaly detection in computer cluster: create a correlation matrix between different computers, over time
  • Anomaly detection in graphs: find near-bipartite cores (http://alexbeutel.com/papers/www2013_copycatch.pdf)

watch for unexpected changes in the structure of the matrix - Finding correlations in multiple data streams and with time lags: http://www.cs.cmu.edu/~christos/PUBLICATIONS/sigmod05-braid.pdf

References

Algorithms:

Dynamic Time Warping: https://github.com/lemire/lbimproved

Comments