## 2D Geometry

When the dataset has two features, one can visualize the whole dataset on a 2D plot.

Let the data be represented by the matrix $X$ where rows are data points and columns are features.

### Projection of a single point

For a one-dimensional projection we need a direction, which is given by a unit vector $d = [\beta_1, \beta_2]$. The projection of a data point $x = [x_1, x_2]$ is the dot product of its feature vector with the direction:

$$p = x \cdot d = x_1 \beta_1 + x_2 \beta_2$$

The above equation gives us a procedure to geometrically find the projection of a data point:

• First, we draw a directed line given by the vector $d$.
• Second, we find the point on the projection line which is closest to the data point. The closest point is the projection of the original data point.
• Third, we measure the directed distance (it can be negative) from the origin to the projected data point.

### Projection of a dataset

The projection of a single point is not informative unless we compare it with the projections of all the other points.

Projecting the whole dataset in one dimension gives us a histogram. This histogram is somewhat informative because it may reveal some structure from the original dataset.

Here is an example. This dataset has two clusters.

We can observe the clustering structure in the projection above. However, we cannot observe the clustering structure on the projection below.
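
As a rough illustration (not the exact dataset from the figures), here is a minimal R sketch with made-up two-cluster data where one direction produces a bimodal histogram and another does not:

```r
set.seed(1)
n <- 200
# Made-up data: two clusters separated along x2, with a large spread along x1
X <- cbind(rnorm(2 * n, sd = 5),
           c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 2, sd = 0.5)))

d_good <- c(0, 1)   # projects onto x2: the histogram shows two peaks
d_bad  <- c(1, 0)   # projects onto x1: the clusters are hidden

par(mfrow = c(1, 2))
hist(X %*% d_good, breaks = 30, main = "clusters visible", xlab = "projection")
hist(X %*% d_bad,  breaks = 30, main = "clusters hidden",  xlab = "projection")
```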

### The projection of largest variance does not necessarily reveal the clustering structure

It is known that PCA (a technique closely related to the SVD) finds the projections whose histograms are as wide as possible (a wide histogram means large variance); the projection of largest variance is the widest histogram. However, the largest-variance projection does not necessarily capture the clustering structure or the separability between the clusters. This is a well-known caveat, mentioned in many textbooks.

In the example below we have two clusters which are separable along the axis $x_2$. The dominant right singular vector, however, points along the direction $x_1$. The singular values are 15 and 6, which tells us that the spread (standard deviation) along $x_1$ is about 2.5 times larger than the spread along $x_2$ (so the variance is about 6 times larger). The clustering structure is nevertheless revealed by the second right singular vector of the SVD, which points along $x_2$.
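
A minimal R sketch of this situation, using the same kind of made-up data as in the sketch above (so the singular values will not be exactly 15 and 6, but the qualitative picture is the same):

```r
set.seed(2)
n <- 200
# Clusters separated along x2, much larger spread along x1
X <- cbind(rnorm(2 * n, sd = 5),
           c(rnorm(n, mean = -2, sd = 0.5), rnorm(n, mean = 2, sd = 0.5)))

s <- svd(X)
s$d        # the first singular value dominates the second
s$v[, 1]   # approximately [±1, 0]: the dominant direction is along x1
s$v[, 2]   # approximately [0, ±1]: the direction that separates the clusters
```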

### An example where no one dimensional projection can capture clustering structure

It is an interesting question whether, for a given dataset, we can always find a one-dimensional projection from which the clustering structure can be read off (i.e. the histogram of the projection has multiple peaks and you can guess the clusters from the histogram). This is not always possible.

Here is an example with three spheres at equal distances from each other. No matter which direction we project onto, we never see the three clusters clearly in one dimension.

The red and green lines are the first and second right singular vectors.

However, SVD is still useful for finding clusters in the dataset (see my blog post on SVD and clustering [TO APPEAR]).

### Interesting stuff is in the corners of the projection

Imagine that the data lies on a unit sphere; this is the case when cosine similarity is used. As can be seen from the figure below, points which are projected near the corners (the ends) of the projection line are likely to lie in a much smaller region of the sphere than points which are projected near the middle. One can even have two points projected in the middle which are diametrically opposite each other; in the figure, those are some of the yellow points.

The value of this observation is that if you take some points projected in one of the corners, those points are likely to be highly similar. At the same time, the SVD will choose projections for which more data is projected near the corners. This means that the SVD extracts some dominant patterns from the data.

The last statement is used as a heuristic for the initialization of the K-Means clustering algorithm. In the three-cluster example above (titled: no one dimensional projection can capture clustering structure), one can see that three of the four corners of the projections contain points from distinct clusters. Points taken from different corners could therefore be useful as initial centroids for K-Means, as sketched below.
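
A hedged sketch of this heuristic in R; the function name `corner_init_kmeans` and the specific choice of taking the single most extreme point in each corner are my own illustration, not a prescribed recipe:

```r
# Use extreme points of the top two SVD projections ("corners") as initial
# centroids for k-means. Assumes k is at most the number of distinct corner points.
corner_init_kmeans <- function(X, k) {
  s    <- svd(X)
  proj <- X %*% s$v[, 1:2]   # coordinates along the top two directions
  corners <- unique(c(which.min(proj[, 1]), which.max(proj[, 1]),
                      which.min(proj[, 2]), which.max(proj[, 2])))
  kmeans(X, centers = X[corners[1:k], , drop = FALSE])
}
```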

## High-Dimensional Geometry

Some of the intuition about one-dimensional projections carries over to high dimensions. In particular, a one-dimensional projection (the histogram) is still a view of the dataset in high dimensions; however, it is a much noisier view than a projection of two-dimensional data.

In high dimensions it is much more likely that most of the dataset ends up in the middle of the histogram. Why is that? Think of the case of documents on the web. What is a projection in this case? It is two sets of words, each representing a topic (why two sets? because the line on which we project has two directions). As an example of two sets of words, consider sports and politics. Projecting a point then means placing a document on the spectrum from sports to politics. Most documents will be about neither of those things, so those documents (data points) are projected close to 0. Projection is a dot product, and a dot product measures overlap: a random document will have almost no overlap with sports or politics. So, in high dimensions, for most possible projections most of the dataset is projected in the middle. (There is a formal mathematical theorem about this phenomenon.)

The intuition from 2D about the corners of the projections carries over. This means that we should seek interesting structure in the corners. However, there are too many corners in high dimensions and too little data projected on those corners.

## SVD/PCA is about seeking interesting projections

The following is the common notation: $X$ is the $n \times m$ data matrix (rows are data points, columns are features), $U$ and $V$ are orthonormal matrices whose columns are the left and right singular vectors, and $S$ is a diagonal matrix of singular values $s_1 \ge s_2 \ge \dots \ge 0$.

The SVD of $X$ is given by

$$X = U S V^T$$

We can write this equation as

$$X V = U S$$

This is possible because $V^T V = I$ ($V$ is an orthonormal matrix).

Let us take $V$ to have only one column (this is the truncated SVD with $k = 1$), i.e. $V$ is a single column vector $v$ of dimension $m$. This is a projection direction, as described above. $U$ is then a single column vector $u$ as well, but of dimension $n$, and $S$ is a single number $s_1$. The vector $X v = s_1 u$ contains the coordinates of the projected points, so $u$ gives these coordinates up to the scale factor $s_1$.

We know the following is the best rank-one approximation to the matrix $X$:

$$X \approx s_1 u v^T$$

Therefore, we can argue that the projection on the first component of the SVD is the projection that will in some sense “best preserve” the dataset in one dimension. Typically this first projection of the SVD will capture “global structure”.
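
A small R check of these statements on a made-up matrix: the projected coordinates $Xv$ equal $s_1 u$, and $s_1 u v^T$ is the rank-one reconstruction:

```r
set.seed(3)
X <- matrix(rnorm(50 * 4), nrow = 50)   # made-up 50 x 4 data matrix

s  <- svd(X)
v1 <- s$v[, 1]                          # projection direction (length m = 4)
u1 <- s$u[, 1]                          # length n = 50
s1 <- s$d[1]

max(abs(X %*% v1 - s1 * u1))            # ~ 0: X v equals s1 * u

X1 <- s1 * u1 %*% t(v1)                 # best rank-1 approximation of X
```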

One heuristic way to think about the first component is as follows: think of $v_i$ as (proportional to) the average of the $i$-th column, and of $u_j$ as (proportional to) the average of the $j$-th row. This is sometimes a reasonable approximation to the dominant SVD component when the dataset has a strong global structure. You can see that this is confirmed in the MNIST dataset example below.
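
A quick R check of this heuristic on made-up non-negative data with a strong global component; on such data the column means are nearly proportional to the first right singular vector (and the row means to the first left singular vector):

```r
set.seed(4)
# Made-up non-negative data dominated by a global (rank-1-like) pattern
X <- outer(runif(100, 1, 2), runif(20, 1, 2)) + matrix(runif(100 * 20, 0, 0.1), 100, 20)

s <- svd(X)
cor(colMeans(X), s$v[, 1])   # close to 1 or -1
cor(rowMeans(X), s$u[, 1])   # close to 1 or -1
```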

## PCA projections

One way to define "interesting" is "far from normal". One way to be far from normal is to have high variance (a wide histogram). PCA takes exactly this route: it finds the projections with the highest variance. One critical difference from the plain SVD is that PCA is the SVD of the data after subtracting the column means.
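
A minimal R sketch of this difference on a made-up matrix: `prcomp` (PCA) matches the SVD of the column-centered data rather than of the raw data:

```r
set.seed(5)
X <- matrix(rnorm(100 * 3, mean = 10), ncol = 3)   # made-up data with non-zero column means

Xc  <- scale(X, center = TRUE, scale = FALSE)      # subtract the column means
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# The PCA loadings agree with the right singular vectors of the centered data
# (up to the sign of each column).
svd(Xc)$v
pca$rotation
```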

## Example with MNIST (Image data)

Here we want to see what the projections produced by the SVD look like. The MNIST dataset consists of 42000 images, each represented as a 784-dimensional vector. One can obtain this digit recognizer dataset from Kaggle.

I use the following code to read the data and compute the matrix $V$:

train_data <- read.csv('../data/train.flann', header = FALSE, sep = " ")
X <- sqrt(data.matrix(train_data)) #square-root transform of the raw pixel values

num_examples <- dim(X)[1] #n
num_features <- dim(X)[2] #m

#The (uncentered) second-moment matrix; it equals the sample covariance only if the columns are centered
XtX <- (t(X) %*% X)/(num_examples - 1)
#s$v holds the right singular vectors of X; s$d holds the squared singular values of X divided by (num_examples - 1)
s <- svd(XtX)


The projection directions and the component weights are computed with the following code:

k <- 100  #number of singular vectors wanted
V <- s$v[, 1:k]  #top-k projection directions
lam <- 50  #regularization parameter
d <- s$d[1:k]
weights <- sqrt(d/(d + lam))  #per-component weights: components with small d are shrunk towards 0

### Finding similar words after application of the SVD

In order to verify that the SVD has been computed correctly, I run most-similar-word queries for a few words. This is mostly a sanity check: it may not work so well for all words, but it should work well for words which are strong topic indicators. The results are below.
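
The query code itself is not shown in this post; purely as an illustration, here is a hedged R sketch of how such a query could look, assuming an embedding matrix `V` with one row per vocabulary word, the per-component `weights` from above, and a character vector `vocab` of words (these names are assumptions, not the exact code behind the results):

```r
# Score every word against a query word by the dot product of the
# weight-scaled embeddings, and return the top matches.
most_similar <- function(word, V, weights, vocab, top = 10) {
  E      <- sweep(V, 2, weights, "*")          # scale each component by its weight
  scores <- E %*% E[which(vocab == word), ]    # dot products with the query word
  ord    <- order(scores, decreasing = TRUE)[1:top]
  data.frame(word = vocab[ord], score = scores[ord])
}
```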

neighbors of word windows
0)windows	0.500709023249
1)ms	0.0691616667061
2)microsoft	0.0557547217515
3)nt	0.0533303682273
4)dos	0.0464641006003
5)drivers	0.0365074723089
6)comp	0.0356562676008
7)running	0.0313832542514
8)ini	0.0311252761994
9)application	0.0299640996529

-----------------------------
neighbors of word jesus
0)jesus	0.168165123748
1)christ	0.0782742657812
2)christians	0.0565218749071
3)bible	0.0487185810181
4)christian	0.0407415891892
5)john	0.038291243194
6)sin	0.0368926182977
7)god	0.0343049378987
8)heaven	0.0293666059382
9)jews	0.0282645950168

-----------------------------
neighbors of word congress
0)clinton	0.0236133112056
1)president	0.0161455085553
2)going	0.0160453655028
3)congress	0.0108290114911
4)government	0.0107107140185
5)tax	0.00973937765693
6)bush	0.00902895572165
7)house	0.00886195906841
8)states	0.00885790105693
9)law	0.00824907856166