
several edits to highdim and ml
rafalab committed Dec 18, 2023
1 parent 9806914 commit 0fcaacd
Showing 14 changed files with 827 additions and 1,351 deletions.
125 changes: 63 additions & 62 deletions highdim/dimension-reduction.qmd

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions highdim/intro-highdim.qmd
@@ -1,8 +1,8 @@
# High dimensional data {.unnumbered}

There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples.
There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter, we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples.

A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observations is associated with $28 \times 28 = 784$ pixels so we can't simply use subtraction as we would do if our data was one dimensional.
A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observation is associated with $28 \times 28 = 784$ pixels, so we can't simply use subtraction as we would if our data were one-dimensional.
Instead, we will define observations as *points* in a *high-dimensional* space and mathematically define a *distance*. Many machine learning techniques, discussed in the next part of the book, require this calculation.

Additionally, this part of the book discusses dimension reduction. Here we search of data summaries that result in more manageable lower dimension versions of the data, but preserve most or all the *information* we need. Here too we can use distance between observations as specific challenge: we will reduce the dimensions summarize the data into lower dimensions, but in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.
Additionally, this part of the book discusses dimension reduction. Here we search for data summaries that result in more manageable, lower-dimensional versions of the data, while preserving most or all of the *information* we need. Here too we use the distance between observations as a specific challenge: we want to summarize the data into lower dimensions in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.
45 changes: 23 additions & 22 deletions highdim/linear-algebra.qmd
@@ -12,9 +12,9 @@ Linear algebra is the main mathematical technique used to describe and motivate

## Matrix multiplication

A commonly used operation in data analysis is matrix multiplication. Here we define and motivate the operation.
A commonly used operation in data analysis is matrix multiplication. Here, we define and motivate the operation.

Linear algebra was born from mathematicians developing systematic ways to solve systems of linear equations, for example
Linear algebra originated from mathematicians developing systematic ways to solve systems of linear equations. For example:

$$
\begin{align}
@@ -26,7 +26,7 @@

Mathematicians figured out that by representing these linear systems of equations using matrices and vectors, predefined algorithms could be designed to solve any system of linear equations. A basic linear algebra class will teach some of these algorithms, such as Gaussian elimination, Gauss-Jordan elimination, and the LU and QR decompositions; these methods are usually covered in detail in university-level linear algebra courses.

To explain matrix multiplication, define two matrices $\mathbf{A}$ and $\mathbf{B}$
To explain matrix multiplication, define two matrices: $\mathbf{A}$ and $\mathbf{B}$

$$
\mathbf{A} =
@@ -44,7 +44,7 @@ b_{n1}&b_{n2}&\dots&b_{np}
\end{pmatrix}
$$

and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows:
and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code, we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows:

```{r, eval=FALSE}
m <- nrow(A)
@@ -123,7 +123,7 @@ x_n
\end{pmatrix}
$$

and rewriting the equation simply as
and rewriting the equation simply as:

$$
\mathbf{A}\mathbf{x} = \mathbf{b}
@@ -144,16 +144,17 @@ solve(A, b)
```
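To make this concrete, here is a minimal sketch. The system below is made up for illustration and is not necessarily the one used earlier in the chapter: `solve` returns the solution, and `%*%` lets us verify it.

```{r, eval=FALSE}
# A made-up 3 x 3 system, used only to illustrate solve() and %*%
A <- matrix(c(1, 3, -2,
              3, 5, 6,
              2, 4, 3), nrow = 3, byrow = TRUE)
b <- c(5, 7, 8)

x <- solve(A, b)                    # solution of A x = b
A %*% x                             # should reproduce b
round(solve(A) %*% A, digits = 10)  # 1s on the diagonal, 0s elsewhere
```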

:::{.callout-note}
The function `solve` works well when dealing with small to medium-sized matrices with a similar range for each column and not too many 0s. The function `qr.solve` can be used when this is not the case.
:::
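A minimal illustration of `qr.solve`, with made-up inputs (the names `A_tall` and `b_tall` are hypothetical): for square systems it can be used as a drop-in replacement for `solve`, and for systems with more equations than unknowns it returns the least-squares solution.

```{r, eval=FALSE}
# An overdetermined system: 6 equations, 2 unknowns (made-up data)
set.seed(2023)
A_tall <- matrix(rnorm(6 * 2), nrow = 6, ncol = 2)
b_tall <- rnorm(6)
qr.solve(A_tall, b_tall)  # least-squares solution via the QR decomposition
```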

## The identity matrix

The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the matrix.
The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the same matrix. In particular, for any vector $\mathbf{x}$:

$$
\mathbf{I}\mathbf{x} = \mathbf{x}
$$ If you do some math with the definition of matrix multiplication you will realize that $\mathbf{1}$ is a matrix with the same number of rows and columns (refereed to as square matrix) with 0s everywhere except the diagonal:
$$ If you do some math with the definition of matrix multiplication, you will realize that $\mathbf{I}$ is a matrix with the same number of rows and columns (referred to as a square matrix) with 0s everywhere except the diagonal:
$$
\mathbf{I}=\begin{pmatrix}
@@ -162,7 +163,7 @@
\vdots&\vdots&\ddots&\vdots\\
0&0&\dots&1
\end{pmatrix}
$$ It also implies that due to the definition of an inverse matrix we have
$$ It also implies that, due to the definition of an inverse matrix, we have:
$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}
@@ -178,7 +179,7 @@ solve(A) %*% b
Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. For example, most machine learning techniques rely on being able to define distances between observations, using features or predictors. Clustering algorithms, for instance, search for observations that are *similar*. But what does this mean mathematically?
To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can represented in polar coordinates as:
To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can be represented in polar coordinates as:
```{r, echo=FALSE, fig.asp=0.7}
draw.circle <- function(angle, start = 0, center = c(0,0), r = 0.25){
@@ -200,7 +201,7 @@ text(cos(theta), sin(theta), expression('(' * x[1] * ',' * x[2] * ') = (' * pha
draw.circle(theta)
```
with $\theta = \arctan{\frac{x2}{x1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as two dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. The point of defining the norm is that we can extrapolated the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as:
with $\theta = \arctan{\frac{x_2}{x_1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as a two-dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. The point of defining the norm is that we can extrapolate the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as:
$$
||\mathbf{x}|| = \sqrt{x_1^2 + x_2^2 + \dots + x_p^2}
@@ -212,7 +213,7 @@
||\mathbf{x}||^2 = \mathbf{x}^\top\mathbf{x}
$$
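As a quick sketch (the vector below is made up), we can compute a norm directly from the definition, or with `crossprod`, which is introduced below:

```{r, eval=FALSE}
x <- c(1, -2, 0.5, 3, -1)  # a made-up 5-dimensional vector
sqrt(sum(x^2))             # norm from the definition
sqrt(crossprod(x))         # same value, computed as sqrt(x'x)
```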
To define distance, suppose we have two two-dimensional points $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using euclidean distance.
To define distance, suppose we have two two-dimensional points: $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using Euclidean distance:
```{r, echo=FALSE, fig.asp=0.7}
rafalib::mypar()
@@ -262,15 +263,15 @@ We can compute the distances between each pair using the definitions we just lea
c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt()
```
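The same pairwise distances can also be obtained with the `dist` function, used later in this chapter. Here is a small self-contained check with made-up vectors rather than the digit data:

```{r, eval=FALSE}
u_1 <- c(0, 1, 2, 3)
u_2 <- c(1, 1, 2, 5)
u_3 <- c(9, 0, 4, 1)

# Distances computed from the definition...
c(sqrt(sum((u_1 - u_2)^2)), sqrt(sum((u_1 - u_3)^2)), sqrt(sum((u_2 - u_3)^2)))

# ...match the entries returned by dist()
dist(rbind(u_1, u_2, u_3))
```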
In R, the function `crossprod(x)` is convenient for computing norms it multiplies `t(x)` by `x`
In R, the function `crossprod(x)` is convenient for computing norms. It multiplies `t(x)` by `x`:
```{r}
c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt()
```
Note `crossprod` takes a matrix as the first argument and therefore the vectors used here are being coerced into single column matrices. Also note that `crossprod(x,y)` multiples `t(x)` by `y`.
Note that `crossprod` takes a matrix as the first argument. As a result, the vectors used here are being coerced into single-column matrices. Also, note that `crossprod(x,y)` multiplies `t(x)` by `y`.
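A quick check of this behavior, with made-up vectors:

```{r, eval=FALSE}
v <- c(1, 2, 3)
w <- c(4, 5, 6)
crossprod(v)     # 1 x 1 matrix equal to sum(v^2), i.e., t(v) %*% v
crossprod(v, w)  # 1 x 1 matrix equal to sum(v * w), i.e., t(v) %*% w
```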
We can see that the distance is smaller between the first two. This is in agreement with the fact that the first two are 2s and the third is a 7.
We can see that the distance is smaller between the first two. This agrees with the fact that the first two are 2s and the third is a 7.
```{r}
y[c(6, 17, 16)]
@@ -289,7 +290,7 @@ There are several machine learning related functions in R that take objects of c
d
```
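A note on `dist` objects, illustrated here with a made-up matrix: they store only the $n(n-1)/2$ distinct pairwise distances, and `as.matrix` converts them to a full symmetric matrix, which makes it easy to index specific pairs.

```{r, eval=FALSE}
set.seed(1)
m_small <- matrix(rnorm(15), nrow = 5)  # 5 made-up observations, 3 features
d_small <- dist(m_small)

class(d_small)            # "dist"
length(d_small)           # 5*4/2 = 10 pairwise distances
as.matrix(d_small)[1, 2]  # distance between observations 1 and 2
```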
We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image:
We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image:
```{r distance-image, fig.width = 4, fig.height = 4, eval=FALSE}
d <- dist(x[1:300,])
@@ -318,7 +319,7 @@ image(as.matrix(d)[order(y[1:300]), order(y[1:300])])
We can think of all predictors $(x_{i,1}, \dots, x_{i,p})^\top$ for all observations $i=1,\dots,n$ as $n$ $p$-dimensional points. A *space* can be thought of as the collection of all possible points that should be considered for the data analysis in question. This includes points we could see, but have not yet observed. In the case of the handwritten digits, we can think of the predictor space as any point $(x_{1}, \dots, x_{p})^\top$ as long as each entry $x_i, \, i = 1, \dots, p$ is between 0 and 255.
Some Machine Learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$, a minimum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy
Some machine learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$ and a maximum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy:
$$
|| \mathbf{x} - \mathbf{x}_0 || \leq r.
@@ -330,7 +331,7 @@ Other machine learning algorithms partition the predictor space into non-overlap
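To make the neighborhood idea concrete, here is a minimal sketch with made-up data, a made-up center, and a made-up radius: we keep the observations whose distance to the center is at most $r$.

```{r, eval=FALSE}
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)  # made-up predictors: 100 observations, 2 features

x_0 <- c(0, 0)  # center of the neighborhood
r <- 1          # radius

# Euclidean distance from each observation to the center
d_0 <- sqrt(rowSums(sweep(X, 2, x_0)^2))

# Observations satisfying ||x - x_0|| <= r
neighborhood <- X[d_0 <= r, , drop = FALSE]
nrow(neighborhood)
```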
## Exercises
1\. Generate two matrix, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produce the same results as:
1\. Generate two matrices, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should be $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produces the same results as:
```{r, eval=FALSE}
m <- nrow(A)
@@ -354,7 +355,7 @@ x + y + z + w &= 10\\
\end{align}
$$
3\. Define `x`
3\. Define `x`:
```{r}
#| eval: false
@@ -364,7 +365,7 @@ x <- mnist$train$images[1:300,]
y <- mnist$train$labels[1:300]
```
and compute the distance matrix
and compute the distance matrix:
```{r}
#| eval: false
@@ -373,8 +374,8 @@ d <- dist(x)
class(d)
```
Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself which we know it is 0. Can you predict what digit is represented by the second row of `x`?
Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself, which we know is 0. Can you predict what digit is represented by the second row of `x`?
4\. Use the `apply` function and matrix algebra to compute the distance between the second digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate as boxplot as in exercise 2 and predict what digit is the fourth row.
4\. Use the `apply` function and matrix algebra to compute the distance between the fourth digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate a boxplot as in exercise 3 and predict what digit is represented by the fourth row.
5\. Compute the distance between each feature and the feature representing the middle pixel (row 14, column 14). Create an image plot in which the distance is shown with color at each pixel position.
