
several edits to highdim and ml
rafalab committed Dec 18, 2023
1 parent 9806914 commit 0fcaacd
Showing 14 changed files with 827 additions and 1,351 deletions.
125 changes: 63 additions & 62 deletions highdim/dimension-reduction.qmd

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions highdim/intro-highdim.qmd
@@ -1,8 +1,8 @@
# High dimensional data {.unnumbered}

There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples.
There is a variety of computational techniques and statistical concepts that are useful for analysis of datasets for which each observation is associated with a large number of numerical variables. In this chapter, we provide a basic introduction to these techniques and concepts by describing matrix operations in R, dimension reduction, regularization, and matrix factorization. Handwritten digits data and movie recommendation systems serve as motivating examples.

A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observations is associated with $28 \times 28 = 784$ pixels so we can't simply use subtraction as we would do if our data was one dimensional.
A task that serves as motivation for this part of the book is quantifying the similarity between any two observations. For example, we might want to know how much two handwritten digits look like each other. However, note that each observation is associated with $28 \times 28 = 784$ pixels, so we can't simply use subtraction as we would if our data were one-dimensional.
Instead, we will define observations as *points* in a *high-dimensional* space and mathematically define a *distance*. Many machine learning techniques, discussed in the next part of the book, require this calculation.

Additionally, this part of the book discusses dimension reduction. Here we search of data summaries that result in more manageable lower dimension versions of the data, but preserve most or all the *information* we need. Here too we can use distance between observations as specific challenge: we will reduce the dimensions summarize the data into lower dimensions, but in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.
Additionally, this part of the book discusses dimension reduction. Here we search for data summaries that result in more manageable, lower-dimensional versions of the data, while preserving most or all of the *information* we need. Here too we use the distance between observations as a specific challenge: we want to summarize the data into lower dimensions in a way that preserves the distance between any two observations. We use *linear algebra* as a mathematical foundation for all the techniques presented here.
45 changes: 23 additions & 22 deletions highdim/linear-algebra.qmd
@@ -12,9 +12,9 @@ Linear algebra is the main mathematical technique used to describe and motivate

## Matrix multiplication

A commonly used operation in data analysis is matrix multiplication. Here we define and motivate the operation.
A commonly used operation in data analysis is matrix multiplication. Here, we define and motivate the operation.

Linear algebra was born from mathematicians developing systematic ways to solve systems of linear equations, for example
Linear algebra originated from mathematicians developing systematic ways to solve systems of linear equations. For example:

$$
\begin{align}
@@ -26,7 +26,7 @@

Mathematicians figured out that by representing these linear systems of equations using matrices and vectors, predefined algorithms could be designed to solve any system of linear equations. A basic linear algebra class will teach some of these algorithms, such as Gaussian elimination, Gauss-Jordan elimination, and the LU and QR decompositions; these methods are usually covered in detail in university-level linear algebra courses.

To explain matrix multiplication, define two matrices $\mathbf{A}$ and $\mathbf{B}$
To explain matrix multiplication, define two matrices: $\mathbf{A}$ and $\mathbf{B}$

$$
\mathbf{A} =
@@ -44,7 +44,7 @@ b_{n1}&b_{n2}&\dots&b_{np}
\end{pmatrix}
$$

and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows:
and define the product of matrices $\mathbf{A}$ and $\mathbf{B}$ as the matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$ that has entries $c_{ij}$ equal to the sum of the component-wise product of the $i$th row of $\mathbf{A}$ with the $j$th column of $\mathbf{B}$. Using R code, we can define $\mathbf{C}= \mathbf{A}\mathbf{B}$ as follows:

```{r, eval=FALSE}
m <- nrow(A)
@@ -123,7 +123,7 @@ x_n
\end{pmatrix}
$$

and rewriting the equation simply as
and rewriting the equation simply as:

$$
\mathbf{A}\mathbf{x} = \mathbf{b}
@@ -144,16 +144,17 @@ solve(A, b)
```
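To make this concrete, here is a minimal sketch. The system below is made up for illustration and is not necessarily the one used earlier in the chapter: `solve` returns the solution, and `%*%` lets us verify it.

```{r, eval=FALSE}
# A made-up 3 x 3 system, used only to illustrate solve() and %*%
A <- matrix(c(1, 3, -2,
              3, 5, 6,
              2, 4, 3), nrow = 3, byrow = TRUE)
b <- c(5, 7, 8)

x <- solve(A, b)                    # solution of A x = b
A %*% x                             # should reproduce b
round(solve(A) %*% A, digits = 10)  # 1s on the diagonal, 0s elsewhere
```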

:::{.callout-note}
The function `solve` works well when dealing with small to medium-sized matrices with a similar range for each column and not too many 0s. The function `qr.solve` can be used when this is not the case.
:::
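A minimal illustration of `qr.solve`, with made-up inputs (the names `A_tall` and `b_tall` are hypothetical): for square systems it can be used as a drop-in replacement for `solve`, and for systems with more equations than unknowns it returns the least-squares solution.

```{r, eval=FALSE}
# An overdetermined system: 6 equations, 2 unknowns (made-up data)
set.seed(2023)
A_tall <- matrix(rnorm(6 * 2), nrow = 6, ncol = 2)
b_tall <- rnorm(6)
qr.solve(A_tall, b_tall)  # least-squares solution via the QR decomposition
```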

## The identity matrix

The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the matrix.
The identity matrix, represented with a bold $\mathbf{I}$, is like the number 1, but for matrices: if you multiply a matrix by the identity matrix, you get back the same matrix. In particular, for any vector $\mathbf{x}$:

$$
\mathbf{I}\mathbf{x} = \mathbf{x}
$$ If you do some math with the definition of matrix multiplication you will realize that $\mathbf{1}$ is a matrix with the same number of rows and columns (refereed to as square matrix) with 0s everywhere except the diagonal:
$$ If you do some math with the definition of matrix multiplication, you will realize that $\mathbf{I}$ is a matrix with the same number of rows and columns (referred to as a square matrix) with 0s everywhere except the diagonal:
$$
\mathbf{I}=\begin{pmatrix}
@@ -162,7 +163,7 @@
\vdots&\vdots&\ddots&\vdots\\
0&0&\dots&1
\end{pmatrix}
$$ It also implies that due to the definition of an inverse matrix we have
$$ It also implies that, due to the definition of an inverse matrix, we have:
$$
\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}
@@ -178,7 +179,7 @@ solve(A) %*% b
Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. For example, most machine learning techniques rely on being able to define distances between observations, using features or predictors. Clustering algorithms, for instance, search for observations that are *similar*. But what does this mean mathematically?
To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can represented in polar coordinates as:
To define distance, we introduce another linear algebra concept: the *norm*. Recall that a point in two dimensions can be represented in polar coordinates as:
```{r, echo=FALSE, fig.asp=0.7}
draw.circle <- function(angle, start = 0, center = c(0,0), r = 0.25){
@@ -200,7 +201,7 @@ text(cos(theta), sin(theta), expression('(' * x[1] * ',' * x[2] * ') = (' * pha
draw.circle(theta)
```
with $\theta = \arctan{\frac{x2}{x1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as two dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. The point of defining the norm is that we can extrapolated the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as:
with $\theta = \arctan{\frac{x_2}{x_1}}$ and $r = \sqrt{x_1^2 + x_2^2}$. If we think of the point as a two-dimensional column vector $\mathbf{x} = (x_1, x_2)^\top$, $r$ defines the norm of $\mathbf{x}$. The norm can be thought of as the *size* of the two-dimensional vector disregarding the direction: if we change the angle, the vector changes but the size does not. The point of defining the norm is that we can extrapolate the concept of *size* to higher dimensions. Specifically, we write the norm for any vector $\mathbf{x}$ as:
$$
||\mathbf{x}|| = \sqrt{x_1^2 + x_2^2 + \dots + x_p^2}
@@ -212,7 +213,7 @@
||\mathbf{x}||^2 = \mathbf{x}^\top\mathbf{x}
$$
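As a quick sketch (the vector below is made up), we can compute a norm directly from the definition, or with `crossprod`, which is introduced below:

```{r, eval=FALSE}
x <- c(1, -2, 0.5, 3, -1)  # a made-up 5-dimensional vector
sqrt(sum(x^2))             # norm from the definition
sqrt(crossprod(x))         # same value, computed as sqrt(x'x)
```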
To define distance, suppose we have two two-dimensional points $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using euclidean distance.
To define distance, suppose we have two two-dimensional points: $\mathbf{x}_1$ and $\mathbf{x}_2$. We can define how similar they are by simply using Euclidean distance:
```{r, echo=FALSE, fig.asp=0.7}
rafalib::mypar()
@@ -262,15 +263,15 @@ We can compute the distances between each pair using the definitions we just lea
c(sum((x_1 - x_2)^2), sum((x_1 - x_3)^2), sum((x_2 - x_3)^2)) |> sqrt()
```
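The same pairwise distances can also be obtained with the `dist` function, used later in this chapter. Here is a small self-contained check with made-up vectors rather than the digit data:

```{r, eval=FALSE}
u_1 <- c(0, 1, 2, 3)
u_2 <- c(1, 1, 2, 5)
u_3 <- c(9, 0, 4, 1)

# Distances computed from the definition...
c(sqrt(sum((u_1 - u_2)^2)), sqrt(sum((u_1 - u_3)^2)), sqrt(sum((u_2 - u_3)^2)))

# ...match the entries returned by dist()
dist(rbind(u_1, u_2, u_3))
```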
In R, the function `crossprod(x)` is convenient for computing norms it multiplies `t(x)` by `x`
In R, the function `crossprod(x)` is convenient for computing norms. It multiplies `t(x)` by `x`:
```{r}
c(crossprod(x_1 - x_2), crossprod(x_1 - x_3), crossprod(x_2 - x_3)) |> sqrt()
```
Note `crossprod` takes a matrix as the first argument and therefore the vectors used here are being coerced into single column matrices. Also note that `crossprod(x,y)` multiples `t(x)` by `y`.
Note that `crossprod` takes a matrix as the first argument. As a result, the vectors used here are being coerced into single-column matrices. Also, note that `crossprod(x,y)` multiplies `t(x)` by `y`.
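A quick check of this behavior, with made-up vectors:

```{r, eval=FALSE}
v <- c(1, 2, 3)
w <- c(4, 5, 6)
crossprod(v)     # 1 x 1 matrix equal to sum(v^2), i.e., t(v) %*% v
crossprod(v, w)  # 1 x 1 matrix equal to sum(v * w), i.e., t(v) %*% w
```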
We can see that the distance is smaller between the first two. This is in agreement with the fact that the first two are 2s and the third is a 7.
We can see that the distance is smaller between the first two. This agrees with the fact that the first two are 2s and the third is a 7.
```{r}
y[c(6, 17, 16)]
@@ -289,7 +290,7 @@ There are several machine learning related functions in R that take objects of c
d
```
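A note on `dist` objects, illustrated here with a made-up matrix: they store only the $n(n-1)/2$ distinct pairwise distances, and `as.matrix` converts them to a full symmetric matrix, which makes it easy to index specific pairs.

```{r, eval=FALSE}
set.seed(1)
m_small <- matrix(rnorm(15), nrow = 5)  # 5 made-up observations, 3 features
d_small <- dist(m_small)

class(d_small)            # "dist"
length(d_small)           # 5*4/2 = 10 pairwise distances
as.matrix(d_small)[1, 2]  # distance between observations 1 and 2
```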
We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image:
We can quickly see an image of the distances between observations using this function. As an example, we compute the distance between each of the first 300 observations and then make an image:
```{r distance-image, fig.width = 4, fig.height = 4, eval=FALSE}
d <- dist(x[1:300,])
@@ -318,7 +319,7 @@ image(as.matrix(d)[order(y[1:300]), order(y[1:300])])
We can think of all predictors $(x_{i,1}, \dots, x_{i,p})^\top$ for all observations $i=1,\dots,n$ as $n$ $p$-dimensional points. A *space* can be thought of as the collection of all possible points that should be considered for the data analysis in question. This includes points we could see, but have not yet observed. In the case of the handwritten digits, we can think of the predictor space as any point $(x_{1}, \dots, x_{p})^\top$ as long as each entry $x_i, \, i = 1, \dots, p$ is between 0 and 255.
Some Machine Learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$, a minimum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy
Some machine learning algorithms also define subspaces. A common approach is to define neighborhoods of points that are close to a *center*. We can do this by selecting a center $\mathbf{x}_0$ and a maximum distance $r$, and defining the subspace as the collection of points $\mathbf{x}$ that satisfy:
$$
|| \mathbf{x} - \mathbf{x}_0 || \leq r.
@@ -330,7 +331,7 @@ Other machine learning algorithms partition the predictor space into non-overlap
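To make the neighborhood idea concrete, here is a minimal sketch with made-up data, a made-up center, and a made-up radius: we keep the observations whose distance to the center is at most $r$.

```{r, eval=FALSE}
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)  # made-up predictors: 100 observations, 2 features

x_0 <- c(0, 0)  # center of the neighborhood
r <- 1          # radius

# Euclidean distance from each observation to the center
d_0 <- sqrt(rowSums(sweep(X, 2, x_0)^2))

# Observations satisfying ||x - x_0|| <= r
neighborhood <- X[d_0 <= r, , drop = FALSE]
nrow(neighborhood)
```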
## Exercises
1\. Generate two matrix, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produce the same results as:
1\. Generate two matrices, `A` and `B`, containing randomly generated and normally distributed numbers. The dimensions of these two matrices should be $4 \times 3$ and $3 \times 6$, respectively. Confirm that `C <- A %*% B` produces the same results as:
```{r, eval=FALSE}
m <- nrow(A)
@@ -354,7 +355,7 @@ x + y + z + w &= 10\\
\end{align}
$$
3\. Define `x`
3\. Define `x`:
```{r}
#| eval: false
@@ -364,7 +365,7 @@ x <- mnist$train$images[1:300,]
y <- mnist$train$labels[1:300]
```
and compute the distance matrix
and compute the distance matrix:
```{r}
#| eval: false
@@ -373,8 +374,8 @@ d <- dist(x)
class(d)
```
Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself which we know it is 0. Can you predict what digit is represented by the second row of `x`?
Generate a boxplot showing the distances for the second row of `d` stratified by digits. Do not include the distance to itself, which we know is 0. Can you predict what digit is represented by the second row of `x`?
4\. Use the `apply` function and matrix algebra to compute the distance between the second digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate as boxplot as in exercise 2 and predict what digit is the fourth row.
4\. Use the `apply` function and matrix algebra to compute the distance between the fourth digit `mnist$train$images[4,]` and all other digits represented in `mnist$train$images`. Then generate a boxplot as in exercise 3 and predict what digit is represented by the fourth row.
5\. Compute the distance between each feature and the feature representing the middle pixel (row 14, column 14). Create an image plot in which the distance is shown with color at each pixel position.
