
Commit

typos
alexnones committed Dec 20, 2023
1 parent ace17d6 commit e91342f
Showing 6 changed files with 83 additions and 89 deletions.
16 changes: 8 additions & 8 deletions ml/algorithms.qmd
@@ -1,18 +1,18 @@
# Examples of algorithms {#sec-example-alogirhms}

-There are hundreds of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter we will be using the two predictor digits data introduced in @sec-two-or-seven to demonstrate how the algorithms work. We focus on the concepts and ideas behind the algorithms using illustrative datasets from the **dslabs** package.
+There are hundreds of machine learning algorithms. Here we provide a few examples spanning rather different approaches. Throughout the chapter, we will be using the two predictor digits data introduced in @sec-two-or-seven to demonstrate how the algorithms work. We focus on the concepts and ideas behind the algorithms using illustrative datasets from the **dslabs** package.

```{r warning = FALSE, message = FALSE, cache=FALSE}
library(tidyverse)
library(caret)
library(dslabs)
```

-Then in @sec-ml-in-practice we show an efficient way to implement these ideas using the **caret** package.
+Then, in @sec-ml-in-practice, we show an efficient way to implement these ideas using the **caret** package.

## Logistic regression

-In @sec-two-or-seven we used linear regression to predict classes by fitting the model
+In @sec-two-or-seven, we used linear regression to predict classes by fitting the model:

$$
p(\mathbf{x}) = \mbox{Pr}(Y=1 \mid X_1=x_1 , X_2 = x_2) =
@@ -25,7 +25,7 @@ fit_lm <- lm(y ~ x_1 + x_2, data = mutate(mnist_27$train,y = ifelse(y == 7, 1, 0
range(fit_lm$fitted)
```

-To avoid this we can apply the approach described in @sec-glm that is more appropriate for binary data. We write the model like this:
+To avoid this, we can apply the approach described in @sec-glm that is more appropriate for binary data. We write the model like this:


$$
@@ -50,15 +50,15 @@ mnist_27$true_p |> mutate(p_hat = p_hat) |>
stat_contour(breaks = c(0.5), color = "black")
```

-Just like regression, the decision rule is a line, a fact that can be corroborated mathematically, definint $g(x) = \log \{x/(1-x)\}$, we have:
+Just like regression, the decision rule is a line, a fact that can be corroborated mathematically, defining $g(x) = \log \{x/(1-x)\}$, we have:

$$
g^{-1}(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2) = 0.5 \implies
\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = g(0.5) = 0 \implies
x_2 = -\hat{\beta}_0/\hat{\beta}_2 -\hat{\beta}_1/\hat{\beta}_2 x_1
$$

-Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. We now described some techniques that estimate the conditional probability in a way that is more flexible.
+Thus, just like with regression, $x_2$ is a linear function of $x_1$. This implies that our logistic regression approach has no chance of capturing the non-linear nature of the true $p(\mathbf{x})$. FIX We now described some techniques that estimate the conditional probability in a way that is more flexible.
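
As a quick sanity check on the linear boundary derived above, a minimal sketch (object names such as `fit_glm` are illustrative): fit the logistic model with `glm`, compute the test-set accuracy, and read the implied boundary line off the estimated coefficients.

```{r}
# logistic regression on the two predictors; Pr(y = 7) is modeled
fit_glm <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, newdata = mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]

# implied boundary: x_2 = -beta_0/beta_2 - (beta_1/beta_2) x_1
b <- coef(fit_glm)
-b[1]/b[3]  # intercept of the boundary line
-b[2]/b[3]  # slope of the boundary line
```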

:::{.callout-note}
You are ready to do exercises 1 - 11.
@@ -91,7 +91,7 @@ plot_cond_prob <- function(p_hat = NULL){
train_knn <- knn3(y ~ ., k = 31, data = mnist_27$train)
```

-We introduced the kNN algorithm in @sec-knn-cv-intro. In @sec-mse-estimates we noted that $k=31$ provided the highest accuracy in the test set. Using $k=31$ we obtain an accuracy `r confusionMatrix(predict(train_knn, mnist_27$test, type = "class"),mnist_27$test$y)$overall["Accuracy"]`, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.
+We introduced the kNN algorithm in @sec-knn-cv-intro. In @sec-mse-estimates, we noted that $k=31$ provided the highest accuracy in the test set. Using $k=31$, we obtain an accuracy `r confusionMatrix(predict(train_knn, mnist_27$test, type = "class"),mnist_27$test$y)$overall["Accuracy"]`, an improvement over regression. A plot of the estimated conditional probability shows that the kNN estimate is flexible enough and does indeed capture the shape of the true conditional probability.
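
Written out as a separate step, the accuracy reported in the paragraph above comes from comparing the class predictions to the test-set labels (`y_hat_knn` below is an illustrative name):

```{r}
# class predictions from the k = 31 fit, compared against the test labels
y_hat_knn <- predict(train_knn, mnist_27$test, type = "class")
confusionMatrix(y_hat_knn, mnist_27$test$y)$overall["Accuracy"]
```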

```{r best-knn-fit, echo = FALSE, out.width = "100%"}
p1 <- plot_cond_prob() + ggtitle("True conditional probability")
@@ -210,7 +210,7 @@ Again, this is because the algorithm gives more weight to specificity to account
specificity(data = factor(y_hat_bayes), reference = factor(test_set$sex))
```

-This is due mainly to the fact that $\hat{\pi}$ is substantially less than 0.5, so we tend to predict `Male` more often. It makes sense for a machine learning algorithm to do this in our sample because we do have a higher percentage of males. But if we were to extrapolate this to a general population, our overall accuracy would be affected by the low sensitivity.
+This is mainly due to the fact that $\hat{\pi}$ is substantially less than 0.5, so we tend to predict `Male` more often. It makes sense for a machine learning algorithm to do this in our sample because we do have a higher percentage of males. But if we were to extrapolate this to a general population, our overall accuracy would be affected by the low sensitivity.

The Naive Bayes approach gives us a direct way to correct this since we can simply force $\hat{\pi}$ to be whatever value we want it to be. So to balance specificity and sensitivity, instead of changing the cutoff in the decision rule, we could simply change $\hat{\pi}$ to 0.5 like this:
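
A rough sketch of that adjustment (not necessarily the exact code in the file; `train_set`, `test_set`, and the normal-density estimates below are assumed from the heights example used earlier in the chapter):

```{r}
# sketch: recompute the Naive Bayes estimate, forcing pi to 0.5
params <- train_set |> group_by(sex) |>
  summarize(avg = mean(height), sd = sd(height))  # row 1 = Female, row 2 = Male (alphabetical levels)
x <- test_set$height
f1 <- dnorm(x, params$avg[1], params$sd[1])  # density of heights among females
f0 <- dnorm(x, params$avg[2], params$sd[2])  # density of heights among males
p_hat_bayes_unbiased <- f1*0.5 / (f1*0.5 + f0*(1 - 0.5))
y_hat_bayes_unbiased <- ifelse(p_hat_bayes_unbiased > 0.5, "Female", "Male")
sensitivity(data = factor(y_hat_bayes_unbiased), reference = factor(test_set$sex))
specificity(data = factor(y_hat_bayes_unbiased), reference = factor(test_set$sex))
```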

11 changes: 5 additions & 6 deletions ml/conditionals.qmd
@@ -1,10 +1,10 @@
# Conditional probabilities and expectations

-In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and your bank at times thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, note that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.
+In machine learning applications, we rarely can predict outcomes perfectly. For example, spam detectors often miss emails that are clearly spam, Siri often misunderstands the words we are saying, and sometimes your bank thinks your card was stolen when it was not. The most common reason for not being able to build perfect algorithms is that it is impossible. To see this, consider that most datasets will include groups of observations with the same exact observed values for all predictors, but with different outcomes.

Because our prediction rules are functions, equal inputs (the predictors) implies equal outputs (the predictions). Therefore, for a challenge in which the same predictors are associated with different outcomes across different individual observations, it is impossible to predict correctly for all these cases. We saw a simple example of this in the previous section: for any given height $x$, you will have both males and females that are $x$ inches tall.

-However, none of this means that we can't build useful algorithms that are much better than guessing, and in some cases better than expert opinions. To achieve this in an optimal way, we make use of probabilistic representations of the problem based on the ideas presented in Section @sec-conditional-expectation. Observations with the same observed values for the predictors may not all be the same, but we can assume that they all have the same probability of this class or that class. We will write this idea out mathematically for the case of categorical data.
+However, none of this means that we can't build useful algorithms that are much better than guessing, and in some cases better than expert opinions. To achieve this in an optimal way, we make use of probabilistic representations of the problem based on the ideas presented FIX in Section @sec-conditional-expectation. Observations with the same observed values for the predictors may not all be the same, but we can assume that they all have the same probability of this class or that class. We will write this idea out mathematically for the case of categorical data.

## Conditional probabilities

@@ -31,12 +31,11 @@ $$\hat{Y} = \max_k p_k(\mathbf{x})$$

In machine learning, we refer to this as _Bayes' Rule_. But this is a theoretical rule since, in practice, we don't know $p_k(\mathbf{x}), k=1,\dots,K$. In fact, estimating these conditional probabilities can be thought of as the main challenge of machine learning. The better our probability estimates $\hat{p}_k(\mathbf{x})$, the better our predictor $\hat{Y}$.

-So how well we predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty)
-and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities.
+So how well we predict depends on two things: 1) how close are the $\max_k p_k(\mathbf{x})$ to 1 or 0 (perfect certainty) and 2) how close our estimates $\hat{p}_k(\mathbf{x})$ are to $p_k(\mathbf{x})$. We can't do anything about the first restriction as it is determined by the nature of the problem, so our energy goes into finding ways to best estimate conditional probabilities.

-The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others, our success is restricted by the randomness of the process, with movie recommendations for example.
+The first restriction does imply that we have limits as to how well even the best possible algorithm can perform. You should get used to the idea that while in some challenges we will be able to achieve almost perfect accuracy, with digit readers for example, in others, our success is restricted by the randomness of the process, such as with movie recommendations.

-It is important to remember that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed in @sec-evaluation-metrics, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired.
+Keep in mind that defining our prediction by maximizing the probability is not always optimal in practice and depends on the context. As discussed in @sec-evaluation-metrics, sensitivity and specificity may differ in importance. But even in these cases, having a good estimate of the $p_k(x), k=1,\dots,K$ will suffice for us to build optimal prediction models, since we can control the balance between specificity and sensitivity however we wish. For instance, we can simply change the cutoffs used to predict one outcome or the other. In the plane example, we may ground the plane anytime the probability of malfunction is higher than 1 in a million as opposed to the default 1/2 used when error types are equally undesired.
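
To make the cutoff point concrete, here is a toy sketch with made-up probability estimates: nothing changes in the estimates themselves, only in where we draw the line.

```{r}
p_hat <- c(0.02, 0.20, 0.48, 0.51, 0.90)  # made-up estimates of Pr(malfunction | x)
ifelse(p_hat > 0.5, "ground", "fly")      # default cutoff: both error types weighted equally
ifelse(p_hat > 1e-6, "ground", "fly")     # conservative cutoff: ground at 1 in a million
```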

## Conditional expectations

