quickstart.Rmd

---
title: "Quickstart Guide to Models"
output:
  html_document:
    number_sections: yes
    toc: yes
    toc_depth: 2
    toc_float:
      collapsed: no
---

<style>
h1{font-weight: 400;}
</style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning = FALSE)
set.seed(76)
```


# Unsupervised learning

```{r}
library(tidyverse)
# No outcome variable for unsupervised learning:
iris_predictors <- iris %>% 
  select(-Species)
```


## $k$-means clustering

```{r}
k <- 2

# Fit kmc
k_means_results <- kmeans(iris_predictors, centers=k)

# Assign each of 150 rows to one of k clusters
clusters <- k_means_results$cluster

# Get cluster centers and add cluster number column
cluster_centers <- k_means_results$centers %>% 
  as_tibble() %>% 
  mutate(cluster = 1:k) %>% 
  select(cluster, everything())
```


# Regularization

```{r}
library(tidyverse)
credit <- read_csv("http://www-bcf.usc.edu/~gareth/ISL/Credit.csv") %>% 
  select(-X1)
```

## LASSO

### Fit/train model

```{r}
# Load packages and wrapper function. Note understanding the internal workings
# of this wrapper function is not required. If you're curious why all this is
# needed though: https://github.com/tidyverse/broom/issues/226
library(glmnet)
get_LASSO_coefficients <- function(LASSO_fit){
  coeff_values <- LASSO_fit %>% 
    broom::tidy() %>% 
    as_tibble() %>% 
    select(-c(step, dev.ratio)) %>% 
    tidyr::complete(lambda, nesting(term), fill = list(estimate = 0)) %>% 
    arrange(desc(lambda)) %>% 
    select(term, estimate, lambda)
  return(coeff_values)
}

# 1. Define model formula
model_formula <- as.formula("Balance ~ Income + Limit + Rating + Student + Cards + Age + Education + Married")

# 2. Define "model matrix"
# Note that:
# -This function conveniently converts all categorical outcomes to numerical
# ones using "one-hot encoding" as defined in the flashcard for Lec 3.7.
# -We also remove the first column corresponding to the intercept
predictor_matrix <- model.matrix(model_formula, data = credit)[, -1]

# 3. Define values of tuning/complexity parameter lambda
# Note: we set them to increase at an exponential rate in powers of 10
lambda_inputs <- 10^seq(-2, 10, length = 100)

# 4. Fit the model using glmnet
# Note:
# -Setting alpha=1 corresponds to LASSO, while setting alpha=0 corresponds to
# ridge regression
# -Here we manually specify the lambda values to use. If we didn't set lambda =
# lambda_inputs, glmnet() would choose default values.
LASSO_fit <- glmnet(x=predictor_matrix, y=credit$Balance, alpha = 1, lambda = lambda_inputs)

# 5. Get beta-hat coefficients for ALL values of knob/tuning parameter lambda
LASSO_coefficients <- get_LASSO_coefficients(LASSO_fit)
```


### Coefficients analysis

For each value of the $\lambda$ tuning parameter contained in the `lambda_inputs`
vector we specified above, let's look at all the $\widehat{\beta}$ coefficients:

```{r}
ggplot(LASSO_coefficients, aes(x=lambda, y=estimate, col=term)) +
  geom_line() +
  labs(x="lambda", y="beta-hat coefficient estimate")
```

There are two problems with this plot:

1. We're not interested in the value of the intercept, since this is left out
of complexity penalization. So let's ignore this.
1. The values of $\lambda$ on the x-axis are difficult to see, so let's rescale
the x-axis to be on a $\log10$ scale:

```{r}
plot_LASSO_coefficients <- LASSO_coefficients %>% 
  filter(term != "(Intercept)") %>% 
  ggplot(aes(x=lambda, y=estimate, col=term)) +
  geom_line() +
  scale_x_log10() +
  labs(x="lambda (log10-scale)", y="beta-hat coefficient estimate",
       title="LASSO regularized coefficient for each lambda value")
plot_LASSO_coefficients
```


### Crossvalidation

How do we do crossvalidation to find the optimal value $\lambda^*$ of $\lambda$
that yields the $\widehat{\beta}$'s for the predictive model that in turn yields the
lowest MSE? Easy! Just add `cv.` to previous `glmnet()` call. Let's pull out the

* Value $\lambda^*$ corresponding to the minimal MSE (red line in plot below)
* Value of $\lambda^*_{1SE}$ corresponding to the simplest model within one
standard error of minimal MSE (blue line in plot below)

```{r}
LASSO_CV <- cv.glmnet(x=predictor_matrix, y=credit$Balance, alpha=1, lambda=lambda_inputs)

# Optimal lambdas
lambda_star <- LASSO_CV$lambda.min
lambda_star_1SE <- LASSO_CV$lambda.1se
```

Let's plot the result in base R. Note that here we are plotting $\log(\lambda)$
on the x-axis instead of $\lambda$ on a $\log10$-scale like we did in the earlier
plot "LASSO regularized coefficient for each lambda value".

```{r}
plot(LASSO_CV)
abline(v=log(lambda_star), col="red")
abline(v=log(lambda_star_1SE), col="blue")
```

What does this mean in terms of the values of the $\widehat{\beta}$'s? Let's revisit
the earlier plot "LASSO regularized coefficient for each lambda value" but plot the 
two $\lambda$ values with red and blue lines again.

```{r}
plot_LASSO_coefficients <- plot_LASSO_coefficients +
  geom_vline(xintercept = lambda_star, col="red", alpha=0.4, linetype="dashed") +
  geom_vline(xintercept = lambda_star_1SE, col="blue", alpha=0.4, linetype="dashed")
plot_LASSO_coefficients
```

Let's zoom-in for a closer look:

```{r}
plot_LASSO_coefficients +
  coord_cartesian(xlim=c(1, 1000), ylim=c(-1, 1))
```

Notice the order in which the $\widehat{\beta}$'s drop to 0, from first to last:

1. Education (olive green)
1. Whether or not they are married (smurf blue)
1. Age (red)
1. Number of cards (gold)
1. Income (green)
1. Whether or not they are a student (pink)
1. Credit limit (teal)
1. Rating (light purple)


### Predict outcomes for test data

Let's use the optimal $\lambda^*$ corresponding to the simplest model within one standard error of minimal MSE for our predictions by setting `s=lambda_star_1SE` (BTW `s` is a horrible name for a function argument):

```{r}
y_hat <- predict(LASSO_fit, newx=predictor_matrix, s=lambda_star_1SE) %>% 
  as.vector()
hist(y_hat)
```


```{r, echo=FALSE}
# compute_z_score <- function(x){
#   z <- (x-mean(x))/sd(x)
#   return(z)
# }
# 
# # Convert all numerical variables to z-scores:
# credit_rescaled <- credit %>%
#   mutate_if(is.numeric, compute_z_score)
# 
# # 1. Define model formula
# model_formula <- as.formula("Balance ~ Income + Limit + Rating + Student + Cards + Age + Education + Married")
# # 2. Define "model matrix"
# predictor_matrix <- model.matrix(model_formula, data = credit_rescaled)[, -1]
# # 3. Define values of tuning/complexity parameter lambda
# lambda_inputs <- 10^seq(-2, 10, length = 100)
# # 4. Fit the model using glmnet
# LASSO_fit <- glmnet(x=predictor_matrix, y=credit_rescaled$Balance, alpha=1, lambda=lambda_inputs)
# # 5. Get beta-hat coefficients for ALL values of knob/tuning parameter lambda
# LASSO_coefficients <- get_LASSO_coefficients(LASSO_fit)
# 
# # 1. Run crossvalidation
# LASSO_CV <- cv.glmnet(x=predictor_matrix, y=credit_rescaled$Balance, alpha=1, lambda=lambda_inputs)
# # 2. Get optimal lambdas
# lambda_star <- LASSO_CV$lambda.min
# lambda_star_1SE <- LASSO_CV$lambda.1se
# # 3. Plot optimal 
# plot(LASSO_CV)
# abline(v=log(lambda_star), col="red")
# abline(v=log(lambda_star_1SE), col="blue")
# 
# # A lot going on!
# LASSO_coefficients %>%
#   filter(term != "(Intercept)") %>%
#   ggplot(aes(x=lambda, y=estimate, col=term)) +
#   geom_line() +
#   scale_x_log10() +
#   coord_cartesian(xlim=c(min(lambda_inputs), 1)) +
#   labs(x="lambda (log10-scale)", y="beta-hat estimate (for rescaled numerical predictors)") +
#   geom_vline(xintercept = lambda_star, col="red", alpha=0.4, linetype="dashed") +
#   geom_vline(xintercept = lambda_star_1SE, col="blue", alpha=0.4, linetype="dashed")
```


# Categorical outcomes ($\geq$ 2 levels)

```{r}
library(tidyverse)
iris <- iris %>%
  as_tibble() %>%
  # Add ID column:
  mutate(ID = 1:n()) %>% 
  select(ID, Species, Sepal.Length, Sepal.Width)
```

## Classification and regression trees

### Fit/train model

```{r}
library(rpart)
model_formula <- as.formula(Species ~ Sepal.Length + Sepal.Width)
tree_parameters <- rpart.control(maxdepth = 3)
model_CART <- rpart(model_formula, data = iris, control=tree_parameters)

# Alas there is no broom functionality; this won't work
# model_CART %>% broom::tidy()
# 
# Use these instead, but they are not that helpful IMO
# print(model_CART)
# summary(model_CART)

# Plot
plot(model_CART, margin=0.25)
text(model_CART, use.n = TRUE)
title("Predicting iris species using sepal length & width")
box()
```

<!--
Where "Yes, go left. No, go right." So for example, recall our outcome variable
is a categorical variable with 3 levels: setosa, versicolor, and virginia.

```{r}
iris %>% 
  filter(Sepal.Length < 5.45, Sepal.Width < 2.8) %>% 
  count(Species)
```

and

```{r}
iris %>% 
  filter(Sepal.Length >= 5.45, Sepal.Length < 6.15, Sepal.Width < 3.1) %>% 
  count(Species)
```
-->


### Get fitted probabilities/predictions

**Output 1:** Get fitted probabilities

```{r}
p_hat_matrix <- model_CART %>% 
  predict(type = "prob", newdata = iris)

# Look at a random sample of 5 of them
p_hat_matrix %>% 
  as_tibble() %>% 
  sample_n(5)

# Score/error
MLmetrics::MultiLogLoss(y_true = iris$Species, y_pred = p_hat_matrix)
```

**Output 2:** Get explicit predictions y_hat based on fitted probabilities with
ties broken at random.

```{r, eval=FALSE}
y_hat <- model_CART %>% 
  predict(newdata=iris, type="class")

# Score/error
MLmetrics::Accuracy(y_true = iris$Species, y_pred = y_hat)
MLmetrics::ConfusionMatrix(y_true = iris$Species, y_pred = y_hat)
```
```{r, echo=FALSE}
y_hat <- model_CART %>% 
  predict(newdata=iris, type="class")

# Score/error
MLmetrics::Accuracy(y_true = iris$Species, y_pred = y_hat)
MLmetrics::ConfusionMatrix(y_true = iris$Species, y_pred = y_hat)
```


## $k$-nearest neighbors

### Fit/train model

```{r}
library(caret)
library(MLmetrics)

k <- 3
model_formula <- as.formula(Species ~ Sepal.Length + Sepal.Width)
model_knn <- caret::knn3(model_formula, data=iris, k = k)
```

### Get fitted probabilities/predictions

**Output 1:** Get fitted probabilities

```{r}
p_hat_matrix <- model_knn %>% 
  predict(newdata=iris, type="prob") %>% 
  round(3)

# Look at a random sample of 5 of them
p_hat_matrix %>% 
  as_tibble() %>% 
  sample_n(5)

# Score/error
MLmetrics::MultiLogLoss(y_true = iris$Species, y_pred = p_hat_matrix)
```

**Output 2:** Get explicit predictions y_hat based on fitted probabilities with
ties broken at random.

```{r, eval=FALSE}
y_hat <- model_knn %>% 
  predict(newdata=iris, type="class")

# Score/error
MLmetrics::Accuracy(y_true = iris$Species, y_pred = y_hat)
MLmetrics::ConfusionMatrix(y_true = iris$Species, y_pred = y_hat)
```
```{r, echo=FALSE}
y_hat <- model_knn %>% 
  predict(newdata=iris, type="class")

# Score/error
MLmetrics::Accuracy(y_true = iris$Species, y_pred = y_hat)
MLmetrics::ConfusionMatrix(y_true = iris$Species, y_pred = y_hat)
```


# Binary outcomes (2 levels)

```{r}
library(tidyverse)
library(broom)
library(okcupiddata)

profiles <- profiles %>%
  as_tibble() %>%
  # Create binary outcome variable y:
  mutate(y = ifelse(sex=="f", 1, 0)) %>%
  # Remove heights below 50 inches:
  filter(height>50) %>%
  # Add ID column:
  mutate(ID = 1:n()) %>%
  select(ID, sex, y, height) %>%
  # Remove all rows with NA missing values:
  na.omit()
profiles_train <- profiles %>% 
  sample_frac(0.5)
profiles_test <- profiles %>% 
  anti_join(profiles_train, by="ID")
```

## Logistic regression via `glm`

### Fit/train model

```{r, cache=TRUE}
model_formula <- as.formula(y~height)
model_logistic <- glm(model_formula, data=profiles_train, family="binomial")

# 1.a) Extract regression table in tidy format
model_logistic %>% 
  broom::tidy(conf.int=TRUE)

# 1.b) Extract point-by-point info in tidy format
model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  sample_n(5)

# 1.c) Extract summary stats info in tidy format
model_logistic %>% 
  broom::glance()
```

### Predict outcomes for test data

```{r, cache=TRUE}
# 2. Make predictions on test data
# Method 1:
# -input: profiles_test is a data frame
# -output: log_odds_hat is a vector of log odds
log_odds_hat <- predict(model_logistic, newdata=profiles_test)
p_hat <- 1/(1 + exp(-log_odds_hat))

# Method 2: All new variables start with a period
model_logistic %>% 
  broom::augment(newdata=profiles_test) %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted))) %>% 
  sample_n(5)
```

### Plot

```{r, cache=TRUE}
fitted_model <- model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))
predictions <- model_logistic %>% 
  broom::augment(newdata=profiles_test) %>% 
  mutate(p_hat = 1/(1 + exp(-.fitted)))

# Logistic regression is fitted in log-odds(p) space
ggplot(NULL) +
  geom_line(data=fitted_model, aes(x=height, y=.fitted), col="blue") +
  geom_point(data=predictions, aes(x=height, y=.fitted), col="red") +
  labs(x="height (in inches)", y="Fitted log-odds of p_hat", title="Fitted log-odds of probability of being female vs height")

# Convert back to probability space
ggplot(NULL) +
  # Add observed binary y's, and put a little random jitter to the points
  geom_jitter(data=fitted_model, aes(x=height, y=y), height=0.05, alpha=0.05) +
  geom_line(data=fitted_model, aes(x=height, y=p_hat), col="blue") +
  geom_point(data=predictions, aes(x=height, y=p_hat), col="red") +
  labs(x="height (in inches)", y="p_hat", title="Fitted probability of being female vs height")
```

### ROC curve {#ROC_curve}

```{r}
profiles_train_augmented <- model_logistic %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  mutate(p_hat = 1/(1+exp(-.fitted)))

library(ROCR)
# This bit of code computes the ROC curve
pred <- prediction(predictions = profiles_train_augmented$p_hat, labels = profiles_train_augmented$y)
perf <- performance(pred, "tpr","fpr")

# This bit of code computes the Area Under the Curve
auc <- as.numeric(performance(pred,"auc")@y.values)
auc

# This bit of code prints it
plot(perf, main=paste("Area Under the Curve =", round(auc, 3)))
abline(c(0, 1), lty=2)
```


# Continuous outcomes

```{r}
library(tidyverse)
library(broom)

# Continuous outcome:
mtcars <- mtcars %>% 
  mutate(ID = 1:n()) %>% 
  select(ID, mpg, hp) %>% 
  as_tibble()
mtcars_train <- mtcars %>% 
  sample_frac(0.5)
mtcars_test <- mtcars %>% 
  anti_join(mtcars_train, by="ID")
```


## Regression via `lm`

### Fit/train model

```{r}
model_formula <- as.formula("mpg ~ hp")
model_lm <- lm(model_formula, data=mtcars_train)

# 1.a) Extract regression table in tidy format
model_lm %>% 
  broom::tidy(conf.int=TRUE)

# 1.b) Extract point-by-point info in tidy format
model_lm %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  sample_n(5)

# 1.c) Extract summary stats info in tidy format
model_lm %>% 
  broom::glance()
```

### Predict outcomes for test data

```{r}
# 2. Make predictions on test data
# Method 1:
# -input: mtcars_test is a data frame
# -output: y_hat is a vector
y_hat <- predict(model_lm, newdata=mtcars_test)

# Method 2: All new variables start with a period
model_lm %>% 
  broom::augment(newdata=mtcars_test) %>% 
  as_tibble() %>% 
  sample_n(5)
```

### Plot

```{r}
fitted_model <- model_lm %>% 
  broom::augment() %>% 
  as_tibble()
predictions <- model_lm %>% 
  broom::augment(newdata=mtcars_test)

ggplot(NULL) +
  geom_point(data=fitted_model, aes(x=hp, y=mpg)) +
  geom_line(data=fitted_model, aes(x=hp, y=.fitted), col="blue") +
  geom_point(data=predictions, aes(x=hp, y=.fitted), col="red") +
  labs(x="Horse power", y="Miles per gallon")
```


## LOESS {#loess}

### Fit/train model

```{r}
model_formula <- as.formula("mpg ~ hp")
model_loess <- loess(model_formula, data=mtcars_train, span=0.9)

# 1.a) Extract point-by-point info in tidy format
model_loess %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  sample_n(5)
```

### Predict outcomes for test data

```{r}
# 2. Make predictions on test data
# Method 1:
# -input: mtcars_test is a data frame
# -output: y_hat is a vector
y_hat <- predict(model_loess, newdata=mtcars_test)

# Method 2: All new variables start with a period
model_loess %>% 
  broom::augment(newdata=mtcars_test) %>% 
  sample_n(5)
```

### Plot

```{r}
fitted_model <- model_loess %>% 
  broom::augment() %>% 
  as_tibble()
predictions <- model_loess %>% 
  broom::augment(newdata=mtcars_test) %>% 
  as_tibble()

ggplot(NULL) +
  geom_point(data=fitted_model, aes(x=hp, y=mpg)) +
  geom_line(data=fitted_model, aes(x=hp, y=.fitted), col="blue") +
  geom_point(data=predictions, aes(x=hp, y=.fitted), col="red") +
  labs(x="Horse power", y="Miles per gallon")
```


## Splines {#splines}

### Fit/train model

```{r}
model_spline <- smooth.spline(x=mtcars_train$hp, y=mtcars_train$mpg, df = 4)

# 1.a) Extract point-by-point info in tidy format
model_spline %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  sample_n(5)

# 1.b) Extract summary stats info in tidy format
model_spline %>% 
  broom::glance()
```

### Predict outcomes for test data

```{r}
# 2. Make predictions on test data
# Method 1:
# -input: mtcars_test$hp is a vector
# -output: is a list with two slots: x & y
spline_fitted <- predict(model_spline, x=mtcars_test$hp)

# Convert y_hat to tibble data frame with x, y columns
spline_fitted <- spline_fitted %>% 
  as_tibble() %>% 
  rename(hp = x, .fitted = y)

y_hat <- spline_fitted$.fitted
```

### Plot

```{r}
fitted_model <- model_spline %>% 
  broom::augment() %>% 
  as_tibble() %>% 
  rename(hp = x, mpg = y)
predictions <- mtcars_test %>% 
  mutate(.fitted = y_hat)

ggplot(NULL) +
  geom_point(data=fitted_model, aes(x=hp, y=mpg)) +
  geom_line(data=fitted_model, aes(x=hp, y=.fitted), col="blue") +
  geom_point(data=predictions, aes(x=hp, y=.fitted), col="red") +
  labs(x="Horse power", y="Miles per gallon")
```