model_extensions/1.categorical_gwas.Rmd

---
title: "Categorical GWAS"
author: "Filippo"
date: "5/12/2021"
output: html_document
---

```{r setup, include=FALSE}
library("knitr")
library("readxl")
library("tidyverse")

knitr::opts_chunk$set(echo = TRUE)
```

```{r, label='set_up'}
ext_data_folder = "model_extensions_data"
```

## Categorical traits

1.  unordered categories $\rightarrow$ `multinomial logistic regression`
2.  ordered categories $\rightarrow$ `ordinal logistic regression`

## Multinomial logistic regression

**Multinomial logistic regression** reports the odds of being in the different outcome categories in reference to some base group. If we have three classes (A, B, C) there will be two different sets of regression results corresponding to the following two models (the reference class is arbitrary):

$$
log(\frac{P(Y=B)}{P(Y=A)}) = \beta_0 + \beta_1x_1 + \ldots + \beta_mx_m
$$

$$
log(\frac{P(Y=C)}{P(Y=A)}) = \beta_0 + \beta_1x_1 + \ldots + \beta_mx_m
$$

**Breast Tissue Dataset** (from: <https://archive.ics.uci.edu/ml/datasets/Breast+Tissue>)

-   6 categories: three are grouped together $\rightarrow$ 4 categories:
    -   `adi`: adipose
    -   `car`: carcinoma
    -   `con`: connective
    -   `other`
-   9 features: electrical impedance measurements of freshly excised tissue samples from the breast (all quantitative, continuous)

```{r breast_data}
# odata <- read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
fname = file.path("..", ext_data_folder, "BreastTissue.xls")
tissue <- read_excel(fname, sheet = 2)
tissue <- tissue[, -1] ## remove sequential numbering of records
tissue$Class <- as.factor(tissue$Class) ## convert classes to factor
levels(tissue$Class)[levels(tissue$Class) %in% c("fad", "gla", "mas")] <- "other"
levels(tissue$Class)
```

Below, the barplot of the frequencies of breast tissue categories:

```{r labels, echo = FALSE}
p <- ggplot(tissue, aes(x = Class)) + geom_bar(aes(fill=Class))
p
```

**Descriptive statistics**

```{r descriptives}
dd <- tissue |>
  select(-Class) |>
  gather(key = "variable", value = "value") |>
  group_by(variable) |>
  summarise(across(value, list(mean = mean, std = sd, min = min, max = max),
                   .names = "{fn}"))

kable(dd)
```

### Fit the multinomial logistic regression model

We use the R package `nnet`:

```{r}
library("nnet")

tissue$Class <- relevel(tissue$Class, ref = "other")
levels(tissue$Class)
```

Let's first try to fit a *NULL* model (intercept only)

```{r, results='hide'}
OIM <- multinom(Class ~ 1, data = tissue)
summary(OIM)
```

Let's now fit a **multinomial logistic regression model** with all the variables in the dataset:

```{r}
multinom_mod <- multinom(Class ~ ., data = tissue, model=TRUE)
```

We see that we have different sets of coefficients for each class (the multiple logistic regression models):

```{r}
summary(multinom_mod)
```

The model coefficients can be interpreted as risk ratios (odds) relative to the reference class (for one unit increase of the independent variable):

```{r}
coef_DR_car <- coef(multinom_mod)["car","DR"]
exp(coef_DR_car)
```

The **odds** (**relative risk**) for the tissue of being `carcinomatous` rather than `other` increase by 13.1 with one unit increase in the variable DR.

#### p-values

**Wald test**

$$
W=\frac{(\hat{p}-p_0)^2}{Var(\hat{p})}
$$

-   $p_0$: parameter value under $H_0$
-   $W$ is distributed as a `chi-square` random variable with 1 d.f. (this is equivalent to the square of a $N(0,1)$ random variable)

```{r}
## Wald test
coefs = summary(multinom_mod)$coefficients
std_errs = summary(multinom_mod)$standard.errors
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```

```{r, label='normal_wald'}
## Gaussian equivalence
z <- coefs/std_errs
# 2-tailed Wald z tests to test significance of coefficients
p <- (1 - pnorm(abs(z), 0, 1))*2
p
```

Instead of writing out all the steps on our own, we can use a R package (`AER`) to carry out the Wald test:

```{r, message=FALSE}
library("AER")
coeftest(multinom_mod)
```

**Likelihood ratio test**

As an alternative to the Wald test, we can use the (log)likelihood ratio test:

$$
LR = 2\{L(y,\hat{p}) - L(y,p_0)\}
$$

We fit a reduced multinomial logistic regression model without the variable of interest, in this case the variable `P`:

```{r, results='hide'}
library("lmtest")
## remove variable P from the model 
multinom_mod_red <- multinom(Class ~ ., data = tissue[,-ncol(tissue)], model=TRUE)
lrtest(multinom_mod,multinom_mod_red)
```

```{r, results='hide'}
## alternatively, with the lrtest function we can specify the variable for which we want to run the LRT
lrtest(multinom_mod, "P") 
```

The function `anova` can also be used to compare models (as it is done in linear regression models):

```{r}
anova(multinom_mod,multinom_mod_red)
```

The R package `afex` allows to run multiple model comparisons on all the variables included in the full model:

```{r, message=FALSE}
library("afex")
Anova(multinom_mod, type = "III")
```

### Adding a SNP effect

In the **breast tissue** dataset we don't have any SNP data: we can add a fictitious (randomly sampled) SNP genotype column to the data, and see how we would use a **multinomial logistic regression model** to run a **GWAS** analysis for **unordered (nominal) categorical phenotypes**.

```{r, results='hide'}
tissue$SNP <- sample(c(0,1,2), nrow(tissue), replace = TRUE)
ggplot(tissue, aes(x=SNP)) + geom_bar(aes(fill=as.factor(SNP))) + guides(fill=guide_legend(title="SNP")) +
  facet_wrap(~Class)
```

#### Running the model

```{r, results='hide'}
multinom_snp <- multinom(Class ~ ., data = tissue, model=TRUE)
```

```{r}
summary(multinom_snp)
```

```{r}
## alternatively, with the lrtest function we can specify the variable for which we want to run the LRT
lrtest(multinom_snp, "SNP") 
```

## Ordered logistic regression

Ordinal Logistic Regression is used when there are three or more categories with a natural ordering to the levels (the ranking of the levels doesn't necessarily mean the intervals between them are equal)

Examples of ordinal responses could be:

-   Medical condition (e.g., good, stable, serious, critical)
-   Diseases can be graded on scales from least severe to most severe
-   Survey respondents choose answers on scales from strongly agree to strongly disagree
-   Students are graded on scales from A to F

When response categories are ordered, the logits can utilize the ordering. You can modify the binary logistic regression model to incorporate the ordinal nature of a dependent variable by defining the probabilities differently.

Instead of considering the probability of an individual event, you consider the probabilities of that event and all events that are ordered before it.

A cumulative probability for $Y$ is the probability that $Y$ falls at or below a particular point. For outcome category $j$, the cumulative probability is:

$$
P(Y \leq j) = \pi_1 + \ldots + \pi_j, [j=1, \ldots, J]
$$

where $\pi_j$ is the probability to choose category $j$

Probit and logit models are reasonable choices when the changes in the cumulative probabilities are gradual. If there are abrupt changes, other link functions can be used.

**proportional odds logistic regression**: highlights the proportional odds assumption in our model

-   $y$: ordinal outcome variable with $J$ categories
-   $P(y \leq j)$: cumulative probability of $y$ less (or equal) than a specific category $j$ ($j = 1, \ldots, J-1$)

The **odds** of being lass than or equal to a specific category is:

$$
\frac{P(y \leq j)}{P(y > j)}
$$

for $j = 1, \ldots, J-1$ (since $P>J = 0$, division by 0 is undefined). We can now take the **log(odds)** (the **logit**!):

$$
log\left( \frac{P(y \leq j)}{P(y > j)} \right) = logit(P(y \leq j)) = \beta_{j0} + \beta_1x_1 + \ldots + \beta_px_p
$$

(PS: high values of $\eta = \beta_{j0}+\mathbf{x^T\beta}$ correspond to low values of $y$: this is why sometimes the **proportional odds logistic regression model** is parameterised as $logit(P(y \leq j)) = \beta_{j0}-\mathbf{x^T\beta}$. This is the way the R function `polr` parameterizes the model)

This model has a different intercept term for each category in the output variable, but each explanatory variable has only one single coefficient (fewer terms compared to the multinomial logistic regression model for nominal categorical traits). The intercepts indicate where the latent variable is cut to make the *j* groups that we observe in our data. Note that this latent variable is continuous.

#### Illustration with a Covid-19 dataset

```{r}
## data from : https://zenodo.org/record/4723941#.YJ0xGKIzZhE
fname = file.path("..",ext_data_folder,"Race_clincial_trial_Spreadsheet_de-identified__noagegender_final.xlsx")
covid <- read_excel(fname, sheet = 1)
covid <- covid[,-c(1:3,12,13,15:18,ncol(covid))] ## select some variables
covid <- na.omit(covid)
names(covid)[ncol(covid)] <- "disease_score"

table(covid$disease_score) ## remove 0 and 7, too few
```

We remove categories `0' and`7\` because there are too few observations; then we recode the categories from 1 to 4

```{r}
covid <- covid %>%
  filter(disease_score > 0 & disease_score < 7)

covid <- covid %>% mutate(across(everything(), as.numeric))

## make it "more" ordinal
covid$disease_score = covid$disease_score-2
covid$disease_score <- factor(covid$disease_score, levels = c(1,2,3,4), labels = c("mild","moderate","impaired","severe"))
```

```{r}
p <- ggplot(covid, aes(x = disease_score)) + geom_bar(aes(fill=disease_score))
p
```

### Fit the ordered logistic regression model

We use the `polr` function from the `MASS` package: `polr` stands for **proportional odds logistic regression**

```{r}
library("MASS")
OLRmodel <- polr(disease_score ~ . , data = covid, Hess = TRUE, method = "logistic") ## you can select logistic or probit as method
summary(OLRmodel)
```

For instance, for the mild/moderate categories:

$$
logit(P(y \leq 1)) = 8.33 - 0.14 \cdot \text{highest temp first 24 hrs (F)} - (-0.008 \cdot \text{lowest Systolic BP 1st 24 hrs}) - \\ \ldots -0.598 \cdot \text{Peak O2 requirement in first 24 hrs of admission since time arriving to  ED (see tab 2 for scale)}
$$

By exponentiating model coefficients (on the logit scale), we get the **odds** (risk rations, relative risks) for each variable in the model (odds of belonging to that category for one-unit increase for that variable):

```{r}
coefs <- coef(OLRmodel)
exp(coefs) ## odds
```

#### p-values

As usual, there are multiple ways to get `p-values` from generalised linear models (GLMs):

1.  **Wald test**

```{r}
OLRestimates <- coef(summary(OLRmodel))
# n = nrow(covid)
# k = ncol(covid)-1
# p <- pt(abs(OLRestimates[, "t value"]), lower.tail = FALSE, df = n-k) * 2
# p <- pnorm(abs(OLRestimates[, "t value"]), lower.tail = FALSE) * 2

## Wald test
coefs = OLRestimates[,"Value"]
std_errs = OLRestimates[,"Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
kable(p)
```

```{r}
# coeftest(OLRmodel)
```

2.  **likelihood ratio test**

```{r}
names(covid)[ncol(covid)-1] <- "peak_o2"
covid_reduced <- covid %>% dplyr::select(-peak_o2)
OLRmodel_reduced <- polr(disease_score ~ . , data = covid_reduced, Hess = TRUE, method = "logistic") ## you can select logistic or probit as method

lrtest(OLRmodel,OLRmodel_reduced)
# anova(OLRmodel,OLRmodel_reduced)
```

```{r}
OLRestimates <- coef(summary(OLRmodel))[10,]

## Wald test
coefs = OLRestimates["Value"]
std_errs = OLRestimates["Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```

#### add a SNP effect

```{r}
covid$SNP <- sample(c(0,1,2), nrow(covid), replace = TRUE)
OLRmodel_snp <- polr(disease_score ~ . , data = covid, Hess = TRUE, method = "logistic")
```

```{r}
anova(OLRmodel,OLRmodel_snp)
```

```{r}
library("lmtest")
lrtest(OLRmodel, OLRmodel_snp)
```

Double-check with the the Wald test:

```{r}
OLRestimates <- coef(summary(OLRmodel_snp))["SNP",]

## Wald test
coefs = OLRestimates["Value"]
std_errs = OLRestimates["Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```