-
Notifications
You must be signed in to change notification settings - Fork 13
/
1.categorical_gwas.Rmd
384 lines (278 loc) · 11.9 KB
/
1.categorical_gwas.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
---
title: "Categorical GWAS"
author: "Filippo"
date: "5/12/2021"
output: html_document
---
```{r setup, include=FALSE}
library("knitr")
library("readxl")
library("tidyverse")
knitr::opts_chunk$set(echo = TRUE)
```
```{r, label='set_up'}
ext_data_folder = "model_extensions_data"
```
## Categorical traits
1. unordered categories $\rightarrow$ `multinomial logistic regression`
2. ordered categories $\rightarrow$ `ordinal logistic regression`
## Multinomial logistic regression
**Multinomial logistic regression** reports the odds of being in the different outcome categories in reference to some base group. If we have three classes (A, B, C) there will be two different sets of regression results corresponding to the following two models (the reference class is arbitrary):
$$
log(\frac{P(Y=B)}{P(Y=A)}) = \beta_0 + \beta_1x_1 + \ldots + \beta_mx_m
$$
$$
log(\frac{P(Y=C)}{P(Y=A)}) = \beta_0 + \beta_1x_1 + \ldots + \beta_mx_m
$$
**Breast Tissue Dataset** (from: <https://archive.ics.uci.edu/ml/datasets/Breast+Tissue>)
- 6 categories: three are grouped together $\rightarrow$ 4 categories:
- `adi`: adipose
- `car`: carcinoma
- `con`: connective
- `other`
- 9 features: electrical impedance measurements of freshly excised tissue samples from the breast (all quantitative, continuous)
```{r breast_data}
# odata <- read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")
fname = file.path("..", ext_data_folder, "BreastTissue.xls")
tissue <- read_excel(fname, sheet = 2)
tissue <- tissue[, -1] ## remove sequential numbering of records
tissue$Class <- as.factor(tissue$Class) ## convert classes to factor
levels(tissue$Class)[levels(tissue$Class) %in% c("fad", "gla", "mas")] <- "other"
levels(tissue$Class)
```
Below, the barplot of the frequencies of breast tissue categories:
```{r labels, echo = FALSE}
p <- ggplot(tissue, aes(x = Class)) + geom_bar(aes(fill=Class))
p
```
**Descriptive statistics**
```{r descriptives}
dd <- tissue |>
select(-Class) |>
gather(key = "variable", value = "value") |>
group_by(variable) |>
summarise(across(value, list(mean = mean, std = sd, min = min, max = max),
.names = "{fn}"))
kable(dd)
```
### Fit the multinomial logistic regression model
We use the R package `nnet`:
```{r}
library("nnet")
tissue$Class <- relevel(tissue$Class, ref = "other")
levels(tissue$Class)
```
Let's first try to fit a *NULL* model (intercept only)
```{r, results='hide'}
OIM <- multinom(Class ~ 1, data = tissue)
summary(OIM)
```
Let's now fit a **multinomial logistic regression model** with all the variables in the dataset:
```{r}
multinom_mod <- multinom(Class ~ ., data = tissue, model=TRUE)
```
We see that we have different sets of coefficients for each class (the multiple logistic regression models):
```{r}
summary(multinom_mod)
```
The model coefficients can be interpreted as risk ratios (odds) relative to the reference class (for one unit increase of the independent variable):
```{r}
coef_DR_car <- coef(multinom_mod)["car","DR"]
exp(coef_DR_car)
```
The **odds** (**relative risk**) for the tissue of being `carcinomatous` rather than `other` increase by 13.1 with one unit increase in the variable DR.
#### p-values
**Wald test**
$$
W=\frac{(\hat{p}-p_0)^2}{Var(\hat{p})}
$$
- $p_0$: parameter value under $H_0$
- $W$ is distributed as a `chi-square` random variable with 1 d.f. (this is equivalent to the square of a $N(0,1)$ random variable)
```{r}
## Wald test
coefs = summary(multinom_mod)$coefficients
std_errs = summary(multinom_mod)$standard.errors
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```
```{r, label='normal_wald'}
## Gaussian equivalence
z <- coefs/std_errs
# 2-tailed Wald z tests to test significance of coefficients
p <- (1 - pnorm(abs(z), 0, 1))*2
p
```
Instead of writing out all the steps on our own, we can use a R package (`AER`) to carry out the Wald test:
```{r, message=FALSE}
library("AER")
coeftest(multinom_mod)
```
**Likelihood ratio test**
As an alternative to the Wald test, we can use the (log)likelihood ratio test:
$$
LR = 2\{L(y,\hat{p}) - L(y,p_0)\}
$$
We fit a reduced multinomial logistic regression model without the variable of interest, in this case the variable `P`:
```{r, results='hide'}
library("lmtest")
## remove variable P from the model
multinom_mod_red <- multinom(Class ~ ., data = tissue[,-ncol(tissue)], model=TRUE)
lrtest(multinom_mod,multinom_mod_red)
```
```{r, results='hide'}
## alternatively, with the lrtest function we can specify the variable for which we want to run the LRT
lrtest(multinom_mod, "P")
```
The function `anova` can also be used to compare models (as it is done in linear regression models):
```{r}
anova(multinom_mod,multinom_mod_red)
```
The R package `afex` allows to run multiple model comparisons on all the variables included in the full model:
```{r, message=FALSE}
library("afex")
Anova(multinom_mod, type = "III")
```
### Adding a SNP effect
In the **breast tissue** dataset we don't have any SNP data: we can add a fictitious (randomly sampled) SNP genotype column to the data, and see how we would use a **multinomial logistic regression model** to run a **GWAS** analysis for **unordered (nominal) categorical phenotypes**.
```{r, results='hide'}
tissue$SNP <- sample(c(0,1,2), nrow(tissue), replace = TRUE)
ggplot(tissue, aes(x=SNP)) + geom_bar(aes(fill=as.factor(SNP))) + guides(fill=guide_legend(title="SNP")) +
facet_wrap(~Class)
```
#### Running the model
```{r, results='hide'}
multinom_snp <- multinom(Class ~ ., data = tissue, model=TRUE)
```
```{r}
summary(multinom_snp)
```
```{r}
## alternatively, with the lrtest function we can specify the variable for which we want to run the LRT
lrtest(multinom_snp, "SNP")
```
## Ordered logistic regression
Ordinal Logistic Regression is used when there are three or more categories with a natural ordering to the levels (the ranking of the levels doesn't necessarily mean the intervals between them are equal)
Examples of ordinal responses could be:
- Medical condition (e.g., good, stable, serious, critical)
- Diseases can be graded on scales from least severe to most severe
- Survey respondents choose answers on scales from strongly agree to strongly disagree
- Students are graded on scales from A to F
When response categories are ordered, the logits can utilize the ordering. You can modify the binary logistic regression model to incorporate the ordinal nature of a dependent variable by defining the probabilities differently.
Instead of considering the probability of an individual event, you consider the probabilities of that event and all events that are ordered before it.
A cumulative probability for $Y$ is the probability that $Y$ falls at or below a particular point. For outcome category $j$, the cumulative probability is:
$$
P(Y \leq j) = \pi_1 + \ldots + \pi_j, [j=1, \ldots, J]
$$
where $\pi_j$ is the probability to choose category $j$
Probit and logit models are reasonable choices when the changes in the cumulative probabilities are gradual. If there are abrupt changes, other link functions can be used.
**proportional odds logistic regression**: highlights the proportional odds assumption in our model
- $y$: ordinal outcome variable with $J$ categories
- $P(y \leq j)$: cumulative probability of $y$ less (or equal) than a specific category $j$ ($j = 1, \ldots, J-1$)
The **odds** of being lass than or equal to a specific category is:
$$
\frac{P(y \leq j)}{P(y > j)}
$$
for $j = 1, \ldots, J-1$ (since $P>J = 0$, division by 0 is undefined). We can now take the **log(odds)** (the **logit**!):
$$
log\left( \frac{P(y \leq j)}{P(y > j)} \right) = logit(P(y \leq j)) = \beta_{j0} + \beta_1x_1 + \ldots + \beta_px_p
$$
(PS: high values of $\eta = \beta_{j0}+\mathbf{x^T\beta}$ correspond to low values of $y$: this is why sometimes the **proportional odds logistic regression model** is parameterised as $logit(P(y \leq j)) = \beta_{j0}-\mathbf{x^T\beta}$. This is the way the R function `polr` parameterizes the model)
This model has a different intercept term for each category in the output variable, but each explanatory variable has only one single coefficient (fewer terms compared to the multinomial logistic regression model for nominal categorical traits). The intercepts indicate where the latent variable is cut to make the *j* groups that we observe in our data. Note that this latent variable is continuous.
#### Illustration with a Covid-19 dataset
```{r}
## data from : https://zenodo.org/record/4723941#.YJ0xGKIzZhE
fname = file.path("..",ext_data_folder,"Race_clincial_trial_Spreadsheet_de-identified__noagegender_final.xlsx")
covid <- read_excel(fname, sheet = 1)
covid <- covid[,-c(1:3,12,13,15:18,ncol(covid))] ## select some variables
covid <- na.omit(covid)
names(covid)[ncol(covid)] <- "disease_score"
table(covid$disease_score) ## remove 0 and 7, too few
```
We remove categories `0' and`7\` because there are too few observations; then we recode the categories from 1 to 4
```{r}
covid <- covid %>%
filter(disease_score > 0 & disease_score < 7)
covid <- covid %>% mutate(across(everything(), as.numeric))
## make it "more" ordinal
covid$disease_score = covid$disease_score-2
covid$disease_score <- factor(covid$disease_score, levels = c(1,2,3,4), labels = c("mild","moderate","impaired","severe"))
```
```{r}
p <- ggplot(covid, aes(x = disease_score)) + geom_bar(aes(fill=disease_score))
p
```
### Fit the ordered logistic regression model
We use the `polr` function from the `MASS` package: `polr` stands for **proportional odds logistic regression**
```{r}
library("MASS")
OLRmodel <- polr(disease_score ~ . , data = covid, Hess = TRUE, method = "logistic") ## you can select logistic or probit as method
summary(OLRmodel)
```
For instance, for the mild/moderate categories:
$$
logit(P(y \leq 1)) = 8.33 - 0.14 \cdot \text{highest temp first 24 hrs (F)} - (-0.008 \cdot \text{lowest Systolic BP 1st 24 hrs}) - \\ \ldots -0.598 \cdot \text{Peak O2 requirement in first 24 hrs of admission since time arriving to ED (see tab 2 for scale)}
$$
By exponentiating model coefficients (on the logit scale), we get the **odds** (risk rations, relative risks) for each variable in the model (odds of belonging to that category for one-unit increase for that variable):
```{r}
coefs <- coef(OLRmodel)
exp(coefs) ## odds
```
#### p-values
As usual, there are multiple ways to get `p-values` from generalised linear models (GLMs):
1. **Wald test**
```{r}
OLRestimates <- coef(summary(OLRmodel))
# n = nrow(covid)
# k = ncol(covid)-1
# p <- pt(abs(OLRestimates[, "t value"]), lower.tail = FALSE, df = n-k) * 2
# p <- pnorm(abs(OLRestimates[, "t value"]), lower.tail = FALSE) * 2
## Wald test
coefs = OLRestimates[,"Value"]
std_errs = OLRestimates[,"Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
kable(p)
```
```{r}
# coeftest(OLRmodel)
```
2. **likelihood ratio test**
```{r}
names(covid)[ncol(covid)-1] <- "peak_o2"
covid_reduced <- covid %>% dplyr::select(-peak_o2)
OLRmodel_reduced <- polr(disease_score ~ . , data = covid_reduced, Hess = TRUE, method = "logistic") ## you can select logistic or probit as method
lrtest(OLRmodel,OLRmodel_reduced)
# anova(OLRmodel,OLRmodel_reduced)
```
```{r}
OLRestimates <- coef(summary(OLRmodel))[10,]
## Wald test
coefs = OLRestimates["Value"]
std_errs = OLRestimates["Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```
#### add a SNP effect
```{r}
covid$SNP <- sample(c(0,1,2), nrow(covid), replace = TRUE)
OLRmodel_snp <- polr(disease_score ~ . , data = covid, Hess = TRUE, method = "logistic")
```
```{r}
anova(OLRmodel,OLRmodel_snp)
```
```{r}
library("lmtest")
lrtest(OLRmodel, OLRmodel_snp)
```
Double-check with the the Wald test:
```{r}
OLRestimates <- coef(summary(OLRmodel_snp))["SNP",]
## Wald test
coefs = OLRestimates["Value"]
std_errs = OLRestimates["Std. Error"]
chisq_stats = (coefs^2)/(std_errs^2)
p <- 1-pchisq(q = chisq_stats, df = 1)
print(p)
```