03_Rmd_smartpca.Rmd

---
title: "Principal Components Analysis (PCA)"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r, echo=FALSE}
knitr::opts_chunk$set(message = FALSE)
```

```{r}
library(magrittr)
```

Principal components analysis (PCA) is one of the most useful techniques to visualise genetic diversity in a dataset. The methodology is not restricted to genetic data, but in general allows breaking down high-dimensional datasets to two or more dimensions for visualisation in a two-dimensional space.

## Genotype Data

This lesson is also our first contact with the genotype data used in this and most of the following lessons. The dataset that we will work with contains 1,340 individuals, each represented by 593,124 single nucleotide polymorphisms (SNPs). Those SNPs have exactly two different alleles, and each individual has one of four possible values at each genotype: homozygous reference, heterozygous, homozygous alternative, or missing. Those four values are encoded 2, 1, 0 and 9 respectively.

The data is laid out as a matrix, with columns indicating individuals, and rows indicating SNPs. The data itself comes in the so-called \"EIGENSTRAT\" format, which is defined in the [Eigensoft package](https://github.com/DReichLab/EIG) used by many tools used in this workshop. In this format, a genotype dataset consists of three files, usually with the following file endings:

* `*.snp`: The file containing the SNP positions. It consists of six columns: SNP-name, chromosome, genetic positions, physical position, reference allele, alternative allele.
* `*.ind`: The file containing the names of the individuals. It consists of three columns: Individual Name, Sex (encoded as M(ale), F(emale), or U(nknown)), and population name.
* `*.geno`: The file containing the genotype matrix, with individuals laid out from left to right, and SNP positions laid out from top to bottom.

In the following, we will explore the files using R in this Rmarkdown document.

The data that we want to analyse is stored at `data/popgen_course`. Let's list the contents of that directory:

```{r}
list.files("data/popgen_course/")
```

Let's explore those files a bit. Here are the first 20 individuals:

```{r}
individuals <- readr::read_delim(
   "data/popgen_course/genotypes_small.ind", 
   delim = " ", 
   trim_ws = T,
   col_names = c(
      "name", 
      "sex", 
      "population"
   )
)

individuals %>% head(20)
```

And here the first 20 SNP rows:

```{r}
snps <- readr::read_delim(
   "data/popgen_course/genotypes_small.snp", 
   delim = " ", 
   trim_ws = T,
   col_names = c(
      "SNP_name", 
      "chromosome", 
      "genetic_position", 
      "physical_position", 
      "reference_allele", 
      "alternative_allele"
   )
)
```

And here are the first 20 genotypes of the first 50 individuals:

```{r}
geno <- readr::read_lines(
   "data/popgen_course/genotypes_small.geno",
   n_max = 20
)

geno %>% substr(1, 50)
```

Counting how many individuals and SNPs there are:

```{r}
nrow(individuals)
nrow(snps)
```

And now we check that the first row of the `*.geno` file indeed contains the same number of columns:

```{r}
nchar(geno[1])
```

Now counting the number of rows in the `*.geno`-file (this takes a few seconds, as the file is several hundred MB large):

```{r}
R.utils::countLines("data/popgen_course/genotypes_small.geno") %>% as.integer()
```

Great, the number of rows and columns agrees with the numbers indicated in the `*.ind` and `*.snp` file! Now we're counting how many different populations there are. Let's first see the first 10 populations in the sorted list, alongside the number of individuals in each group:

```{r}
individuals %>% 
   dplyr::group_by(population) %>%
   dplyr::count()
```

## How PCA works

To understand how PCA works, consider a single individual and its representation by its 593,124 markers. Formally, each individual is a point in a 593,124-dimensional space, where each dimension can take only the three possible genotypes indicated above, or have missing data. To visualise this high-dimensional dataset, we would like to project it down to two dimensions. But as there are many ways to project the shadow of a three-dimensional object on a two dimensional plane, there are many (and even more) ways to project a 593,124-dimensional cloud of points to two dimensions. What PCA does is figuring out the \"best\" way to do this project in order to visualise the major components of variance in the data.

For actually running the analysis, we use a software called `smartPCA` from the [Eigensoft package](https://github.com/DReichLab/EIG). As many other tools from this and related packages, `smartPCA` reads in a parameter file which specifies its input and output files and options. In our case, we want the parameter file to have the following content:

```
genotypename: data/popgen_course/genotypes_small.geno
snpname: data/popgen_course/genotypes_small.snp
indivname: data/popgen_course/genotypes_small.ind
evecoutname: pca.WestEurasia.evec
evaloutname: pca.WestEurasia.eval
poplistname: data/popgen_course/WestEurasia.poplist.txt
lsqproject: YES
numoutevec: 4
numthreads: 1
```

Here, the first three parameters specify the input genotype files. The next two rows specify two output file names, typically with ending `*.evec` and `*.eval`. The parameter line beginning with `poplistname` contains a file with a list of populations used for calculating the principal components (see below). The option `lsqproject` is important for applications including ancient DNA with lots of missing data, which I will not elaborate on. For the purpose of this workshop, you should use `lsqproject: YES`. The next option `numoutevec` specifies the number of principal components that we compute, the last option `numthreads` the number of CPUs to use for this run. We use just one since we're working together on the same computer, so cannot afford everyone running on lots of CPUs.

## Population lists vs. Projection

The parameter named `poplistname` is a very crucial one. It specifies the populations whose individuals are used to calculate the principal components. Why not just all of them you ask? For two reasons: First, there are simply too many of them and we don't want to use all of them, since the computation would take too long. More importantly, however, we generally try to avoid using ancient samples to compute principal components, to avoid specific ancient-DNA related artefacts affecting the computation. Finally, the list of populations to use for PCA should be informed by your question. If you're investigating African population structure, in makes no sense to put Asian or European individuals in your population list, since then the main axes of genetic differentiation would not be inside of Africa, but between Africans and Non-Africans.

So what happens to individuals that are not in populations listed in the population list? Well, fortunately, they are not just ignored, but \"projected\". This means that after the principal components have been computed, *all* individuals (not just the one in the list) are projected onto these principal components. That way, we can visualise ancient populations in the context of modern genetic variation. While that may sound a bit problematic at first (Some variation in ancient populations is not represented well by modern populations), but it turns out to be nevertheless one of the most useful tools for this purpose. The advantage of avoiding ancient-DNA artefacts and batch effects to affect the visualisation outweighs the disadvantage of missing some private genetic variation components in the ancient populations themselves. Of course, that argument breaks down once the analysed populations become too ancient and detached from modern genetic variation. But for our purposes it will work just fine.

For this workshop, I prepared two population lists::

```
data/popgen_course/WestEurasia.poplist.txt
data/popgen_course/AllEurasia.poplist.txt
```

As you can tell from the names of the files, they specify two sets of modern populations representing West Eurasia or all of Europe and Asia, respectively.

I recommend to look through both of the population lists and google some population names that you don't recognise to get a feeling for the ethnic groups represented here.

## Running `smartPCA`

Now go ahead and open a new text file using your Jupyter Browser, you can name it anything you like. For the sake of a concrete name, let's call it `pca.WestEurasia.params.txt`. Text files in Jupyter are opened in a text editor, so you can then simply copy-paste the above lines into the new file.

```{r}
readr::write_lines(c(
      "genotypename: data/popgen_course/genotypes_small.geno",
      "snpname: data/popgen_course/genotypes_small.snp",
      "indivname: data/popgen_course/genotypes_small.ind",
      "evecoutname: pca.WestEurasia.evec",
      "evaloutname: pca.WestEurasia.eval",
      "poplistname: data/popgen_course/WestEurasia.poplist.txt",
      "lsqproject: YES",
      "numoutevec: 4",
      "numthreads: 1"
   ),
   path = "pca.WestEurasia.params.txt"
)
```

Let's see whether it worked, by printing out the contents of that file into your notebook:

```{r}
readr::read_lines(
   "pca.WestEurasia.params.txt"
)
```

Great, so that's our parameter file for running `smartPCA`.

**Note:** that we specified two output files in our parameter file, here called `pca.WestEurasia.evec` and `pca.WestEurasia.eval`. You can actually put any names you want in there. But beware of relative vs. absolute paths. File names starting with `/` are considered \"absolute\", that is, taken to go from the root of the file system. In contrast, filenames not starting with `/` are considered \"relative\" to the current working directory. If you forgot which directory you're in, run `pwd`.

**Note:** The option `poplistname` is a crucial one. Here you need to specify which populations are used to compute the eigenvectors of the principal components analysis. In our case, I have prepared two population list files: `data/popgen_course/WestEurasia.poplist.txt` and `data/popgen_course/AllEurasia.poplist.txt`. Pick one of the two to carry on.

Good, now we can run `smartPCA`. To do that, it's more convenient to use the terminal than a Rmarkdown file. So open a terminal and run

```
smartpca -p pca.WestEurasia.params.txt
```

This will typically run for about 30 minutes and output lots of logging output to the screen.

In a similar manner we can prepare a parameter file for the AllEurasia population list. This is how it should look:

```
genotypename: data/popgen_course/genotypes_small.geno
snpname: data/popgen_course/genotypes_small.snp
indivname: data/popgen_course/genotypes_small.ind
evecoutname: pca.AllEurasia.evec
evaloutname: pca.AllEurasia.eval
poplistname: data/popgen_course/AllEurasia.poplist.txt
lsqproject: YES
numoutevec: 4
numthreads: 1
```

And similar to the command above, we can run pca on the AllEurasia population list via:

```
smartpca -p pca.AllEurasia.params.txt
```

which will run slightly longer than the first one because there are more populations.