plot_SnpsInRuns and plot_manhattanRuns plotting all SNPs as 100% #25

Open
victorialindsay opened this issue Mar 16, 2021 · 8 comments

@victorialindsay

victorialindsay commented Mar 16, 2021

I've used the PLINK-generated .hom file as the runs file, and I'm also passing the .ped and .map files that were used to produce the .hom file. However, with both plot_SnpsInRuns and plot_manhattanRuns every SNP is plotted at 100%. The warnings relate to missing data, but I'm not sure how to correct this!

>  plot_SnpsInRuns(runs, genotypeFilePath, mapFilePath, savePlots = TRUE, separatePlots = FALSE, outputName = NULL)
[1] "Chromosome is:  1"
[1] "N. of runs: 784"
[1] "N.of SNP is 46640"
[1] "Chromosome is:  2"
[1] "N. of runs: 482"
[1] "N.of SNP is 40193"
[1] "Chromosome is:  3"
[1] "N. of runs: 513"
[1] "N.of SNP is 21231"
[1] "Chromosome is:  4"
[1] "N. of runs: 489"
[1] "N.of SNP is 19402"
[1] "Chromosome is:  5"
[1] "N. of runs: 349"
[1] "N.of SNP is 17928"
[1] "Chromosome is:  6"
[1] "N. of runs: 442"
[1] "N.of SNP is 34254"
[1] "Chromosome is:  7"
[1] "N. of runs: 444"
[1] "N.of SNP is 17192"
[1] "Chromosome is:  8"
[1] "N. of runs: 383"
[1] "N.of SNP is 16451"
[1] "Chromosome is:  9"
[1] "N. of runs: 340"
[1] "N.of SNP is 14904"
[1] "Chromosome is:  10"
[1] "N. of runs: 301"
[1] "N.of SNP is 15795"
[1] "Chromosome is:  11"
[1] "N. of runs: 267"
[1] "N.of SNP is 11478"
[1] "Chromosome is:  12"
[1] "N. of runs: 139"
[1] "N.of SNP is 6799"
[1] "Chromosome is:  13"
[1] "N. of runs: 132"
[1] "N.of SNP is 6727"
[1] "Chromosome is:  14"
[1] "N. of runs: 354"
[1] "N.of SNP is 16912"
[1] "Chromosome is:  15"
[1] "N. of runs: 375"
[1] "N.of SNP is 16701"
[1] "Chromosome is:  16"
[1] "N. of runs: 255"
[1] "N.of SNP is 16837"
[1] "Chromosome is:  17"
[1] "N. of runs: 328"
[1] "N.of SNP is 14520"
[1] "Chromosome is:  18"
[1] "N. of runs: 348"
[1] "N.of SNP is 14812"
[1] "Chromosome is:  19"
[1] "N. of runs: 227"
[1] "N.of SNP is 11425"
[1] "Chromosome is:  20"
[1] "N. of runs: 217"
[1] "N.of SNP is 27866"
[1] "Chromosome is:  21"
[1] "N. of runs: 201"
[1] "N.of SNP is 10591"
[1] "Chromosome is:  22"
[1] "N. of runs: 229"
[1] "N.of SNP is 9292"
[1] "Chromosome is:  23"
[1] "N. of runs: 194"
[1] "N.of SNP is 9361"
[1] "Chromosome is:  24"
[1] "N. of runs: 171"
[1] "N.of SNP is 8645"
[1] "Chromosome is:  25"
[1] "N. of runs: 155"
[1] "N.of SNP is 7096"
[1] "Chromosome is:  26"
[1] "N. of runs: 156"
[1] "N.of SNP is 7884"
[1] "Chromosome is:  27"
[1] "N. of runs: 121"
[1] "N.of SNP is 7594"
[1] "Chromosome is:  28"
[1] "N. of runs: 174"
[1] "N.of SNP is 8532"
[1] "Chromosome is:  29"
[1] "N. of runs: 90"
[1] "N.of SNP is 6616"
[1] "Chromosome is:  30"
[1] "N. of runs: 143"
[1] "N.of SNP is 5911"
[1] "Chromosome is:  31"
[1] "N. of runs: 52"
[1] "N.of SNP is 3064"
Saving 7 x 7 in image
There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: Removed 2024 row(s) containing missing values (geom_path).
2: Removed 209 row(s) containing missing values (geom_path).
3: Removed 495 row(s) containing missing values (geom_path).
4: Removed 623 row(s) containing missing values (geom_path).
5: Removed 95 row(s) containing missing values (geom_path).
6: Removed 148 row(s) containing missing values (geom_path).
7: Removed 39 row(s) containing missing values (geom_path).
8: Removed 541 row(s) containing missing values (geom_path).
9: Removed 565 row(s) containing missing values (geom_path).
10: Removed 81 row(s) containing missing values (geom_path).
11: Removed 17 row(s) containing missing values (geom_path).
>

Plot produced is attached. I'm assuming the same thing is causing my issue with the Manhattan plots.

SNPinRunsAllChr.pdf

@bunop
Contributor

bunop commented Mar 16, 2021

Hi @victorialindsay,

Thank you for your interest in detectRUNS. Did you import the PLINK output file using readExternalRuns, like this?

runs <- readExternalRuns(inputFile = <plink .hom path>, program = "plink")

If yes, could you post the head of your runs data frame and attach an example of a PLINK .hom output file?

@bunop bunop added the question label Mar 16, 2021
@victorialindsay
Author

Hi @bunop

Thanks for your help.

Yes, I imported it exactly like that. The head is here:

> head(runs)
      group     id chrom nSNP      from        to lengthBps
1 Connemara CCa007     1 1498 102434676 110310186   7875511
2 Connemara CCa007     1  703 110490452 114951135   4460684
3 Connemara CCa007     1 1252 115119521 122991850   7872330
4 Connemara CCa007     1  252 123031771 124389313   1357543
5 Connemara CCa007     1  947 176737081 178886949   2149869
6 Connemara CCa007     2  472 100921777 101993650   1071874

I've attached part of the .hom file here as an example, with FID, IID and phenotype blanked out. GitHub made me convert it to .txt because it wouldn't recognise the extension; the original was a whitespace-delimited file, in case the conversion changed anything.

example.hom.txt

@bunop
Contributor

bunop commented Mar 17, 2021

Hi @victorialindsay ,

Your files seem OK to me at first sight, but I can't reproduce your errors without your input files. I don't know if you can share them with us (or if they are small enough for GitHub), so the best I can do for the moment is try to guess where the problem is. The function plot_SnpsInRuns plots the percentage of your samples (i.e. lines in the ped file) in which each SNP falls inside a run. If I define a fake run covering a SNP for every sample in my PED, I get 100% for that SNP. The warning you get, Removed x row(s) containing missing values (geom_path), is thrown by ggplot when you try to plot something greater than 100%; I can reproduce it by defining multiple runs for the same individual over the same SNPs. So you should check:

  1. You have installed the latest version of detectRUNS (at the moment, CRAN has version 0.9.6). If you can share more information about your R environment, that would be great (the template suggested when opening an issue helps a lot).
  2. You have used the same files with PLINK and detectRUNS (samples must not have been filtered out of the ped).
  3. You don't have issues in the ped or map (samples with the same IDs, duplicate SNPs, ...).
  4. Your files can be read correctly by detectRUNS: they must be space-delimited and chromosomes must be numbers (see issues Deal with tab-separated ped files #10, refactor chromosome input types in snpInsideRunsCpp #11, Allow extra chromosomes in map files #24).
  5. Check your PLINK command: it seems to me that you have a lot of runs. Do you have overlapping runs, or did you create a consensus? This could explain why you get such high values.
  6. Try to calculate ROHs entirely with detectRUNS: the sliding-window approach is inspired by PLINK, and other researchers cite this package when calculating ROHs. If your problem is related to the input files, I suppose you would hit another error, and that would help me understand what is happening. If you are able to calculate and plot runs without problems, then the issue could be in the PLINK command or the PLINK output file.
  7. Search for a public dataset, or a dataset you can share with us, that shows the same problem, and please send us exactly all the steps required to reproduce the errors.
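
Checks 3 and 5 can be approximated directly on the runs data frame. The following is only a minimal sketch in base R, assuming the column names shown in the head(runs) output earlier in this thread; count_overlaps is a hypothetical helper, not a detectRUNS function, and the toy data are invented for illustration.

```r
# Toy runs table using the column names from head(runs) in this thread.
# The values are made up for illustration only.
runs <- data.frame(
  group = "Connemara",
  id    = c("A", "A", "A", "B"),
  chrom = c(1, 1, 2, 1),
  from  = c(100, 150, 100, 100),
  to    = c(200, 300, 200, 200),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of detectRUNS): count runs that start
# before an earlier run of the same individual and chromosome has ended.
count_overlaps <- function(runs) {
  pieces <- split(runs, list(runs$id, runs$chrom), drop = TRUE)
  sum(vapply(pieces, function(p) {
    if (nrow(p) < 2) return(0L)
    p <- p[order(p$from), ]
    # Compare each start with the furthest end seen so far.
    sum(p$from[-1] <= cummax(p$to)[-nrow(p)])
  }, integer(1)))
}

# Exact duplicate runs (same individual, chromosome and coordinates).
n_duplicates <- sum(duplicated(runs[, c("id", "chrom", "from", "to")]))

n_overlaps <- count_overlaps(runs)
print(c(duplicates = n_duplicates, overlaps = n_overlaps))
```

A non-zero count for either value would be consistent with a SNP being counted more than once per individual, which is the situation described above as producing values over 100%.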

Hope this helps

@victorialindsay
Author

Hi @bunop

Thanks for checking them out.

I can't share the entire files at the moment unfortunately.

  1. Yes, it's the latest version.
  2. Yes, same files.
  3. Yes, they seem to be fine
  4. Yes, they're ok (and work for other commands using detectRUNS)
  5. I've got no overlapping runs within an individual, but there will be overlaps within populations
  6. I'm trying to do this now to see if it makes a difference
  7. Will try this if 6 is a no-go

Thanks for your help, I will let you know the results of 6/7.

@victorialindsay
Author

It must be something to do with the PLINK .hom output as when I calculate the ROHs with detectRUNS I can produce these plots just fine. Thank you for your help!

@Integratedhaplotypescore

Hi @victorialindsay ,

I had the same problem when I used the PLINK .hom output.
I now want to perform the ROH analysis in detectRUNS itself, but I could not get the software to find the .ped and .map files. I tried the following:

genotypeFilePath <-"C:/Users/musta/Desktop/ROH/DetectRuns/Data.ped"
genotypeFilePath <- file.path("C:", "Users", "musta", "Desktop", "ROH","DetectRuns","Data.ped" )

Can you show me how the detectRUNS package expects the .ped and .map file paths to be specified?

@bunop
Contributor

bunop commented Nov 7, 2022

Hi @Integratedhaplotypescore ,

Could you be more specific? Which error did you get? Could you check that your .map and .ped files exist and that the path you use is correct?
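
One way to run that check from R is sketched below. It uses only base R; the directory and the file names Data.ped and Data.map are placeholders taken from the paths quoted above, and the temporary directory exists only so that the example runs anywhere.

```r
# Placeholder directory; replace with the folder that holds your data.
data_dir <- tempfile("roh_paths_")
dir.create(data_dir)

# file.path() builds the path with the right separator on any platform.
genotypeFilePath <- file.path(data_dir, "Data.ped")
mapFilePath <- file.path(data_dir, "Data.map")

# Create an empty .ped so the example runs; the .map is left missing
# on purpose to show what a wrong path looks like.
invisible(file.create(genotypeFilePath))

found <- file.exists(c(genotypeFilePath, mapFilePath))
names(found) <- c("ped", "map")
print(found)
```

A FALSE entry means R cannot see a file at that path, so the path should be corrected before calling any detectRUNS function. normalizePath() can help to show what a relative path resolves to from the current working directory.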

@Integratedhaplotypescore

Dear @bunop

I made a mistake in the .ped and .map file paths. I apologize for taking up your time and cluttering this thread. Now, when I run slidingRUNS.run, I get the following error:

Error in slidingRUNS.run(genotypeFile = "Data.ped", mapFile = "Data.map", :
Number of markers differ in mapFile and genotype: are those file the same dataset?

You mentioned in earlier posts that the fix for this error is to convert tabs to spaces, so I will try that. I'm new to R, so I cluttered this thread unintentionally. Thank you very much for your help.
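
For anyone hitting the same message, the marker counts and the delimiter can be inspected before calling slidingRUNS.run. This is only a sketch in base R, assuming the standard PLINK text layout (six leading columns in the .ped followed by two allele columns per marker, and one line per marker in the .map). The tiny files written here are invented so that the example is self-contained.

```r
# Build a tiny tab-separated ped/map pair: two samples, two markers.
ped_file <- tempfile(fileext = ".ped")
map_file <- tempfile(fileext = ".map")
writeLines(c("F1\tS1\t0\t0\t1\t-9\tA\tA\tG\tG",
             "F1\tS2\t0\t0\t2\t-9\tA\tG\tG\tG"), ped_file)
writeLines(c("1\tsnp1\t0\t1000", "1\tsnp2\t0\t2000"), map_file)

# Does the first ped line contain tab characters?
has_tabs <- grepl("\t", readLines(ped_file, n = 1), fixed = TRUE)

# Markers according to the map: one non-empty line per marker.
n_map <- sum(nzchar(trimws(readLines(map_file))))

# Markers according to the ped: (fields - 6 leading columns) / 2 alleles.
# sep = "" makes count.fields split on any whitespace.
n_fields <- count.fields(ped_file, sep = "")
n_ped <- (n_fields[1] - 6) / 2

# Rewrite the ped with single spaces as the only delimiter.
ped_spaces <- gsub("[ \t]+", " ", readLines(ped_file))
space_file <- sub("\\.ped$", "_spaces.ped", ped_file)
writeLines(ped_spaces, space_file)

print(c(map_markers = n_map, ped_markers = n_ped))
```

If the two counts agree here but detectRUNS still reports a mismatch, the delimiter is the likely cause, and the space-delimited copy is the file to pass as genotypeFile.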
