Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in the output *.metrics.tsv file, many metrics are 0/-nan #37

Open
AlvinZuo opened this issue Mar 30, 2020 · 26 comments
Open

in the output *.metrics.tsv file, many metrics are 0/-nan #37

AlvinZuo opened this issue Mar 30, 2020 · 26 comments
Labels

Comments

@AlvinZuo
Copy link

I am using RNASeQC 2.3.5 to evaluate some RNA-seq data. However, in the output *.metrics.tsv file, many metrics are 0/-nan. I paste some lines in my outputs files here. Here is the command I use:
rnaseqc.v2.3.5 gencode.v33.primary_assembly.genes.gtf *bam outdir
I have tried several samples and all appear this problem. Could you please help to solve the problem?

Genes Detected 0
Estimated Library Complexity 0
Genes used in 3' bias 0
Mean 3' bias 0
Median 3' bias 0
3' bias Std 0
3' bias MAD_Std 0
3' Bias, 25th Percentile 0
3' Bias, 75th Percentile 0
Median of Avg Transcript Coverage 0
Median of Transcript Coverage Std 0
Median of Transcript Coverage CV 0
Median Exon CV 0
Exon CV MAD 0

@agraubert
Copy link
Collaborator

agraubert commented Apr 17, 2020

Would it be possible to provide the entire metrics file? This can be very helpful for diagnosing the issue. If not, could you at least show the count of Total Reads, Mapped Reads, and Unique Rate of Mapped

Edit: I sincerely apologize for the late reply

@AlvinZuo
Copy link
Author

Would it be possible to provide the entire metrics file? This can be very helpful for diagnosing the issue. If not, could you at least show the count of Total Reads, Mapped Reads, and Unique Rate of Mapped

Edit: I sincerely apologize for the late reply

Would it be possible to provide the entire metrics file? This can be very helpful for diagnosing the issue. If not, could you at least show the count of Total Reads, Mapped Reads, and Unique Rate of Mapped

Edit: I sincerely apologize for the late reply

Thanks for reply. Here are the contents of the entire metrics file:
Mapping Rate 0.95124
Unique Rate of Mapped 0.200036
Duplicate Rate of Mapped 0.799964
Duplicate Rate of Mapped, excluding Globins 0.799964
Base Mismatch 0
End 1 Mapping Rate 0
End 2 Mapping Rate 0
End 1 Mismatch Rate -nan
End 2 Mismatch Rate -nan
Expression Profiling Efficiency 0.852142
High Quality Rate 0
Exonic Rate 0.895822
Intronic Rate 0.0341465
Intergenic Rate 0.0188213
Intragenic Rate 0.929969
Ambiguous Alignment Rate 0.05121
High Quality Exonic Rate -nan
High Quality Intronic Rate -nan
High Quality Intergenic Rate -nan
High Quality Intragenic Rate -nan
High Quality Ambiguous Alignment Rate -nan
Discard Rate 0
rRNA Rate 0.00152836
Chimeric Alignment Rate 0
End 1 Sense Rate -nan
End 2 Sense Rate -nan
Avg. Splits per Read 0.618619
Alternative Alignments 10438247
Chimeric Reads 0
Duplicate Reads 45657482
End 1 Antisense 0
End 2 Antisense 0
End 1 Bases 0
End 2 Bases 0
End 1 Mapped Reads 0
End 2 Mapped Reads 0
End 1 Mismatches 0
End 2 Mismatches 0
End 1 Sense 0
End 2 Sense 0
Exonic Reads 51128532
Failed Vendor QC 0
High Quality Reads 0
Intergenic Reads 1074215
Intragenic Reads 53077423
Ambiguous Reads 2922779
Intronic Reads 1948891
Low Mapping Quality 9307636
Low Quality Reads 57074417
Mapped Duplicate Reads 45657482
Mapped Reads 57074417
Mapped Unique Reads 11416935
Mismatched Bases 0
Non-Globin Reads 57074417
Non-Globin Duplicate Reads 45657482
Reads excluded from exon counts 0
Reads used for Intron/Exon counts 57074417
rRNA Reads 87230
Total Bases 8341345385
Total Mapped Pairs 0
Total Reads 70438247
Unique Mapping, Vendor QC Passed Reads 60000000
Unpaired Reads 60000000
Read Length 150
Genes Detected 0
Estimated Library Complexity 0
Genes used in 3' bias 0
Mean 3' bias 0
Median 3' bias 0
3' bias Std 0
3' bias MAD_Std 0
3' Bias, 25th Percentile 0
3' Bias, 75th Percentile 0
Median of Avg Transcript Coverage 0
Median of Transcript Coverage Std 0
Median of Transcript Coverage CV 0
Median Exon CV 0
Exon CV MAD 0

@agraubert
Copy link
Collaborator

This seems to indicate that your data was a single-ended library. If that's the case, you need to run rnaseqc using the --unpaired flag. By default, several of the metrics only consider properly paired reads, which is why you see so many zeros/nans in your metrics. With this flag set, unpaired reads will be included in those metrics

@AlvinZuo
Copy link
Author

Actually, my data is from pair-end library.

@agraubert
Copy link
Collaborator

Then something really bad happened during alignment. Rna-seqc is reporting 60 million unpaired reads out of your library of about 70 million reads. However, it doesn't include alternative alignments in the count of unpaired reads, and I'd hazard a guess that those are unpaired too

@AlvinZuo
Copy link
Author

All the 8 samples I analyzed by RNASEQC show 60000000 unpaired reads, which is strange. I paste here total reads number.

1.metrics.tsv:Total Reads 70438247
2.metrics.tsv:Total Reads 69810673
3.metrics.tsv:Total Reads 66390502
4.metrics.tsv:Total Reads 66555270
5.metrics.tsv:Total Reads 71697867
6.metrics.tsv:Total Reads 70002406
7.metrics.tsv:Total Reads 68213216
8.metrics.tsv:Total Reads 67800885

@agraubert
Copy link
Collaborator

That is certainly strange. Would it be possible to share some of the fastqs? Even just a subset of reads would help us understand what is going on.

@AlvinZuo
Copy link
Author

yes, I can share some fqs, how can I share with you?

@agraubert
Copy link
Collaborator

Depending on access control requirements, you could upload them here if the data is safe to be shared 100% publicly, or you could upload them to a cloud storage platform of your choice and grant read access to [email protected] and I will take a look

@ushunmugam
Copy link

I am having the same problem, I am using paired end data and getting below results. the only difference in my gtf/bam file is that I have chr prefix both in gtf & bam will that be a problem?
Genes Detected 0
Estimated Library Complexity 0
Genes used in 3' bias 0
Mean 3' bias 0
Median 3' bias 0
3' bias Std 0
3' bias MAD_Std 0
3' Bias, 25th Percentile 0
3' Bias, 75th Percentile 0
Median of Avg Transcript Coverage 0
Median of Transcript Coverage Std 0
Median of Transcript Coverage CV 0
Median Exon CV 0
Exon CV MAD 0

@agraubert
Copy link
Collaborator

No, in fact we require that the chromosome names in the gtf and bam match. Are you also observing a significant number of unpaired reads?

@ushunmugam
Copy link

Not really. I don't have very high number of unpaired reads.

@kvaldez
Copy link

kvaldez commented Dec 11, 2020

Hello, is there an update on this issue? I'm having the same problem, I'm using paired end reads and I also have a chr prefix in my gtf and bam. I'm getting the below results:

Genes Detected 0
Estimated Library Complexity 0
Genes used in 3' bias 0
Mean 3' bias 0
Median 3' bias 0
3' bias Std 0
3' bias MAD_Std 0
3' Bias, 25th Percentile 0
3' Bias, 75th Percentile 0
Median of Avg Transcript Coverage 0
Median of Transcript Coverage Std 0
Median of Transcript Coverage CV 0
Median Exon CV 0
Exon CV MAD 0

@francois-a
Copy link
Collaborator

Hi, are you able to share example files that reproduce the issue?

@kvaldez
Copy link

kvaldez commented Dec 11, 2020

Thank you for your quick reply. No unfortunately I cannot share the data, I was hoping a resolution to one of the above issues would help me debug. Do you know of any possible causes that I can look into?

@kvaldez
Copy link

kvaldez commented Dec 11, 2020

If it helps, I was able to run an older version of rnaseqc. I attached some of the results here.
report_redact.pdf

@kvaldez
Copy link

kvaldez commented Dec 11, 2020

These are the fields in my gtf, I see that they differ slightly when compared to the example. Is gene_status a required gtf field?

chr1 HAVANA exon 13221 14403 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap"; exon_id "ENSG00000223972.5_4_4; exon_number 4";

@kvaldez
Copy link

kvaldez commented Dec 11, 2020

Hello hello, an update, I tried a couple of things but I am able to get gene counts when I use the legacy mode flag.

@kvaldez
Copy link

kvaldez commented Dec 14, 2020

Greetings, thank you for your assistance earlier. I was able to get rnaseqc to run and detect genes in legacy mode, however I believe I'm missing some metrics in my output, such as End 1 Bases/ End 2 Bases etc. They're reported as 0, but this seems improbable. Do you know what may be the cause of this?

Mapping Rate 0.959287
Unique Rate of Mapped 0.588873
Duplicate Rate of Mapped 0.411127
Duplicate Rate of Mapped, excluding Globins -nan
Base Mismatch 0
End 1 Mapping Rate 0
End 2 Mapping Rate 0
End 1 Mismatch Rate -nan
End 2 Mismatch Rate -nan
Expression Profiling Efficiency 0.827836
High Quality Rate 0.902745
Exonic Rate 0.86297
Intronic Rate 0.108792
Intergenic Rate 0.0282378
Intragenic Rate 0.971762
Ambiguous Alignment Rate 0
High Quality Exonic Rate 0.886898
High Quality Intronic Rate 0.105084
High Quality Intergenic Rate 0.008018
High Quality Intragenic Rate 0.991982
High Quality Ambiguous Alignment Rate 0
Discard Rate 0
rRNA Rate 0.0125571
End 1 Sense Rate 0.014916
End 2 Sense Rate 0.984976
Avg. Splits per Read 0.0129741
Alternative Alignments 21774450
Chimeric Reads 0
Chimeric Alignment Rate 0
Duplicate Reads 0
End 1 Antisense 47871497
End 2 Antisense 745772
End 1 Bases 0
End 2 Bases 0
End 1 Mapped Reads 0
End 2 Mapped Reads 0
End 1 Mismatches 0
End 2 Mismatches 0
End 1 Sense 724862
End 2 Sense 48893279
Exonic Reads 94475150
Failed Vendor QC 0
High Quality Reads 98829551
Intergenic Reads 3091382
Intragenic Reads 106385383
Ambiguous Reads 0
Intronic Reads 11910233
Low Mapping Quality 15293516
Low Quality Reads 10647214
Mapped Duplicate Reads 45008876
Mapped Reads 109476765
Mapped Unique Reads 64467889
Mismatched Bases 0
Non-Globin Reads 0
Non-Globin Duplicate Reads 0
Reads used for Intron/Exon counts 109476765
rRNA Reads 1374715
Total Bases 6375454477
Total Mapped Pairs 54245964
Total Reads 135897517
Unique Mapping, Vendor QC Passed Reads 114123067
Unpaired Reads 0
Read Length 75
Genes Detected 23026
Estimated Library Complexity 0
Genes used in 3' bias 9949
Mean 3' bias 0.377931
Median 3' bias 0.294071
3' bias Std 0.309967
3' bias MAD_Std 0.30531
3' Bias, 25th Percentile 0.111111
3' Bias, 75th Percentile 0.6
Median of Avg Transcript Coverage 1.29973
Median of Transcript Coverage Std 1.94646
Median of Transcript Coverage CV 1.59042
Median Exon CV 0.326226
Exon CV MAD 0.280732

@agraubert
Copy link
Collaborator

Can you provide the command used to generate these results?

@cmatKhan
Copy link

cmatKhan commented Jul 16, 2021

I am having the same issues as above -- I believe that I can share data, though I'll need to discuss with collaborators. I would be interested to know if there has been any progress on the issue?

Edit: I initially thought this had something to do with the annotation file I am using, though nothing sticks out to me yet. However, I ran into another issue with this data set earlier that was caused by a bug in version 2.5.1a of STAR, which is the version that was used to align this data. I haven't yet run rnaseqc on the data after re-aligning it with a recent version of STAR -- I wonder if others were a) using STAR and b) if it was an older version?

@hurrialice
Copy link

Hi @agraubert and @francois-a, I had the same issue with STAR-aligned Illumina Stranded RNA-seq BAM, and I realized it was because I did not change the default mapping-quality (255).

As "lower bound for exon coverage counting" by default is set to 255, nothing appears to be "high quality" since MAPQ is defined to be [0, 2^8-1] in SAM spec. Do you think it makes sense to lower this default to something more reasonable?

@agraubert
Copy link
Collaborator

I think that 255 is still a sensible default. STAR uses 255 for uniquely mapped reads and GMAP defaults to 255 unless it has reason to lower the quality. In my experience, a bam without any 255 quality reads is reason for concern. You can use the -q option to lower RNA-SeQC's MAPQ threshold to something sensible for your particular experiment.

@hurrialice
Copy link

Thanks @agraubert for clarifying !! I don't have access to the original STAR command but I suspect it used --outSAMmapqUnique 50 to ensure downstream compatibility, as the max MAPQ for all files are capped at 50. I did not realize it was not STAR's default. Thanks again!

@agraubert
Copy link
Collaborator

No problem! And thanks for making the mapq observation in the first place. I suspect that might be the same source of the problem for most of the people in this thread

@lintingyi2014
Copy link

Hello,

I keep getting 0/Na for End 1 End 2 mismatch and mapping rates columns, after reading this forum I suspect it has to do with STAR. Would --alignEndsType EndToEnd be a problem?

Thanks in advance

-------- parameters used
STAR --runThreadN 12
--genomeDir /data/HumanRNAProject/Aging_samples/RNA_seq/references/STAR_index
--readFilesIn *.R1.fastq.gz *.R2.fastq.gz
--outFileNamePrefix *
--readFilesCommand zcat
--outSAMtype BAM SortedByCoordinate
--outReadsUnmapped Fastx
--twopassMode Basic
--genomeSAindexNbases 11
--outFilterType BySJout
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--alignIntronMin 20
--alignIntronMax 1000000
--alignMatesGapMax 1000000
--outSAMprimaryFlag AllBestScore
--quantMode TranscriptomeSAM
--alignEndsType EndToEnd

Sample *.starAligned.sortedByCoord.out.bam
Mapping Rate 1
Unique Rate of Mapped 1
Duplicate Rate of Mapped 0
Duplicate Rate of Mapped, excluding Globins 0
Base Mismatch 0
End 1 Mapping Rate 0
End 2 Mapping Rate 0
End 1 Mismatch Rate -nan
End 2 Mismatch Rate -nan
Expression Profiling Efficiency 0.650004
High Quality Rate 0.894536
Exonic Rate 0.650004
Intronic Rate 0.229384
Intergenic Rate 0.0649567
Intragenic Rate 0.879388
Ambiguous Alignment Rate 0.0556554
High Quality Exonic Rate 0.684956
High Quality Intronic Rate 0.221583
High Quality Intergenic Rate 0.0385116
High Quality Intragenic Rate 0.906539
High Quality Ambiguous Alignment Rate 0.0549495
Discard Rate 0
rRNA Rate 0.00898725
End 1 Sense Rate 0.0372744
End 2 Sense Rate 0.963138
Avg. Splits per Read 0.267283
Alternative Alignments 250136
Chimeric Reads 0
Chimeric Alignment Rate 0
Duplicate Reads 0
End 1 Antisense 17435688
End 2 Antisense 668131
End 1 Bases 0
End 2 Bases 0
End 1 Mapped Reads 0
End 2 Mapped Reads 0
End 1 Mismatches 0
End 2 Mismatches 0
End 1 Sense 675067
End 2 Sense 17457179
Exonic Reads 27924539
Failed Vendor QC 0
High Quality Reads 38429770
Intergenic Reads 2790576
Intragenic Reads 37778984
Ambiguous Reads 2390985
Intronic Reads 9854445
Low Mapping Quality 4529623
Low Quality Reads 4530775
Mapped Duplicate Reads 0
Mapped Reads 42960545
Mapped Unique Reads 42960545
Mismatched Bases 0
Non-Globin Reads 42960378
Non-Globin Duplicate Reads 0
Reads used for Intron/Exon counts 42960545
rRNA Reads 386097
Total Bases 5164509151
Total Mapped Pairs 21479632
Total Reads 43210681
Unique Mapping, Vendor QC Passed Reads 42960545
Unpaired Reads 0
Read Length 126
Genes Detected 21706
Estimated Library Complexity 0
Genes used in 3' bias 4994
Mean 3' bias 0.627757
Median 3' bias 0.647059
3' bias Std 0.278434
3' bias MAD_Std 0.335964
3' Bias, 25th Percentile 0.425532
3' Bias, 75th Percentile 0.875
Median of Avg Transcript Coverage 1.24729
Median of Transcript Coverage Std 1.55302
Median of Transcript Coverage CV 1.05653
Median Exon CV 0.32517
Exon CV MAD 0.341731

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants