Ezequiel Calvo-Roitberg, Christine L. Carroll, Sergey V. Venev, GyeungYun Kim, Steven T. Mick, Job Dekker, Ana Fiszbein, Athma A. Pai
Short read sequencing data was processed and analyzed using the HITindex pipeline as outlined in the methods. Briefly, fastq files are mapped to genome assembly GRCh38.p14 for human or GRCm38.95 for mouse. The resulting bam files were used as input for the HITindex pipeline. First, an annotated metaexon file is generated from a gtf file (corresponding to the relevant genome assembly used to create the bam files) by collapsing overlapping consituent exons. Next, the pipeline uses the bam file and annotated metaexon file to to classify metaexons into an exon-type (.exon output) as well as calculating relative usage (PSI) of first and last exons (.AFEPSI and .ALEPSI outputs).The resulting AFEPSI and ALEPSI outputs from the HITindex pipeline are used for the short-read data analysis. While each RNA sequencing sample is handled individually within the HITindex, we load relevant AFEPSI and ALEPSI outputs into R as lists in order to perform bulk analysis across sample groups
The long_read_filtering_and_correlations.R script analyzes and filters long-read sequencing bed files after using bam2bed split files where each feature of each read is one individual row. It outputs Spearman correlations between TSS-TES, filtered reads matching FE and polyA databases or any other databases that want to be checked.
Fastq files from PacBio reads were downloaded from the ENCODE Project data portal and mapped to the hg38 human or mm38 reference genome using minimap2 following the developers’ recommended parameters: -ax splice:hq -uf and -ax splice. Reads were divided into multiple split features to define exons using bedtools bam2bed and assigned to overlapping genes. Only primary alignments and reads with all their features assigned to the same gene were used as input for long_read_filtering_and_correlations.R.
The script conditions on reads whose 5’ end are within 20 nt of an empirically determined first exon identified in the full set of GTEx v8 samples processed by the HITindex pipeline (first_exons_hit_index.bed), and whose 3’ end are within 50nt of an experimentally verified human or mouse PAS in the polyASite 2.0 database (polyA_db.bed). In addition, the script also discards full unspliced reads or reads whose 5’ and 3’ end mapped to the same exon. Filtered reads at each step are stored for further inspection under different variable names specified in the script. After filtering, the script assigns a first exon and polyA peak index per read and calculates one Spearman correlation per gene between the start and end of the filtered reads.