Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

mengxiao · 2021-01-21T17:19:22Z

Based on the MAESTRO paper, my understanding is that a peak that is in the exons or promoter region of more than one gene is scored as 0 for all genes:

In an “enhanced RP model,” [...] if the peak_i is located in the promoter or exon regions of any nearby genes, then W_ij is set to 0 (i.e., the peak is excluded).

If I've understood correctly, W_ij is calculated by RP_AddExonRemovePromoter, which keeps track of which peaks are in exons/promoter regions. However, it doesn't seem to track which ones are in more than one gene. Is it possible that an additional step to filter such peaks has been left out?

taoliu · 2021-01-21T19:04:53Z

@mengxiao According to

MAESTRO/MAESTRO/scATAC_Genescore.py

Lines 233 to 237 in 28e2927

    
           ### remove genes in promoters and exons 
        
           peaks_info_set = [tuple(i) for i in peaks_info] 
        
           peaks_info_inbody_set = [tuple(i) for i in peaks_info_inbody] 
        
           peaks_info_outbody_set = list(set(peaks_info_set)-set(peaks_info_inbody_set)) 
        
           peaks_info_outbody = [list(i) for i in peaks_info_outbody_set]

All the peaks will be classified as either the peaks in any 'gene-bodies' (=exon+promoter)', or the peaks outside of any 'gene-bodies'. If a peak_i is in the first class, it will be excluded from later calculation (L238 and so one) while applying a decay function. The definition of a peak_i 'located in the promoter or exon regions of any nearby genes' doesn't exclude the peak overlapping with the exon/promoter of the given gene_j itself. So my understanding of the current implementation is that if the peak_i is overlapping with the exon of the given gene, it will have a weight of 1 no matter whether it's overlapping with other genes; if the peak_i is not overlapping with the given gene, but it's overlapping with any nearby gene exons, it will have a weight of 0. For me, the confusion is on how the isoforms of a gene model are considered? In the current codes, each isoform is independent while calculating the RPs and they will inevitably influence each other. In brief, we are still looking into this issue.

mengxiao · 2021-01-21T19:47:19Z

Ah interesting. May I ask a bit about the rationale behind that approach? Based on the supplement, it sounded like the enhanced model tries to avoid misattributing peaks to unexpressed neighboring genes. If that's the goal, wouldn't it make sense to exclude any peaks in the 'gene bodies' of multiple genes?

The issue of isoforms might be being addressed here

MAESTRO/MAESTRO/scATAC_Genescore.py

Lines 370 to 383 in 28e2927

    
           score_cells_dict_dedup = {} 
        
           score_cells_dict_max = {} 
        
           for gene in genes: 
        
               score_cells_dict_max[gene] = float("-inf") 
        
           for gene in genes_list: 
        
               symbol = gene.split("@")[0] 
        
               if score_cells_sum_dict[gene] > score_cells_dict_max[symbol]: 
        
                   score_cells_dict_dedup[symbol] = score_cells_dict[gene] 
        
                   score_cells_dict_max[symbol] = score_cells_sum_dict[gene] 
        
           gene_symbol = sorted(score_cells_dict_dedup.keys()) 
        
           matrix_row = [] 
        
           for gene in gene_symbol: 
        
               matrix_row.append(score_cells_dict_dedup[gene])

I think this chooses the isoform with highest total RP across all cells as the one to represent a gene. A possible wrinkle to that comes in parsing the gene info, where I think if there are multiple isoforms for a gene that share the first and last exonic bases (but possibly differ e.g. in middle exons?), only the first one in the annotation is kept:

MAESTRO/MAESTRO/scATAC_Genescore.py

Line 87 in 28e2927

bed = bed.drop_duplicates(subset='uid', keep="first")

mengxiao · 2021-01-21T21:07:53Z

I was thinking a bit more about this- is the reason for handling overlaps this way that one doesn't expect many instances of peaks that are in promoter/exons of multiple genes? Dropping the promoter/exon regions is explicitly done for the exponential decay component of the model because that is much more likely to involve overlap between genes?

crazyhottommy added the question Further information is requested label Feb 2, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

mengxiao commented Jan 21, 2021

taoliu commented Jan 21, 2021

mengxiao commented Jan 21, 2021

mengxiao commented Jan 21, 2021

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

Comments

mengxiao commented Jan 21, 2021

taoliu commented Jan 21, 2021

mengxiao commented Jan 21, 2021

mengxiao commented Jan 21, 2021