Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does enhanced RP gene score drop peaks in exons/promoter regions of multiple genes? #107

Open
mengxiao opened this issue Jan 21, 2021 · 3 comments
Labels
question Further information is requested

Comments

@mengxiao
Copy link
Contributor

Based on the MAESTRO paper, my understanding is that a peak that is in the exons or promoter region of more than one gene is scored as 0 for all genes:

In an “enhanced RP model,” [...] if the peaki is located in the promoter or exon regions of any nearby genes, then Wij is set to 0 (i.e., the peak is excluded).

If I've understood correctly, Wij is calculated by RP_AddExonRemovePromoter, which keeps track of which peaks are in exons/promoter regions. However, it doesn't seem to track which ones are in more than one gene. Is it possible that an additional step to filter such peaks has been left out?

@taoliu
Copy link
Collaborator

taoliu commented Jan 21, 2021

@mengxiao According to

### remove genes in promoters and exons
peaks_info_set = [tuple(i) for i in peaks_info]
peaks_info_inbody_set = [tuple(i) for i in peaks_info_inbody]
peaks_info_outbody_set = list(set(peaks_info_set)-set(peaks_info_inbody_set))
peaks_info_outbody = [list(i) for i in peaks_info_outbody_set]

All the peaks will be classified as either the peaks in any 'gene-bodies' (=exon+promoter)', or the peaks outside of any 'gene-bodies'. If a peak_i is in the first class, it will be excluded from later calculation (L238 and so one) while applying a decay function. The definition of a peak_i 'located in the promoter or exon regions of any nearby genes' doesn't exclude the peak overlapping with the exon/promoter of the given gene_j itself. So my understanding of the current implementation is that if the peak_i is overlapping with the exon of the given gene, it will have a weight of 1 no matter whether it's overlapping with other genes; if the peak_i is not overlapping with the given gene, but it's overlapping with any nearby gene exons, it will have a weight of 0. For me, the confusion is on how the isoforms of a gene model are considered? In the current codes, each isoform is independent while calculating the RPs and they will inevitably influence each other. In brief, we are still looking into this issue.

@mengxiao
Copy link
Contributor Author

Ah interesting. May I ask a bit about the rationale behind that approach? Based on the supplement, it sounded like the enhanced model tries to avoid misattributing peaks to unexpressed neighboring genes. If that's the goal, wouldn't it make sense to exclude any peaks in the 'gene bodies' of multiple genes?

The issue of isoforms might be being addressed here

score_cells_dict_dedup = {}
score_cells_dict_max = {}
for gene in genes:
score_cells_dict_max[gene] = float("-inf")
for gene in genes_list:
symbol = gene.split("@")[0]
if score_cells_sum_dict[gene] > score_cells_dict_max[symbol]:
score_cells_dict_dedup[symbol] = score_cells_dict[gene]
score_cells_dict_max[symbol] = score_cells_sum_dict[gene]
gene_symbol = sorted(score_cells_dict_dedup.keys())
matrix_row = []
for gene in gene_symbol:
matrix_row.append(score_cells_dict_dedup[gene])
I think this chooses the isoform with highest total RP across all cells as the one to represent a gene. A possible wrinkle to that comes in parsing the gene info, where I think if there are multiple isoforms for a gene that share the first and last exonic bases (but possibly differ e.g. in middle exons?), only the first one in the annotation is kept:
bed = bed.drop_duplicates(subset='uid', keep="first")

@mengxiao
Copy link
Contributor Author

I was thinking a bit more about this- is the reason for handling overlaps this way that one doesn't expect many instances of peaks that are in promoter/exons of multiple genes? Dropping the promoter/exon regions is explicitly done for the exponential decay component of the model because that is much more likely to involve overlap between genes?

@crazyhottommy crazyhottommy added the question Further information is requested label Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants