Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN in coefficient when using model with several covariates (but FDR < 1) #135

Closed
fbrundu opened this issue Jun 15, 2020 · 9 comments
Closed

Comments

@fbrundu
Copy link

fbrundu commented Jun 15, 2020

Hi! Thanks for your work on this tool.
I have an issue computing differential gene expression in a model with several covariates.
The model is the following:

~ group + n_genes + pair + percent_mito + percent_ribo_p

where group is the condition that I want to test (i.e. Case or Control). n_genes is the number of detected genes, pair is a categorical label that indicates the batch, percent_mito and percent_ribo_p are respectively the percentages of mitochondrial RNA and of the ribosomal proteins RNA in each cell.

When I analyze the DEG results, I notice that some genes have NAs in place of coefficient, but with FDR < 1. For example:

primerid component contrast Pr(>Chisq) ci.hi ci.lo coef z
CHCHD2 C groupControl 1 NA NA NA NA
CHCHD2 C (Intercept) NA 13.2139294612287 7.27487862856727 10.244404044898 6.76157303138578
CHCHD2 C n_genes NA 0.00026628978864847 -0.000415653246692049 -7.46817290217894E-05 -0.42928365449996
CHCHD2 C pair NA 0.603598745441239 -1.44191779536421 -0.419159524961484 -0.803256836415448
CHCHD2 C percent_mito NA 26.8917686666073 -24.2948918010666 1.29843843277039 0.0994357725673384
CHCHD2 C percent_ribo_p NA 7.78446426292038 -2.99466694859656 2.39489865716191 0.870926426731957
CHCHD2 D groupControl 2.187508828583E-11 6.92093514687845 2.50218975637947 4.71156245162896 4.17968988933359
CHCHD2 D (Intercept) NA -5.45262320626328 -12.9538395476205 -9.20323137694191 -4.80935390191168
CHCHD2 D n_genes NA 0.000605309648438176 -0.000322803454311596 0.00014125309706329 0.596588890143987
CHCHD2 D pair NA 2.63157523052559 0.813623683086202 1.72259945680589 3.7143266000497
CHCHD2 D percent_mito NA 6.69954331431088 -48.5358111642693 -20.9181339249792 -1.4845125736504
CHCHD2 D percent_ribo_p NA 11.8916501312325 -5.06608244072294 3.41278384525477 0.788894788302249
CHCHD2 H groupControl 2.187508828583E-11 NA NA NA NA
CHCHD2 S groupControl NA NA NA NA NA
CHCHD2 S (Intercept) NA NA NA NA 1.38042738481323
CHCHD2 S n_genes NA NA NA NA 0.118302666651904
CHCHD2 S pair NA NA NA NA 2.0584371703729
CHCHD2 S percent_mito NA NA NA NA -0.979397198510006
CHCHD2 S percent_ribo_p NA NA NA NA 1.17367083670798
CHCHD2 logFC groupControl NA NA NA NA NA
CHCHD2 logFC n_genes NA 7.28238715372072E-07 -4.51844845269732E-07 1.3819693505117E-07 0.459053959410812
CHCHD2 logFC pair NA 0.0192958459598146 -0.0102844516875691 0.00450569713612275 0.597086900024951
CHCHD2 logFC percent_mito NA 0.00289076102603723 -0.00495405141142126 -0.00103164519269202 -0.515496689977027
CHCHD2 logFC percent_ribo_p NA 0.323601607126519 -0.248633415386968 0.0374840958697755 0.256773790513932

On the other hand, one gene with FDR < 1 but coefficient different from NA is the following:

primerid component contrast Pr(>Chisq) ci.hi ci.lo coef z
COMT C groupControl 0.85483797977702 2.91033719012339 -1.98702041381175 0.461658388155819 0.369519192643462
COMT C (Intercept) NA 10.7859291790623 3.55357927507236 7.16975422706733 3.88600116133854
COMT C n_genes NA 0.000338013386848622 -0.000638047565651715 -0.000150017089401546 -0.602478956953161
COMT C pair NA 0.763112385689135 -0.618137024899985 0.0724876803945752 0.20571700057504
COMT C percent_mito NA 39.1117813733424 -8.40630622779144 15.3527375727755 1.26649931534773
COMT C percent_ribo_p NA 6.32602519941846 -11.8236927561382 -2.74883377835986 -0.593685832282923
COMT D groupControl 8.72957003792791E-05 4.89103889665209 1.16019280877743 3.02561585271476 3.17895617385414
COMT D (Intercept) NA -4.01790206062373 -10.9537976822741 -7.48584987144891 -4.23074306248644
COMT D n_genes NA 0.00127192998088608 0.000307072973360384 0.000789501477123231 3.20751043695249
COMT D pair NA 0.816996259468842 -0.831152113252747 -0.00707792689195255 -0.0168340205566892
COMT D percent_mito NA 56.7523833573714 4.82638755934986 30.7893854583606 2.32431119238381
COMT D percent_ribo_p NA 10.1403022854827 -8.9881881848373 0.576057050322684 0.118049155360658
COMT H groupControl 0.000446798779218395 NA NA NA NA
COMT S groupControl NA NA NA NA 2.50915099452388
COMT S (Intercept) NA NA NA NA -0.243769336060817
COMT S n_genes NA NA NA NA 1.84203542471195
COMT S pair NA NA NA NA 0.133560436021699
COMT S percent_mito NA NA NA NA 2.53908645997288
COMT S percent_ribo_p NA NA NA NA -0.336325919632768
COMT logFC groupControl NA 0.358433760158453 -0.19205639243217 0.0831886838631414 0.592369631048775
COMT logFC n_genes NA 1.30767850726043E-05 -6.89901568399462E-06 3.08888469430484E-06 0.606143686253434
COMT logFC pair NA 0.00351857412776664 -0.00349453670414979 1.2018711808429E-05 0.00671777271161418
COMT logFC percent_mito NA 45.7679197278209 -0.730975601504273 22.5184720631583 1.89834162373432
COMT logFC percent_ribo_p NA 0.0549897837974593 -0.0542142072425823 0.000387788277438516 0.0139198402946213

I am not sure how to interpret genes with FDR < 0.01 (for example) but no coefficient.
Is this an issue, or how can it be interpreted? I was also reading #98 but I'm not sure how to adapt the reply to that issue to my data.

I created a small dataset (n=132) of cells in which this behavior appears, that I can share privately if necessary. Please let me know if you need other information.

Thanks

@lijiaxiaoxiong
Copy link

I also encounter the same problem. I found many genes with NA logFC are also with small FDR, and they could be interesting study objects. Any help will be appreciated!

@gfinak
Copy link
Member

gfinak commented Jun 18, 2020

NA is missing, NaN is "Not a Number". They are not the same.
You have inestimable coefficients for some genes because there's not enough data to fit the proposed model.
Do you filter genes that have low expression frequency across cells?
Have you plotted your data vs the predictors for one of these genes and compared it to a gene that has all coefficients properly estimated?
Maybe post your test data set, include genes that do and don't show this problem.

@fbrundu
Copy link
Author

fbrundu commented Jun 18, 2020

Hi @gfinak, thanks for your reply.
The NaN for the coefficient (i.e. coef) gets translated to NA when saving the results to file, that's why it appears as NA here. But in the original results is NaN, hence the title.
Regarding the low expression, how do you define the threshold for expression frequency across cells? I.e. which threshold is necessary for MAST to correctly estimate DE?
In the previous example, COMT - correctly estimated - is expressed in 12 cells, CHCHD2 (NaN coef) is instead detected in 22 cells.
I am not sure how to plot the data against the predictors, is it included in the tutorial?
I can send you a minimal example dataset privately, is there an email address I can use to send the link to the rds file?

Thanks

@fbrundu
Copy link
Author

fbrundu commented Jun 18, 2020

I possibly spotted the issue. Assuming testing by condition, I computed the number of counts for each condition, for each gene (correctly estimated and not).
The genes not estimated have one condition with zero counts, while the others have positive counts on both conditions. That's possibly why the coefficient cannot be estimated.

@gfinak
Copy link
Member

gfinak commented Jun 18, 2020

Yes, that's precisely why.
It is recommended to do some pre-filtering by removing genes that are expressed in fewer than 10% of cells i.e. if M is your matrix of counts and rows are genes and columns are cells, keeps genes where rowMeans(M>0) > 0.1
How many total cells are in your experiment?

@fbrundu
Copy link
Author

fbrundu commented Jun 18, 2020

Ok, thanks! Just asking: is there a way that you know to define the threshold more accurately? I imagine that if one population contributes to less than 10% of total number of cells and the markers are highly specific, we might lose those markers, even if we have enough cells to compute DE.
The experiment has a total of 30k cells, however, I evaluate the condition in each cell subtype (in this case n=132), usually ranging from 100 to 5k cells. I use a very loose threshold on the genes I test (minimum of n=3 cells) because I would like not to filter too much beforehand without a clear rationale.

@gfinak
Copy link
Member

gfinak commented Jun 18, 2020

Think about statistical power. How much power do you have to detect a difference with a sample size of 3? Especially after you adjust for multiple testing. We chose 10% because it's an empirical lower limit for the discrete part of the test.

@fbrundu
Copy link
Author

fbrundu commented Jun 18, 2020

Ok thanks, it is clear.

@fbrundu fbrundu closed this as completed Jun 18, 2020
@fbrundu
Copy link
Author

fbrundu commented Jun 18, 2020

For future reference:

If I read it correctly, such genes where the continuous component cannot be estimated should be dropped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants