Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Would it be possible to add a flag that limits the number of sequences returned by Jackhmmer/Nhmmer/Phmmer/Hmmsearch?

When HMMER is used in the context of protein structure prediction, we usually care only about the top n hits (usually ~5k) sorted by e-value; the rest are discarded. Aligning them, writing them to stdout, and saving them to disk is therefore wasted work, since we read back just the top n and discard the rest. Moreover, queries with an extreme number of hits lead to unnecessarily long run times and need a lot of RAM to store the results.

We added our own limited and brittle workaround for this in Jackhmmer, but it only affects the output MSA and doesn't address the unnecessary computation that happens before it.

A better and more robust solution would probably be to modify P7_TOPHITS to have a max_depth attribute that doesn't allow the top hits list to grow beyond that number.
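To make the proposal concrete, here is a minimal sketch of what a depth cap might look like; p7_tophits_Truncate and max_depth are hypothetical names that do not exist in HMMER, while the P7_TOPHITS fields and p7_tophits_SortBySortkey() follow HMMER3's p7_tophits.h:

/* Hypothetical sketch only -- not part of HMMER. Sort the hit list
 * best-first, then drop everything past the cap by shrinking the
 * logical count. */
#include "hmmer.h"

int
p7_tophits_Truncate(P7_TOPHITS *th, uint64_t max_depth)
{
  if (th->N <= max_depth) return eslOK;   /* already under the cap  */

  p7_tophits_SortBySortkey(th);           /* best hits first        */
  th->N = max_depth;                      /* keep only the top hits */
  return eslOK;
}

A real implementation would also have to free the internals of the dropped hits, and ideally enforce the cap as hits are added during the search, so that memory stays bounded rather than just the output.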
Comments
It's a high priority for us to make better ways of dealing with tons of hits and large alignments. It's unlikely that we'll do something like first-N hits, though, because it makes search results dependent on the order of sequences in the target db. And if you want the top-N, not just the first-N, that doesn't save you computation, only space - you should be able to achieve a top-N strategy now by filtering or downsampling an output tabular data file any way you want, and immediately discarding the original large output file. We're working on other downsampling methods, and also on highly compressed alignment representations.
Oh sorry I wasn't clear: I meant top-N sorted by e-value, not first-N. Thinking of
Yes, good point, you have to align to the query and sort anyway. But it would still be useful if this functionality were built in, as multi-GB MSA files can be produced, and saving those can take a non-trivial amount of time, especially on remote filesystems. We observed this with a sequence that has over 1M hits in UniRef90 with an e-value limit of 0.001, producing a 45 GB Stockholm MSA and a 256 MB tblout file.
One way you can go is to generate the tblout file (and not the full MSA); filter or downsample the lines in the tblout file; fetch those subsequences (
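A hedged sketch of that workflow in shell (illustrative only: it assumes the full-sequence E-value is column 5 of jackhmmer's --tblout output, uses Easel's esl-sfetch, and picks an arbitrary top-5000 cutoff; esl-sfetch can also fetch subsequences by coordinates, which is closer to what the comment suggests):

# Run the search with tabular output only, no MSA.
$ jackhmmer -o /dev/null --tblout tblout.txt --noali -N 1 -E 0.001 --incE 0.001 \
    query.fasta uniref90.fasta

# Keep the names of the 5000 best hits by full-sequence E-value (column 5).
$ grep -v '^#' tblout.txt | sort -g -k5,5 | head -n 5000 | awk '{print $1}' > top_hits.txt

# Fetch those sequences from the target database (requires an SSI index).
$ esl-sfetch --index uniref90.fasta
$ esl-sfetch -f uniref90.fasta top_hits.txt > top_hits.fasta
# ...then realign top_hits.fasta to the query and discard the large files.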
Thanks, that is an interesting thing to try. However, even with
Here are timings with and without -A.

Without -A:

$ jackhmmer \
-o /dev/null --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta
real 18m26.775s
user 332m53.777s
sys 3m47.448s

With -A:

$ jackhmmer \
-o /dev/null -A out.a3m --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta
real 20m1.193s
user 332m53.121s
sys 4m5.238s

I am testing with this sequence:

> query
PNPCPESSASFLSRITFWWITGMMVQGYRQPLESTDLWSLNKEDTSEQVVPVLVKNWKKECAKSRKQPVKIVYSSKDPAK
PKGSSKVDVNEEAEALIVKCPQKERDPSLFKVLYKTFGPYFLMSFLFKAVHDLMMFAGPEILKLLINFVNDKKAPEWQGY
FYTALLFISACLQTLVLHQYFHICFVSGMRIKTAVIGAVYRKALVITNAARKSSTVGEIVNLMSVDAQRFMDLATYINMI
WSAPLQVILALYLLWLNLGPSVLAGVAVMVLMVPLNAVMAMKTKTYQVAHMKSKDNRIKLMNEILNGIKVLKLYAWELAF
KDKVLAIRQEELKVLKKSAYLAAVGTFTWVCTPFLVALSTFAVYVTVDENNILDAQKAFVSLALFNILRFPLNILPMVIS
SIVQASVSLKRLRVFLSHEDLDPDSIQRRPIKDAGATNSITVKNATFTWARNDPPTLHGITFSVPEGSLVAVVGQVGCGK
SSLLSALLAEMDKVEGHVTVKGSVAYVPQQAWIQNISLRENILFGRQLQERYYKAVVEACALLPDLEILPSGDRTEIGEK
GVNLSGGQKQRVSLARAVYCDSDVYLLDDPLSAVDAHVGKHIFENVIGPKGLLKNKTRLLVTHAISYLPQMDVIIVMSGG
KISEMGSYQELLARDGAFAEFLRTYASAEQEQGQPEDGLAGVGGPGKEVKQMENGMLVTDTAGKQMQRQLSSSSSYSRDV
SQHHTSTAELRKPGPTEETWKLVEADKAQTGQVKLSVYWDYMKAIGLFISFLSIFLFLCNHVASLVSNYWLSLWTDDPIV
NGTQEHTQVRLSVYGALGISQGITVFGYSMAVSIGGIFASRRLHLDLLHNVLRSPISFFERTPSGNLVNRFSKELDTVDS
MIPQVIKMFMGSLFNVIGACIIILLATPMAAVIIPPLGLIYFFVQRFYVASSRQLKRLESVSRSPVYSHFNETLLGVSVI
RAFEEQERFIRQSDLKVDENQKAYYPSIVANRWLAVRLECVGNCIVLFASLFAVISRHSLSAGLVGLSVSYSLQVTTYLN
WLVRMSSEMETNIVAVERLKEYSETEKEAPWQIQDMAPPKDWPQVGRVEFRDYGLRYREDLDLVLKHINVTIDGGEKVGI
VGRTGAGKSSLTLGLFRIKESAEGEIIIDDINIAKIGLHDLRFKITIIPQDPVLFSGSLRMNLDPFSQYSDEEVWTSLEL
AHLKGFVSALPDKLNHECAEGGENLSVGQRQLVCLARALLRKTKILVLDEATAAVDLETDDLIQSTIRTQFDDCTVLTIA
HRLNTIMDYTRVIVLDKGEIQEWGSPSDLLQQRGLFYSMAKDSGLV
(Pleasantly surprised to see the multithreaded parallelization scaling well to 22 cores; normally jackhmmer starts becoming input-bound very quickly, even at 2 cores or so. You must have some fast storage.)
A few more observations. We have one further patch to output a3m instead of Stockholm from Jackhmmer:

-if (textw > 0) esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+// Write the output as A2M unless a '.sto' file extension is provided.
+if (textw > 0) {
+ if (afp_ext && strcmp(afp_ext, "sto") == 0) {
+ esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+ } else if (strcmp(esl_opt_GetString(go, "-A"), "/dev/stdout") == 0 ||
+ (afp_ext && (strcmp(afp_ext, "a3m") == 0 || strcmp(afp_ext, "a2m") == 0))) {
+ esl_msafile_Write(afp, msa, eslMSAFILE_A2M);
+ } else {
+ p7_Fail("Bad output MSA format %s, only a2m/a3m and sto are supported\n", afp_ext);
+ }
+}
else esl_msafile_Write(afp, msa, eslMSAFILE_PFAM);

This patch reduces the output MSA size from 45 GB (Stockholm) to 1.6 GB (a3m) and slightly reduces the runtime to:
This patch together with the --seq_limit patch reduces the runtime further still.
It therefore seems to me that there must be some non-trivial amount of work that is eliminated by applying the naive sequence limit patch.
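As I read the patch above, the -A output format is chosen from the filename extension; hypothetically (these invocations are illustrative, not from the thread):

$ jackhmmer -A out.a3m ...        # A2M output
$ jackhmmer -A out.sto ...        # Stockholm output
$ jackhmmer -A /dev/stdout ...    # A2M to stdout
$ jackhmmer -A out.msa ...        # fails: "Bad output MSA format"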
I did a bit of profiling and here is the profile for the offending query with --cpu 1 (and with our --seq_limit and a3m-output patches applied):

(profile image: https://github.com/EddyRivasLab/hmmer/assets/6466058/3ab3902c-fce0-46ed-b0f0-55d2d9ecde11)

For comparison, here is the profile for a query that runs 2.5x faster because it doesn't have that many hits:

(profile image: https://github.com/EddyRivasLab/hmmer/assets/6466058/b3b55f5d-58cc-496c-83c6-638cc43f5c1c)

Does the profile look reasonable to you? Is it reasonable for p7_domaindef_ByPosteriorHeuristics to account for most of the run time?
The time breakdown of an hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs. While I haven't done much profiling of hmmsearch with the -A option on, those profiles show a much higher fraction of time being spent in the later stages of our comparison pipeline, which would include p7_domaindef_ByPosteriorHeuristics, than the profiles I've seen. In a more typical run, the SSV filter, which you're seeing as calc_step_14, takes 40-60% of the run time of the compute thread, and the final stage, which calls p7_domaindef_ByPosteriorHeuristics, takes more like 10%.

For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the UniProt TrEMBL database, with the -A option turned on:
+ 72.02% 60.86% hmmsearch hmmsearch [.] calc_band_9
+ 13.78% 13.17% hmmsearch hmmsearch [.] p7_ViterbiFilter
+ 10.59% 5.68% hmmsearch hmmsearch [.] p7_pli_NewSeq
+ 8.52% 6.89% hmmsearch hmmsearch [.] esl_hmm_Forward
+ 6.01% 0.00% hmmsearch [unknown] [.] 0x8000800080008000
+ 4.41% 0.00% hmmsearch [unknown] [.] 0x8000800080008000
4.39% 4.34% hmmsearch hmmsearch [.] forward_engine
+ 2.35% 0.00% hmmsearch [unknown] [.] 0x00005740f62e3120
+ 2.35% 0.00% hmmsearch [unknown] [.] 0x00005740f62e3ad0
1.21% 1.00% hmmsearch hmmsearch [.] p7_MSVFilter
Here, the left-hand column of percentages is the runtime fraction of the function and all its children, while the column to the right is the runtime fraction of the function itself. In this search, hmmsearch finds about 2k homologs in searching 200M+ sequences, so spends most of its time in the higher stages of the comparison pipeline. (Note that the number after calc_band_ is the number of SIMD vectors used in computing a row of the DP matrix, so varies with HMM length.)

If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output? That'd let me see what fraction of the HMM-sequence comparisons are making it to each stage of the pipeline.

Also, can you tell me what tool you used to generate those performance profiles? I haven't been happy with the ones I have access to, so would appreciate a pointer to any good ones you know of.
--Nick
Thanks for the extra details, Nick!
Indeed -- this query (see #323 (comment)) has over 1M homologs in UniRef90.
Could you run this report with the sequence and database I reported above, to check whether you get similar numbers to what I am seeing when profiling?
Absolutely, here they are:

Internal pipeline statistics summary:
-------------------------------------
Query model(s):              1  (1326 nodes)
Target sequences:            135301051  (45360760554 residues searched)
Passed MSV filter:           5218733  (0.0385713); expected 2706021.0 (0.02)
Passed bias filter:          4358682  (0.0322147); expected 2706021.0 (0.02)
Passed Vit filter:           1763645  (0.013035); expected 135301.1 (0.001)
Passed Fwd filter:           1501121  (0.0110947); expected 1353.0 (1e-05)
Initial search space (Z):    135301051  [actual number of targets]
Domain search space (domZ):  1061640  [number of targets reported over threshold]
# CPU time: 19273.26u 351.44s 05:27:04.69 Elapsed: 00:22:30.69
# Mc/sec: 44531.39
@@ New targets included: 1061640
@@ New alignment includes: 1072120 subseqs (was 1), including original query
# Alignment of 1072120 hits satisfying inclusion thresholds saved to: out.a3m

I used the perf record command (https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record):

perf record -g <jackhmmer command>

I then visualized the generated perf.data using pprof (https://github.com/google/pprof):

pprof -flame perf.data

This might require an extra conversion step (see https://github.com/google/pprof#using-pprof-with-linux-perf).
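For anyone reproducing this, a hedged sketch of that conversion step (assuming the perf_to_profile binary from https://github.com/google/perf_data_converter is built and on PATH; pprof's -http flag serves an interactive view that includes a flame graph):

$ perf record -g -o perf.data <jackhmmer command>
$ perf_to_profile -i perf.data -o profile.pb.gz
$ pprof -http=:8080 profile.pb.gz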
Thanks for the pointers on profiling tools. I've used perf, but hadn't seen the pprof visualizer.

Looking at the pipeline statistics you sent, you're seeing a very large fraction of both hits and comparisons that make it to the later stages of the pipeline, which I think explains the profiling results you're seeing. One of our rules of thumb is that the most extreme searches will find 1% of the database as hits, and you're seeing close to that. You're also seeing significantly more comparisons making it through each of the filter stages than would normally be expected, even after subtracting the comparisons that become hits. For example, the MSV filter is tuned such that we expect 2% of non-homologous sequences to pass, and you're seeing almost 4% (5M that pass MSV - 1M that are detected as homologs).

Given all that, the profile you sent doesn't surprise me that much. I'm pretty busy today but will try to run your query and database on our machines sometime in the next few days to compare.
-Nick
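To spell out the arithmetic behind that parenthetical, using the pipeline statistics quoted above (my calculation, not from the thread):

5218733 / 135301051                          ≈ 3.9%  of all targets pass MSV (2% expected)
(5218733 - 1061640) / (135301051 - 1061640)  ≈ 3.1%  of non-homologous targets still pass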
Hi,

I attempted to run AlphaFold with CFTR_HUMAN, and found that 250GB of RAM was insufficient, requiring me to allocate 500GB instead. The runtime was also significant, exceeding an hour despite using a system with decent SSD storage.

The AF3 team addressed this challenge by introducing alignment parsing and narrowing from a file, which helps reduce resource demands. However, this approach impacts performance.

It would be incredibly beneficial for the community to have a more efficient solution to handle such cases without compromising performance.
Could you explain more about what you're trying to do here? Is this a problem with AlphaFold, or some HMMER search you're trying to do on an AlphaFold database? Also, are you running HMMER directly, or through something like pyhmmer?

The cases where I've seen HMMER use large amounts of memory have involved searches that find very large numbers of hits. For example, when searching the Pfam HMMs against UniProt TrEMBL, a few of the HMMs find 1M+ hits and take over 16GB of RAM to run. How many hits did your search find? (Note, these are runs without generating alignments.)

HMMER runtimes of over an hour on large databases that find significant numbers of hits aren't unusual. When I search a recent version of UniProt TrEMBL, a typical search takes about 10 minutes, but a few searches that find many hits take multiple hours. This is because of the way HMMER's comparison pipeline works -- HMM-sequence comparisons that produce hits take much longer than ones that fail early in the pipeline.