
Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Augustin-Zidek opened this issue Apr 10, 2024 · 14 comments


@Augustin-Zidek

Would it be possible to add a flag that limits the number of sequences returned by Jackhmmer/Nhmmer/Phmmer/Hmmsearch?

When HMMER is used in the context of protein structure prediction, we usually care only about the top n hits (typically ~5k) sorted by E-value; the rest are discarded.

Aligning all hits, writing them to stdout, and saving them to disk is therefore wasteful, as we then read just the top n and discard the rest.

Moreover, queries with an extreme number of hits lead to unnecessarily long run times and need a lot of RAM to store the results.

We added our own limited and brittle workaround for this in Jackhmmer, but it only affects the output MSA and doesn't address the unnecessary computation that happens before:

 p7_tophits_SortBySortkey(info->th);
 p7_tophits_Threshold(info->th, info->pli);
+if (esl_opt_IsOn(go, "--seq_limit"))
+{
+    int seq_limit = esl_opt_GetInteger(go, "--seq_limit");
+    info->th->N = ESL_MIN(info->th->N, seq_limit);
+}
  p7_tophits_CompareRanking(info->th, kh, &nnew_targets);
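
With the patch applied, the cap would then be passed on the command line, along these lines (this assumes the --seq_limit option is also registered in jackhmmer's option table, which the hunk above doesn't show; the rest of the invocation mirrors the examples later in this thread):

jackhmmer --seq_limit 5000 -o /dev/null --tblout tblout.txt --noali \
    -E 0.001 --incE 0.001 query.fasta uniref90.fasta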

A better and more robust solution would probably be to modify P7_TOPHITS to have a max_depth attribute so that the list of top hits cannot grow beyond that number.

@cryptogenomicon
Member

It's a high priority for us to provide better ways of dealing with tons of hits and large alignments.

Unlikely that we'll do something like first-N hits, though, because it makes search results dependent on the order of sequences in the target db.

Or if you want the top-N, not just first-N, then that doesn't save you computation, only space - and you should be able to achieve a top-N strategy now by filtering or downsampling an output tabular data file any way you want, and immediately discarding the original large output file.
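
For instance, a minimal sketch of such a top-N filter, assuming the standard per-sequence tblout layout (full-sequence E-value in column 5) and GNU sort for scientific-notation E-values:

grep -v '^#' tblout.txt | sort -g -k5,5 | head -n 5000 > top5000.tbl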

We're working on other downsampling methods, and also on highly compressed alignment representations.

@Augustin-Zidek
Author

Oh sorry I wasn't clear: I meant top-N sorted by e-value, not first-N. Thinking of P7_TOPHITS essentially being something like a min-heap of size N.

Or if you want the top-N, not just first-N, then that doesn't save you computation, only space - and you should be able to achieve a top-N strategy now by filtering or downsampling an output tabular data file any way you want, and immediately discarding the original large output file.

Yes, good point, you have to align to the query and sort anyway. But it would still be useful if this functionality were built in, as multi-GB MSA files can be produced, and saving those can take a non-trivial amount of time, especially with remote filesystems.

We observed this with a sequence that has over 1M hits in UniRef90 with an E-value limit of 0.001 and produces a 45 GB Stockholm MSA and a 256 MB tblout file.

@cryptogenomicon
Member

One way you can go is to generate the tblout file (and not the full MSA); filter or downsample the lines in the tblout file; fetch those subsequences (esl-sfetch for instance); and use hmmalign to align them to the query model. This saves storing the 45G MSA.
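
A hedged sketch of that pipeline, assuming the standard per-sequence tblout layout (target name in column 1, full-sequence E-value in column 5), an SSI index on the target FASTA for esl-sfetch, and that the query model is saved to a file (e.g. via jackhmmer's --chkhmm checkpoint option); file names are illustrative:

# 1. Search with tabular output only; without -A, no MSA is built.
jackhmmer -o /dev/null --noali --tblout tblout.txt --chkhmm chk \
    -N 1 -E 0.001 --incE 0.001 query.fasta uniref90.fasta

# 2. Keep the names of the 5000 best hits by full-sequence E-value.
grep -v '^#' tblout.txt | sort -g -k5,5 | head -n 5000 | awk '{print $1}' > top_hits.txt

# 3. Fetch those sequences from the target database (SSI index built once).
esl-sfetch --index uniref90.fasta
esl-sfetch -f uniref90.fasta top_hits.txt > top_hits.fasta

# 4. Align the fetched sequences to the query model saved in step 1.
hmmalign -o top_hits.sto chk-1.hmm top_hits.fasta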

@Augustin-Zidek
Author

Thanks, that is an interesting thing to try.

However, even with -o /dev/null, won't it do a lot of unnecessary I/O?

@cryptogenomicon
Member

Without -A <msafile> it won't even generate the multiple alignment internally (saving a ton of RAM and some computation), and -o /dev/null saves a lot of storage and probably a bunch of output time too. I definitely focus on the tabular output files when I'm manipulating large output volumes (which these days is nigh-always).

@Augustin-Zidek
Author

Augustin-Zidek commented Apr 10, 2024

Here are timings with and without -A:

Without -A

$ jackhmmer \
-o /dev/null --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    18m26.775s
user    332m53.777s
sys     3m47.448s

With -A

$ jackhmmer \
-o /dev/null -A out.a3m --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    20m1.193s
user    332m53.121s
sys     4m5.238s

I am testing with this sequence:

> query
PNPCPESSASFLSRITFWWITGMMVQGYRQPLESTDLWSLNKEDTSEQVVPVLVKNWKKECAKSRKQPVKIVYSSKDPAK
PKGSSKVDVNEEAEALIVKCPQKERDPSLFKVLYKTFGPYFLMSFLFKAVHDLMMFAGPEILKLLINFVNDKKAPEWQGY
FYTALLFISACLQTLVLHQYFHICFVSGMRIKTAVIGAVYRKALVITNAARKSSTVGEIVNLMSVDAQRFMDLATYINMI
WSAPLQVILALYLLWLNLGPSVLAGVAVMVLMVPLNAVMAMKTKTYQVAHMKSKDNRIKLMNEILNGIKVLKLYAWELAF
KDKVLAIRQEELKVLKKSAYLAAVGTFTWVCTPFLVALSTFAVYVTVDENNILDAQKAFVSLALFNILRFPLNILPMVIS
SIVQASVSLKRLRVFLSHEDLDPDSIQRRPIKDAGATNSITVKNATFTWARNDPPTLHGITFSVPEGSLVAVVGQVGCGK
SSLLSALLAEMDKVEGHVTVKGSVAYVPQQAWIQNISLRENILFGRQLQERYYKAVVEACALLPDLEILPSGDRTEIGEK
GVNLSGGQKQRVSLARAVYCDSDVYLLDDPLSAVDAHVGKHIFENVIGPKGLLKNKTRLLVTHAISYLPQMDVIIVMSGG
KISEMGSYQELLARDGAFAEFLRTYASAEQEQGQPEDGLAGVGGPGKEVKQMENGMLVTDTAGKQMQRQLSSSSSYSRDV
SQHHTSTAELRKPGPTEETWKLVEADKAQTGQVKLSVYWDYMKAIGLFISFLSIFLFLCNHVASLVSNYWLSLWTDDPIV
NGTQEHTQVRLSVYGALGISQGITVFGYSMAVSIGGIFASRRLHLDLLHNVLRSPISFFERTPSGNLVNRFSKELDTVDS
MIPQVIKMFMGSLFNVIGACIIILLATPMAAVIIPPLGLIYFFVQRFYVASSRQLKRLESVSRSPVYSHFNETLLGVSVI
RAFEEQERFIRQSDLKVDENQKAYYPSIVANRWLAVRLECVGNCIVLFASLFAVISRHSLSAGLVGLSVSYSLQVTTYLN
WLVRMSSEMETNIVAVERLKEYSETEKEAPWQIQDMAPPKDWPQVGRVEFRDYGLRYREDLDLVLKHINVTIDGGEKVGI
VGRTGAGKSSLTLGLFRIKESAEGEIIIDDINIAKIGLHDLRFKITIIPQDPVLFSGSLRMNLDPFSQYSDEEVWTSLEL
AHLKGFVSALPDKLNHECAEGGENLSVGQRQLVCLARALLRKTKILVLDEATAAVDLETDDLIQSTIRTQFDDCTVLTIA
HRLNTIMDYTRVIVLDKGEIQEWGSPSDLLQQRGLFYSMAKDSGLV

@cryptogenomicon
Member

(Pleasantly surprised to see the multithreaded parallelization scaling well to 22 cores; normally jackhmmer starts becoming input-bound very quickly, even at 2 cores or so. You must have some fast storage.)

@Augustin-Zidek
Author

Augustin-Zidek commented Apr 10, 2024

A few more observations.

We have one further patch to output a3m instead of Stockholm from Jackhmmer:

-if (textw > 0) esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+// Write Stockholm for a '.sto' extension, A2M for stdout or a '.a2m'/'.a3m' extension; fail otherwise.
+if (textw > 0) {
+    if (afp_ext && strcmp(afp_ext, "sto") == 0) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+    } else if (strcmp(esl_opt_GetString(go, "-A"), "/dev/stdout") == 0 ||
+               (afp_ext && (strcmp(afp_ext, "a3m") == 0 || strcmp(afp_ext, "a2m") == 0))) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_A2M);
+    } else {
+        p7_Fail("Bad output MSA format %s, only a2m/a3m and sto are supported\n", afp_ext);
+    }
+}
 else           esl_msafile_Write(afp, msa, eslMSAFILE_PFAM);

This patch reduces the output MSA size from 45 GB (Stockholm) to 1.6 GB (a3m) and slightly reduces the runtime to:

real    19m2.367s
user    324m10.399s
sys     2m23.680s

With this patch combined with the --seq_limit patch, we see significantly better runtime (and still only a 1.6 GB a3m output MSA):

real    15m22.416s
user    321m15.616s
sys     1m49.013s

It therefore seems to me that there must be some non-trivial amount of work that is eliminated by applying the naive sequence limit patch.

@Augustin-Zidek
Author

I did a bit of profiling and here is the profile for the offending query with --cpu 1 (and with our --seq_limit and a3m-output patches applied).

[flame graph: profile of the slow query with ~1M hits]

For comparison, here is the profile for a query that runs 2.5x faster because it doesn't have that many hits:

[flame graph: profile of the faster query]

Does the profile look reasonable to you? Is it reasonable for p7_domaindef_ByPosteriorHeuristics to account for most of the run time?

@npcarter
Member

npcarter commented Apr 19, 2024 via email

@Augustin-Zidek
Copy link
Author

Augustin-Zidek commented Apr 22, 2024

Thanks for the extra details, Nick!

The time breakdown of a hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs.

Indeed -- this query (see #323 (comment)) has over 1M homologs in UniRef90.

 For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the Uniprot TrEMBL database

Could you run this report with the sequence and database I reported above to check whether you get numbers similar to what I am seeing when profiling?

If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output?

Absolutely, here they are:

Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (1326 nodes)
Target sequences:                  135301051  (45360760554 residues searched)
Passed MSV filter:                   5218733  (0.0385713); expected 2706021.0 (0.02)
Passed bias filter:                  4358682  (0.0322147); expected 2706021.0 (0.02)
Passed Vit filter:                   1763645  (0.013035); expected 135301.1 (0.001)
Passed Fwd filter:                   1501121  (0.0110947); expected 1353.0 (1e-05)
Initial search space (Z):          135301051  [actual number of targets]
Domain search space  (domZ):         1061640  [number of targets reported over threshold]
# CPU time: 19273.26u 351.44s 05:27:04.69 Elapsed: 00:22:30.69
# Mc/sec: 44531.39

@@ New targets included:   1061640
@@ New alignment includes: 1072120 subseqs (was 1), including original query
# Alignment of 1072120 hits satisfying inclusion thresholds saved to: out.a3m

Also, can you tell me what tool you used to generate those performance profiles?

I used the perf record command (https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record):

perf record -g <jackhmmer command>

I then visualised the generated perf.data using pprof (https://github.com/google/pprof):

pprof -flame perf.data

This might require an extra conversion step (see https://github.com/google/pprof#using-pprof-with-linux-perf).

@npcarter
Member

npcarter commented Apr 22, 2024 via email

@hegelab

hegelab commented Nov 21, 2024

Hi,
I attempted to run AlphaFold with CFTR_HUMAN and found that 250 GB of RAM was insufficient, requiring me to allocate 500 GB instead. The runtime was also significant, exceeding an hour (for the whole data pipeline; I did not benchmark the sequence search alone) despite using a system with decent SSD storage.
The AF3 team addressed this challenge by parsing and narrowing the alignment from a file, which helps reduce resource demands. However, this approach impacts performance.
It would be incredibly beneficial for the community to have a more efficient solution that handles such cases without compromising performance.

@npcarter
Member

npcarter commented Nov 21, 2024 via email
