ProtEnc: generate protein embeddings the easy way

ProtEnc aims to simplify the extraction of protein embeddings from various pre-trained models by providing simple APIs and bulk-generation scripts for the ever-growing landscape of protein language models (pLMs). Supported models currently include ESM-2 variants such as esm2_t33_650M_UR50D; the full list can be queried with protenc.list_models(), as shown below.

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length and D the embedding dimensionality.
  print(embed.shape)
  
  # Derive a single per-protein embedding vector by averaging along the sequence dimension
  protein_embed = embed.mean(0)
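
For downstream tasks such as classification or clustering, the per-protein vectors can be collected into a single feature matrix. The following is a minimal sketch that builds on the example above; the mean pooling and the [L, D] embedding shape are as documented in the comments there, while the NumPy stacking and variable names are illustrative:

import numpy as np

# Mean-pool each per-residue embedding into a single D-dimensional vector
# (assumes numpy arrays of shape [L, D], as returned with return_format='numpy')
pooled = [embed.mean(0) for embed in encoder(proteins, return_format='numpy')]

# Stack into an [N, D] feature matrix with one row per input protein
features = np.stack(pooled)
print(features.shape)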

Command-line interface

After installation, use the protenc shell command (or, equivalently, python -m protenc.tools.extract) for bulk generation and export of protein embeddings.

python -m protenc.tools.extract --help

An example run with the following configuration:

  • data-parallel inference with one worker per GPU (--data_parallel)
  • a batch size of 128 (--batch_size 128)
  • 4 data-loading workers (--num_workers 4)
  • amino-acid wildcards substituted by possible amino acids (--substitute_wildcards)
  • the LMDB writer flushing after every 1,000 records (lmdb_writer.flush_after)
  • an LMDB map size of 100 GiB (lmdb_writer.map_size)

python -m protenc.tools.extract sequences.fasta embeddings.lmdb --model_name esm2_t33_650M_UR50D --data_parallel --batch_size 128 --num_workers 4 --substitute_wildcards

By default, input and output formats are inferred from the file extensions. Run

protenc --help

for a detailed usage description.

Example

Generate protein embeddings using the ESM2 650M model for sequences provided in a FASTA file and write embeddings to an LMDB:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D

The generated embeddings are stored in an LMDB key-value store and can be conveniently accessed with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
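
As an illustration of working with the stored embeddings, the sketch below mean-pools each per-residue array into one vector per protein and compares two proteins by cosine similarity. It assumes only what is documented above, namely that read_from_lmdb yields (label, embedding) pairs with embeddings of shape [L, D] as numpy arrays; the pooling and the similarity computation themselves are illustrative:

import numpy as np
from protenc.utils import read_from_lmdb

# One mean-pooled vector per protein, keyed by its label
# (assumes numpy arrays of shape [L, D], as in the encoder example above)
vectors = {label: embed.mean(0) for label, embed in read_from_lmdb('embeddings.lmdb')}

# Cosine similarity between the first two proteins in the store
a, b = list(vectors.values())[:2]
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f'cosine similarity: {similarity:.3f}')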

Features

Input formats:

  • FASTA (support for further formats such as CSV and JSON is planned; see the Todo section)

Output format:

  • LMDB (support for further formats is planned; see the Todo section)

General:

  • Multi-GPU inference (--data_parallel)
  • FP16 inference (--amp)

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Do you have feature ideas, did you find a bug, or would you like to see support for a new model? Feel free to create an issue.

Todo

  • Support for more input formats
    • CSV
    • Parquet
    • FASTA
    • JSON
  • Support for more output formats
    • LMDB
    • HDF5
    • DataFrame
    • Pickle
  • Support for large models
    • Model offloading
    • Sharding
    • FlashAttention (via Kernl?)
  • Support for more protein language models
    • Whole ProtTrans family
    • Whole ESM family
    • AlphaFold (?)
  • Implement all remaining TODOs in code
  • Evaluation
  • Demos
  • Distributed inference
  • Maybe support some sort of optimized inference such as quantization
    • This may be up to the model providers
  • Improve documentation
  • Support translation of gene sequences
  • Add tests. We need tests!!!
