ProtEnc simplifies the extraction of protein embeddings from pre-trained protein language models (pLMs) by providing a simple Python API and bulk generation scripts that keep pace with the ever-growing landscape of pLMs. Supported models can be listed with protenc.list_models().
pip install protenc
import protenc
# List available models
print(protenc.list_models())
# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')
proteins = [
    'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
    'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]
for embed in encoder(proteins, return_format='numpy'):
    # Embeddings have shape [L, D] where L is the sequence length and D the embedding dimensionality.
    print(embed.shape)
    # Derive a single per-protein embedding vector by averaging along the sequence dimension
    embed.mean(0)
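Averaging along the sequence dimension yields fixed-size vectors, which makes proteins of different lengths directly comparable. The following is a minimal sketch of that idea, assuming per-residue embeddings arrive as NumPy arrays of shape [L, D]. Random arrays stand in for real encoder output, the dimensionality of 8 is arbitrary, and the helpers mean_pool and cosine_similarity are illustrative names that are not part of ProtEnc.

```python
import numpy as np


def mean_pool(embed: np.ndarray) -> np.ndarray:
    """Average a [L, D] per-residue embedding over the sequence axis, giving [D]."""
    return embed.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for encoder output: two proteins of different lengths, D = 8 (arbitrary).
rng = np.random.default_rng(0)
embeds = [rng.normal(size=(65, 8)), rng.normal(size=(71, 8))]

# After pooling, both proteins are represented by vectors of the same size.
pooled = [mean_pool(e) for e in embeds]
print(pooled[0].shape, pooled[1].shape)
print(cosine_similarity(pooled[0], pooled[1]))
```

Mean pooling is only one way to summarize a sequence; it discards positional information, so it suits whole-protein comparisons better than residue-level tasks.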
After installation, use the protenc shell command for bulk generation and export of protein embeddings.
python -m protenc.tools.extract --help
Example run:
- one worker per GPU
- batch size 128
- 4 workers
- use data parallel
- substitute amino acid wildcards with possible substitutes
- lmdb_writer.flush_after 1000
- lmdb_writer.map_size 100 GiB
python -m protenc.tools.extract sequences.fasta embeddings.lmdb --model_name esm2_t33_650M_UR50D --data_parallel --batch_size 128 --num_workers 4 --substitute_wildcards
By default, input and output formats are inferred from the file extensions. Run protenc --help for a detailed usage description.
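Because the input format is inferred from the file extension, a FASTA input only needs the right suffix and standard FASTA content. The sketch below prepares such a file with the Python standard library. The helper write_fasta and the record identifiers are illustrative and not part of ProtEnc, and how ProtEnc maps FASTA headers to output labels is an open assumption here.

```python
from pathlib import Path


def write_fasta(records: dict[str, str], path: str, width: int = 60) -> None:
    """Write {identifier: sequence} pairs as FASTA, wrapping sequence lines."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        # Wrap long sequences; FASTA allows a sequence to span several lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    Path(path).write_text("\n".join(lines) + "\n")


# Placeholder identifiers paired with the two sequences from the API example.
records = {
    "protein_1": "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG",
    "protein_2": "KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE",
}
write_fasta(records, "sequences.fasta")
```

The resulting sequences.fasta can then be passed as the input path to the bulk extraction command.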
Example
Generate protein embeddings with the ESM2 650M model for sequences provided in a FASTA file and write the embeddings to an LMDB database:
protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D
The generated embeddings are stored in an LMDB key-value store and can be read back with the read_from_lmdb utility function:
from protenc.utils import read_from_lmdb
for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
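For downstream analysis it is often convenient to gather all stored embeddings into a single array. The sketch below assumes the iterator yields (label, embedding) pairs in which each embedding is a [L, D] array; the document shows the pair structure but not the stored shape, so that part is an assumption. The helper pool_to_matrix is an illustrative name, and a small hand-made list stands in for read_from_lmdb('embeddings.lmdb').

```python
import numpy as np


def pool_to_matrix(pairs):
    """Mean-pool each [L, D] embedding and stack the results into an [N, D] matrix."""
    labels, vectors = [], []
    for label, embed in pairs:
        labels.append(label)
        vectors.append(np.asarray(embed).mean(axis=0))
    return labels, np.stack(vectors)


# Stand-in for read_from_lmdb('embeddings.lmdb'): (label, [L, D] array) pairs.
fake_pairs = [
    ("protein_1", np.ones((5, 4))),
    ("protein_2", np.zeros((7, 4))),
]
labels, matrix = pool_to_matrix(fake_pairs)
print(labels, matrix.shape)
```

Row i of the matrix corresponds to labels[i], so the two outputs can be passed together to clustering or classification code.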
Features
Input formats:
- CSV
- JSON
- FASTA
Output format:
- LMDB
General:
- Multi-GPU inference (--data_parallel)
- FP16 inference (--amp)
Clone the repository:
git clone https://github.com/kklemon/protenc.git
Install dependencies via Poetry:
poetry install
Have feature ideas or found a bug? Love to see support for a new model? Feel free to create an issue.
- Support for more input formats
  - CSV
  - Parquet
  - FASTA
  - JSON
- Support for more output formats
  - LMDB
  - HDF5
  - DataFrame
  - Pickle
- Support for large models
  - Model offloading
  - Sharding
- FlashAttention (via Kernl?)
- Support for more protein language models
  - Whole ProtTrans family
  - Whole ESM family
  - AlphaFold (?)
- Implement all remaining TODOs in code
- Evaluation
- Demos
- Distributed inference
- Maybe support some sort of optimized inference such as quantization
  - This may be up to the model providers
- Improve documentation
- Support translation of gene sequences
- Add tests. We need tests!!!