ProtEnc: generate protein embeddings the easy way

ProtEnc aims to simplify the extraction of protein embeddings from pre-trained models. It provides a simple Python API and bulk generation scripts for the ever-growing landscape of protein language models (pLMs). Currently supported models:

ProtTrans family
ESM
CARP
AlphaFold (coming soon™)
OmegaPLM (coming soon™)

Usage

Installation

pip install protenc

Python API

import protenc

# List available models
print(protenc.list_models())

# Load encoder model
encoder = protenc.get_encoder('esm2_t30_150M_UR50D', device='cuda')

proteins = [
  'MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG',
  'KALTARQQEVFDLIRDHISQTGMPPTRAEIAQRLGFRSPNAAEEHLKALARKGVIEIVSGASRGIRLLQEE'
]

for embed in encoder(proteins, return_format='numpy'):
  # Embeddings have shape [L, D], where L is the sequence length
  # and D is the embedding dimensionality.
  print(embed.shape)

  # Derive a single per-protein embedding vector of shape [D]
  # by averaging along the sequence dimension.
  protein_embed = embed.mean(0)
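
As a minimal sketch of what the pooled vectors can be used for, the snippet below compares two proteins by cosine similarity. Random arrays stand in for the encoder output so that it runs without loading a model. The helper names mean_pool and cosine_similarity, the sequence lengths and D = 640 are illustrative assumptions, not part of ProtEnc.

```python
import numpy as np


def mean_pool(embed: np.ndarray) -> np.ndarray:
    """Collapse a per-residue embedding of shape [L, D] into one vector of shape [D]."""
    return embed.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-ins for encoder output: two proteins of different length, same D.
rng = np.random.default_rng(0)
embed_a = rng.normal(size=(65, 640))
embed_b = rng.normal(size=(71, 640))

vec_a, vec_b = mean_pool(embed_a), mean_pool(embed_b)
print(vec_a.shape, vec_b.shape)
print(cosine_similarity(vec_a, vec_b))
```

Mean pooling removes the dependence on sequence length, which is why two proteins of different length can be compared directly.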

Command-line interface

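After installation, bulk generation and export of protein embeddings is available from the command line, either through the protenc shell command or through the protenc.tools.extract module. To list the options of the module: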

python -m protenc.tools.extract --help

Example run with the following settings:

one worker per GPU
batch size 128 (--batch_size 128)
4 workers (--num_workers 4)
use data parallel (--data_parallel)
substitute amino acid wildcards by possible substitutes (--substitute_wildcards)
lmdb_writer.flush_after 1000
lmdb_writer.map_size 100 GiB

python -m protenc.tools.extract sequences.fasta embeddings.lmdb --model_name esm2_t33_650M_UR50D --data_parallel --batch_size 128 --num_workers 4 --substitute_wildcards
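
To illustrate what wildcard substitution means, here is a rough sketch that replaces the IUPAC ambiguity codes B, Z, J and X with one of the residues they can stand for. How --substitute_wildcards chooses a replacement is not described here, so the random sampling below and the function name substitute_wildcards are assumptions made for illustration only.

```python
import random

# The 20 standard amino acids and the IUPAC ambiguity codes that map onto them.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
WILDCARDS = {
    "B": "DN",      # aspartate or asparagine
    "Z": "EQ",      # glutamate or glutamine
    "J": "IL",      # isoleucine or leucine
    "X": STANDARD,  # any residue
}


def substitute_wildcards(sequence: str, rng: random.Random) -> str:
    """Replace each wildcard with one randomly chosen candidate residue."""
    return "".join(
        rng.choice(WILDCARDS[aa]) if aa in WILDCARDS else aa
        for aa in sequence
    )


# Seeded generator so that repeated runs give the same substitution.
rng = random.Random(0)
print(substitute_wildcards("MKTBXZJLL", rng))
```

The sequence length is unchanged, so per-residue embeddings still line up with the original positions.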

By default, input and output formats are inferred from the file extensions. Run

protenc --help

for a detailed usage description.
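
Format inference from file extensions could look roughly like the sketch below. The mapping tables and the function infer_format are hypothetical and only mirror the formats listed in this README; the actual lookup inside ProtEnc may differ.

```python
from pathlib import Path

# Hypothetical extension tables, limited to the formats named in this README.
INPUT_FORMATS = {".csv": "csv", ".json": "json", ".fasta": "fasta"}
OUTPUT_FORMATS = {".lmdb": "lmdb"}


def infer_format(path: str, known: dict) -> str:
    """Return the format name for a path based on its (case-insensitive) extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in known:
        raise ValueError(f"Cannot infer format from extension {suffix!r}")
    return known[suffix]


print(infer_format("proteins.fasta", INPUT_FORMATS))
print(infer_format("embeddings.lmdb", OUTPUT_FORMATS))
```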

Example

Generate protein embeddings with the ESM2 650M model for sequences provided in a FASTA file and write them to an LMDB database:

protenc proteins.fasta embeddings.lmdb --model_name=esm2_t33_650M_UR50D

The generated embeddings are stored in an LMDB key-value store and can be accessed with the read_from_lmdb utility function:

from protenc.utils import read_from_lmdb

for label, embed in read_from_lmdb('embeddings.lmdb'):
    print(label, embed)
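
A common next step is to turn the stored embeddings into a lookup of per-protein vectors. The sketch below assumes that each stored embedding is a NumPy-compatible array of shape [L, D], as in the Python API example. The helper pooled_lookup is an illustrative name, and a small in-memory list stands in for read_from_lmdb('embeddings.lmdb') so that the snippet runs without a database.

```python
from typing import Dict, Iterable, Tuple

import numpy as np


def pooled_lookup(pairs: Iterable[Tuple[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Map each label to the mean of its [L, D] embedding along the sequence axis."""
    return {label: np.asarray(embed).mean(axis=0) for label, embed in pairs}


# Stand-in for read_from_lmdb('embeddings.lmdb'): any iterable of (label, array) pairs.
fake_pairs = [
    ("protein_1", np.ones((65, 8))),
    ("protein_2", np.zeros((71, 8))),
]

lookup = pooled_lookup(fake_pairs)
print({label: vec.shape for label, vec in lookup.items()})
```

Because pooled_lookup accepts any iterable of pairs, the generator returned by read_from_lmdb can be passed to it directly under the assumption stated above.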

Features

Input formats:

CSV
JSON
FASTA

Output formats:

LMDB
HDF5 (coming soon)

General:

Multi-GPU inference (--data_parallel)
FP16 inference (--amp)

Development

Clone the repository:

git clone https://github.com/kklemon/protenc.git

Install dependencies via Poetry:

poetry install

Contribution

Have a feature idea or found a bug? Would you love to see support for a new model? Feel free to create an issue.
