BioVossEncoder

Encoding biological sequences into Voss representation

A Julia package for encoding biological sequences into Voss representations

Can encode DNA, RNA, and protein sequences.
Provides the fastest encoding of biological sequences into Voss representations (a.k.a. one-hot vectors) among the approaches compared in the Benchmarks section.
Can handle ambiguous nucleotides and amino acids.
Provides a simple and intuitive API for encoding biological sequences.
Includes a dedicated type, VossEncoder, that matches the BioSequences types.
Can be used for single-nucleotide encoding: vv = vossvector(dna"ACGT", DNA_A).

Warning

This package uses internals from BioSequences and BitMatrix types, which might not be stable. Use with caution.

Installation

BioVossEncoder is a Julia package. To install it, open Julia's interactive session (the REPL), press the ] key to enter package mode, and type the following command:

pkg> add BioVossEncoder

Encoding BioSequences

This package provides a simple and fast way to encode biological sequences into Voss representations. The main struct is VossEncoder, a wrapper around a BitMatrix that stores the encoded sequence together with its corresponding alphabet. The following example shows how to encode a DNA sequence into a Voss matrix.

julia> using BioSequences, BioVossEncoder

julia> seq = dna"ACGT"

julia> VossEncoder(seq)

4×4 Voss Matrix of DNAAlphabet{4}():
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
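To make the layout concrete, here is a minimal sketch in plain Julia (no packages) of what a Voss matrix contains: one row per alphabet symbol and one column per sequence position. The helper name `voss_sketch` is hypothetical and is not part of BioVossEncoder. It assumes a fixed four-letter alphabet and does not handle ambiguous symbols, so it illustrates the representation only, not the package's implementation.

```julia
# Hypothetical helper, not part of BioVossEncoder: a plain-Julia Voss (one-hot) matrix.
# Rows follow the order of `alphabet`; columns follow the positions in `s`.
function voss_sketch(s::AbstractString, alphabet::Vector{Char} = ['A', 'C', 'G', 'T'])
    # Comparing a column of symbols against a 1×N row of characters
    # broadcasts to a BitMatrix of size length(alphabet) × N.
    return alphabet .== permutedims(collect(s))
end

M = voss_sketch("ACGT")   # 4×4 BitMatrix with ones on the diagonal, as in the output above
```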

For simplicity, the VossEncoder struct provides a bitmatrix property that returns the BitMatrix representation of the sequence.

julia> VossEncoder(seq).bitmatrix

4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

Similarly, the vossmatrix function uses the VossEncoder struct internally and returns the BitMatrix representation of a sequence directly.

julia> vossmatrix(seq)

4×4 BitMatrix:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1

Creating a Voss vector of a sequence

Sometimes it is useful to encode a sequence into a Voss vector (i.e., a bit vector marking the positions at which one symbol of the molecule's alphabet occurs in the sequence).

This package provides the function vossvector, which returns the Voss vector of a sequence given a BioSequence and a specific symbol (a BioSymbol, such as a DNA nucleotide or an amino acid).

julia> vossvector(seq, DNA_A)

4-element view(::BitMatrix, 1, :) with eltype Bool:
 1
 0
 0
 0

Note that the output is a view into the BitMatrix representation of the sequence rather than a copy. This is done for performance reasons.
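The practical difference between a view and a copy can be sketched with Base Julia alone. The matrix below is a plain-Julia stand-in for the Voss matrix of "ACGT" and is an assumption for illustration; it is not produced by the package.

```julia
# Plain-Julia stand-in for a Voss matrix of "ACGT" (rows: A, C, G, T).
M = ['A', 'C', 'G', 'T'] .== permutedims(collect("ACGT"))

row_view = view(M, 1, :)   # shares memory with M, no data is copied
row_copy = M[1, :]         # independent copy of the first row

M[1, 2] = true             # mutate the parent matrix

row_view[2]                # true: the view reflects the change
row_copy[2]                # false: the copy does not
```

If an independent vector is needed, calling `copy` on the view detaches it from the parent matrix.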

Related Ideas

BioSequences.jl direct OneHot recipe:

using BioSymbols, BioSequences

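function onehot(s::NucSeq)
    # One row per nucleotide (A, C, G, T), one column per position.
    M = falses(4, length(s))
    for (i, nt) in enumerate(s)
        # Bit pattern of the compatible bases; ambiguous symbols set several bits.
        bits = compatbits(nt)
        while !iszero(bits)
            M[trailing_zeros(bits) + 1, i] = true
            bits &= bits - one(bits) # clear lowest bit
        end
    end
    return M
end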

julia> onehot(dna"TGNTKCTW-T")

4×10 BitMatrix:
 0  0  1  0  0  0  0  1  0  0
 0  0  1  0  0  1  0  0  0  0
 0  1  1  0  1  0  0  0  0  0
 1  0  1  1  1  0  1  1  0  1
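The inner loop of this recipe relies on two bit operations: `trailing_zeros` locates the lowest set bit, and `bits & (bits - 1)` clears it. The sketch below applies the same loop to raw `UInt8` patterns using only Base Julia. The helper name `set_bit_positions` is hypothetical, and the literal bit patterns are chosen to mirror the K, N, and gap columns of the output above.

```julia
# Hypothetical helper illustrating the loop body of `onehot` on a raw bit pattern.
function set_bit_positions(bits::UInt8)
    positions = Int[]
    while !iszero(bits)
        push!(positions, trailing_zeros(bits) + 1)  # 1-based index of the lowest set bit
        bits &= bits - one(bits)                    # clear that bit
    end
    return positions
end

set_bit_positions(0b1100)  # [3, 4]: rows 3 and 4, like the K column above
set_bit_positions(0b1111)  # [1, 2, 3, 4]: every row, like the N column
set_bit_positions(0b0000)  # empty: no row is set, like the gap column
```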

Using BioMarkovChains.jl reinterpret trick:

julia> function onehot_reinterpretator(seq::BioSequence)
    seqlen = length(seq)
    modvect = Vector{Int8}(undef, seqlen)
    # Remap G and T so that A, C, G, T reinterpret to the row indices 1, 2, 3, 4.
    modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
    reinterpreter = nt -> reinterpret.(Int8, modifier.(nt))[1]
    @inbounds for i in 1:seqlen
        modvect[i] = reinterpreter(seq[i])
    end
    # Compare the column 1:4 against the 1×N row of indices to get a 4×N BitMatrix.
    return 1:4 .== permutedims(modvect)
end

SequenceTokenizers.jl: A Julia package for tokenizing biological sequences, providing efficient and flexible tokenization methods for various sequence types. It focuses on the String type.

julia> using SequenceTokenizers

julia> function onehot_tokenizer(str::String)
    alphabet = ['A', 'C', 'G', 'T'] 
    tokenizer = SequenceTokenizer(alphabet, 'N')
    tokens = tokenizer([str]) 
    return onehot_batch(tokenizer, tokens)
end # 5×N×1 Array{Float32, 3}

julia> onehot_tokenizer("ACATCAGCATC")

5×11×1 Array{Float32, 3}:
[:, :, 1] =
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0

OneHotArrays.jl from FluxML (str is the random 10^6-character string defined in the Benchmarks section):

julia> using OneHotArrays

julia> onehotbatch(str, ('A', 'C', 'G','T'))

4×1000000 OneHotMatrix(::Vector{UInt32}) with eltype Bool:
 ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  …  ⋅  ⋅  ⋅  1  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅
 1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  1  ⋅  1     ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  1  1  ⋅  1
 ⋅  1  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  ⋅  1  ⋅  1  ⋅     ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  1  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅
 ⋅  ⋅  1  ⋅  ⋅  1  1  1  1  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅     1  ⋅  1  ⋅  ⋅  ⋅  1  1  ⋅  1  1  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  ⋅  1  ⋅  ⋅  ⋅  ⋅  1  ⋅

Fasta2onehot.jl: A Julia package for converting FASTA sequences into one-hot encoded matrices.
A Discourse thread on one-hot encoding of a String:

With StatsBase.jl

julia> using StatsBase

function onehot_indicator(str::String)
    codeunits(str) |> indicatormat
end # returns a Matrix{Bool} with one row per distinct code unit

With collect: The output is a Vector{BitVector} (one BitVector per nucleotide) rather than a single matrix, but it is still a valid one-hot encoding.

julia> function onehot_collector(str::String)::Vector{BitVector}
    [collect(str) .== x for x ∈ ['A', 'C', 'G', 'T']]
end # returns 4-element Vector{BitVector}
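If a single matrix is preferred over a vector of bit vectors, the four rows can be stacked with Base functions. This is a sketch under the assumption that the input only contains A, C, G, and T; the comprehension is inlined from the function above so that the snippet runs on its own.

```julia
# Same construction as `onehot_collector`, inlined so the snippet is self-contained.
rows = [collect("ACGTA") .== x for x in ['A', 'C', 'G', 'T']]

# hcat places each row vector in a column (N×4); permutedims flips it to 4×N,
# so that row j of M is rows[j].
M = permutedims(reduce(hcat, rows))
```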

With permutedims and reinterpret:

julia> function onehot_permutator(seq::BioSequence)
    modifier(value) = (value == DNA_G) ? DNA_M : (value == DNA_T) ? DNA_G : value
    reinterpreter = seq -> reinterpret.(Int8, modifier.(seq))[1]
    return 1:4 .== permutedims(reinterpreter.(seq))
end

A more efficient variant, using codeunits and permutedims on a String:

julia> function onehot_codeunits(str::String)
                # A     C     G     T  
    return UInt8[0x41, 0x43, 0x47, 0x54] .== permutedims(codeunits(str))
end
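The hex constants and the broadcast shape used here can be checked with Base Julia alone. The sketch below uses an arbitrary example string and is meant only to illustrate the comparison, not to benchmark it.

```julia
# The hex constants are the ASCII code units of the four nucleotide letters.
codes = UInt8.(collect("ACGT"))          # [0x41, 0x43, 0x47, 0x54]

# A length-4 column compared against a 1×N row broadcasts to a 4×N BitMatrix.
M = codes .== permutedims(codeunits("GATTACA"))
size(M)                                   # (4, 7)
```

Characters outside A, C, G, and T match none of the four code units, so their columns are all zeros.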

Benchmarks

julia> using BenchmarkTools

str = rand(codeunits("ACGT"),10^6) |> String
seq = randdnaseq(10^6)

# BioVossEncoder.jl
@btime vossmatrix($seq); # 32.056 μs (4 allocations: 488.42 KiB)

# Others
@btime onehot($seq); # 4.408 ms (4 allocations: 488.42 KiB)
@btime onehot_codeunits($str); # 8.124 ms (6 allocations: 488.48 KiB)
@btime onehot_reinterpretator($seq); # 10.140 ms (7 allocations: 1.43 MiB)
@btime onehot_permutator(seq); # 9.670 ms (10 allocations: 2.38 MiB)
@btime onehot_indicator($str); # 17.413 ms (14 allocations: 3.82 MiB)
@btime onehot_collector($str); # 12.659 ms (32 allocations: 15.74 MiB)
@btime onehot_tokenizer(str) # 22.816 ms (19 allocations: 26.70 MiB)

# From the special FluxML ecosystem
@btime onehotbatch($str, ('A', 'C', 'G','T')); # 11.418 ms (3 allocations: 3.81 MiB)

versioninfo()

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, skylake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
