SPDX-FileCopyrightText | SPDX-License-Identifier |
---|---|
2024 PyThaiNLP Project |
Apache-2.0 |
Rust String
(and &str
) is actually a slice of valid UTF-8 bytes which is
variable-length. It has no way of accessing a random index UTF-8 "character"
with O(1) time complexity.
This means any algorithm with operations based on "character" index position will be horribly slow on Rust String.
Hence, fixed_bytes_str
which is transformed from a slice of valid UTF-8
bytes into a slice of 4-bytes length - padded left with 0.
Consequently, regular expressions must be padded with \x00
for each Unicode
character to have 4 bytes.
Thai characters are 3-bytes length, so every Thai char in regex is padded
with \x00
one time.
For "space" in regex, it is padded with \x00\x00\x00
.
- Rust String indexing and internal representation
- Read more about UTF-8 at Wikipedia.