NOTE_ON_STRING.md

SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

Why Use a Hand-Rolled Byte Slice as "CustomString" Instead of Rust String?

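A Rust String (and &str) is a slice of valid UTF-8 bytes. UTF-8 is a variable-length encoding in which each character takes 1 to 4 bytes, so there is no way to access the "character" at an arbitrary index in O(1) time: finding the n-th character means scanning from the start of the string.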

This means any algorithm that works on "character" index positions is slow on a Rust String, because each lookup costs O(n) rather than O(1).
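To make the cost concrete, here is a minimal sketch using only the Rust standard library. The helper names `nth_char` and `utf8_widths` are hypothetical and exist only for this illustration.

```rust
/// Returns the `n`-th character of `s`.
/// `chars()` decodes UTF-8 from the start of the string, so this is O(n).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the number of UTF-8 bytes used by each character of `s`.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}
```

For the string "ก a", `utf8_widths` reports widths of 3, 1, and 1 bytes, so a character index cannot be turned into a byte offset without walking the string.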

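Hence fixed_bytes_str: it transforms a slice of valid UTF-8 bytes into a slice in which every character occupies exactly 4 bytes, left-padded with zero bytes (\x00). The character at index i then always sits at bytes 4i to 4i+3, which gives O(1) access.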
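The transformation can be sketched as follows. This is an illustration under stated assumptions, not the project's actual fixed_bytes_str code: the names `WIDTH`, `to_fixed_width`, and `char_at` are hypothetical.

```rust
/// Every character occupies exactly this many bytes after conversion.
const WIDTH: usize = 4;

/// Converts UTF-8 text into a fixed-width byte vector (hypothetical helper).
/// Each character's UTF-8 bytes are left-padded with 0x00 up to WIDTH bytes.
fn to_fixed_width(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.chars().count() * WIDTH);
    let mut buf = [0u8; WIDTH];
    for ch in text.chars() {
        let encoded = ch.encode_utf8(&mut buf).as_bytes();
        // Left padding: zero bytes first, then the original UTF-8 bytes.
        out.extend(std::iter::repeat(0u8).take(WIDTH - encoded.len()));
        out.extend_from_slice(encoded);
    }
    out
}

/// Returns the 4-byte cell of the character at `index` in O(1),
/// or None if the index is out of range (hypothetical helper).
fn char_at(fixed: &[u8], index: usize) -> Option<&[u8]> {
    let start = index.checked_mul(WIDTH)?;
    fixed.get(start..start.checked_add(WIDTH)?)
}
```

The trade-off is memory: Thai text grows from 3 to 4 bytes per character and ASCII text from 1 to 4, in exchange for constant-time indexing.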

Consequently, regular expressions that match against this representation must be padded the same way: each Unicode character in a pattern is prefixed with \x00 bytes until it is 4 bytes long.

Thai characters are 3 bytes long in UTF-8, so every Thai character in a regex is padded with a single \x00.

A "space" is 1 byte long, so in a regex it is padded with \x00\x00\x00.
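The padding rule for pattern characters can be sketched like this. The helper `padded_escape` is hypothetical, and writing every byte as a `\xNN` escape is an assumption made for readability. The project's real patterns may be written differently, and such a pattern only makes sense for a regex engine that matches raw bytes.

```rust
/// Builds the padded form of one pattern character as `\xNN` escapes
/// (hypothetical helper): `\x00` padding first, then the UTF-8 bytes.
fn padded_escape(ch: char) -> String {
    let mut buf = [0u8; 4];
    let encoded = ch.encode_utf8(&mut buf).as_bytes();
    let mut out = String::new();
    // One `\x00` for each byte missing from the 4-byte cell.
    for _ in encoded.len()..4 {
        out.push_str("\\x00");
    }
    // The character's own UTF-8 bytes, written as hex escapes.
    for byte in encoded {
        out.push_str(&format!("\\x{:02X}", byte));
    }
    out
}
```

With this sketch, the Thai character ก (UTF-8 bytes E0 B8 81) becomes `\x00\xE0\xB8\x81`, and a space becomes `\x00\x00\x00\x20`.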

References