Optimize Matching by Avoiding UTF-8 Decoding for ASCII Input #13

New issue

Closed

opened 2026-03-11 18:09:45 +03:00 by NiXTheDev · 0 comments

NiXTheDev commented

2026-03-11 18:09:45 +03:00

(Migrated from github.com)

Goal: Improve performance for ASCII-heavy patterns by working directly on byte slices instead of Vec<char>.

Currently, NfaSimulator converts the input &str to Vec<char>, which allocates and decodes UTF-8. For patterns that only match ASCII (most regexes), this overhead is unnecessary.

Proposed Solution:
Detect at compile time whether the pattern contains any non-ASCII literals or Unicode features. If not, the engine can operate on &[u8] and treat bytes as characters (assuming ASCII). If the pattern has Unicode, fall back to the char-based approach.

Implementation Plan:

Add a flag to Nfa indicating whether the pattern is ASCII-only.
In engine.rs, have two matching implementations: one for Vec<char> and one for &[u8]. Choose based on the flag.
For the byte-based engine, characters are u8. Character classes and ranges would need to be adapted (ASCII ranges only).
Ensure that case-insensitive matching works correctly for ASCII (e.g., to_ascii_lowercase).

Benefits:

Reduced memory allocation and decoding time.
Faster matching for common use cases.

Challenges:

Patterns that mix ASCII and Unicode would need to use the Unicode path.
Ensure correctness with multi-byte characters; they should never appear in ASCII-only mode.

**Goal**: Improve performance for ASCII-heavy patterns by working directly on byte slices instead of `Vec<char>`. Currently, `NfaSimulator` converts the input `&str` to `Vec<char>`, which allocates and decodes UTF-8. For patterns that only match ASCII (most regexes), this overhead is unnecessary. **Proposed Solution**: Detect at compile time whether the pattern contains any non-ASCII literals or Unicode features. If not, the engine can operate on `&[u8]` and treat bytes as characters (assuming ASCII). If the pattern has Unicode, fall back to the char-based approach. **Implementation Plan**: 1. Add a flag to `Nfa` indicating whether the pattern is ASCII-only. 2. In `engine.rs`, have two matching implementations: one for `Vec<char>` and one for `&[u8]`. Choose based on the flag. 3. For the byte-based engine, characters are `u8`. Character classes and ranges would need to be adapted (ASCII ranges only). 4. Ensure that case-insensitive matching works correctly for ASCII (e.g., `to_ascii_lowercase`). **Benefits**: - Reduced memory allocation and decoding time. - Faster matching for common use cases. **Challenges**: - Patterns that mix ASCII and Unicode would need to use the Unicode path. - Ensure correctness with multi-byte characters; they should never appear in ASCII-only mode.