Optimize Matching by Avoiding UTF-8 Decoding for ASCII Input #13
Reference
NiXTheDev/Ogex#13
Goal: Improve performance for ASCII-heavy patterns by working directly on byte slices instead of `Vec<char>`.

Currently, `NfaSimulator` converts the input `&str` to a `Vec<char>`, which allocates and decodes UTF-8. For patterns that only match ASCII (most regexes), this overhead is unnecessary.

Proposed Solution:
Detect at compile time whether the pattern contains any non-ASCII literals or Unicode features. If it does not, the engine can operate on `&[u8]` and treat bytes as characters (assuming ASCII). If the pattern uses Unicode, fall back to the char-based approach.

Implementation Plan:
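1. Add a flag to `Nfa` indicating whether the pattern is ASCII-only.
2. In `engine.rs`, have two matching implementations: one for `Vec<char>` and one for `&[u8]`. Choose based on the flag.
3. In the byte-based implementation, match transitions against `u8`. Character classes and ranges would need to be adapted (ASCII ranges only).
4. Case-insensitive matching can use ASCII case folding (`to_ascii_lowercase`).

Benefits: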
Challenges:
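
To make the flag-and-dispatch idea from the implementation plan concrete, here is a minimal sketch. It is not the Ogex code: the `Nfa` struct below is a hypothetical stand-in that only holds a literal pattern, and `compile`, `contains_seq`, and `is_match` are names invented for illustration. The ASCII check uses `str::is_ascii`, which is only a rough approximation. A real check would also have to reject Unicode features that are written with ASCII characters, such as Unicode property classes.

```rust
/// Hypothetical stand-in for the compiled pattern. The real `Nfa` holds
/// states and transitions; this toy version holds a literal only.
struct Nfa {
    literal: String,
    /// Set once at compile time: true if the pattern has no non-ASCII content.
    ascii_only: bool,
}

impl Nfa {
    /// Assumption: "ASCII-only" is approximated by `str::is_ascii` on the
    /// pattern text. A real engine would inspect the parsed pattern instead.
    fn compile(pattern: &str) -> Self {
        Nfa {
            literal: pattern.to_string(),
            ascii_only: pattern.is_ascii(),
        }
    }
}

/// Naive substring search shared by both paths; works on bytes or chars.
fn contains_seq<T: PartialEq>(needle: &[T], haystack: &[T]) -> bool {
    if needle.is_empty() {
        return true; // `windows(0)` would panic, so handle this case first
    }
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Chooses the matching path based on the compile-time flag.
fn is_match(nfa: &Nfa, input: &str) -> bool {
    if nfa.ascii_only {
        // Byte path: no allocation and no UTF-8 decoding of the input.
        contains_seq(nfa.literal.as_bytes(), input.as_bytes())
    } else {
        // Char path: decode both sides, as the current simulator does.
        let needle: Vec<char> = nfa.literal.chars().collect();
        let haystack: Vec<char> = input.chars().collect();
        contains_seq(&needle, &haystack)
    }
}
```

With these definitions, `is_match(&Nfa::compile("abc"), "xxabcxx")` takes the byte path and `is_match(&Nfa::compile("é"), "café")` takes the char path. For a plain ASCII literal the byte path is also safe on non-ASCII input, because ASCII bytes never occur inside a multi-byte UTF-8 sequence. That guarantee does not carry over to constructs such as `.` or negated classes, where a byte-based matcher would consume a single byte of a multi-byte character, so a real implementation might additionally check `input.is_ascii()` before choosing the byte path.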