Replace Linear Scan of Character Class Items with O(1) Lookup #11

New issue

Closed

opened 2026-03-11 18:09:12 +03:00 by NiXTheDev · 0 comments

NiXTheDev commented

2026-03-11 18:09:12 +03:00

(Migrated from github.com)

Goal: Speed up matching of character classes by precomputing a membership table.

Currently, try_transition in engine.rs iterates through all items in a character class to check if a character matches. For large classes (e.g., [a-zA-Z0-9_]), this is inefficient.

Proposed Solution:
During NFA construction (or even earlier), convert each character class into a bitset (for ASCII) or a boolean lookup table using a [bool; 128] for ASCII, and for Unicode, use a RangeSet or a perfect hash. For simplicity, start with ASCII optimization.

Implementation Plan:

In nfa.rs, when processing Expr::CharacterClass, precompute a CharClassSet struct:

struct CharClassSet {
    negated: bool,
    ascii_lookup: [bool; 128], // true if char matches
    // maybe also store ranges for non-ASCII?
}

2. For each item, fill the lookup table accordingly.
3. In `Transition::CharClass`, store the precomputed set instead of the raw items.
4. In `try_transition`, use `if set.matches(c)` (O(1)) instead of linear scan.

**Considerations**:
- Handle Unicode: if pattern contains non-ASCII ranges, fall back to linear scan or implement a Unicode range set.
- Ensure negated classes are handled correctly.

**Benchmark**:
- Compare matching speed before and after on patterns with large character classes.

**Goal**: Speed up matching of character classes by precomputing a membership table. Currently, `try_transition` in `engine.rs` iterates through all items in a character class to check if a character matches. For large classes (e.g., `[a-zA-Z0-9_]`), this is inefficient. **Proposed Solution**: During NFA construction (or even earlier), convert each character class into a **bitset** (for ASCII) or a **boolean lookup table** using a `[bool; 128]` for ASCII, and for Unicode, use a `RangeSet` or a perfect hash. For simplicity, start with ASCII optimization. **Implementation Plan**: 1. In `nfa.rs`, when processing `Expr::CharacterClass`, precompute a `CharClassSet` struct: ```rust struct CharClassSet { negated: bool, ascii_lookup: [bool; 128], // true if char matches // maybe also store ranges for non-ASCII? } ``` 2. For each item, fill the lookup table accordingly. 3. In `Transition::CharClass`, store the precomputed set instead of the raw items. 4. In `try_transition`, use `if set.matches(c)` (O(1)) instead of linear scan. **Considerations**: - Handle Unicode: if pattern contains non-ASCII ranges, fall back to linear scan or implement a Unicode range set. - Ensure negated classes are handled correctly. **Benchmark**: - Compare matching speed before and after on patterns with large character classes.