Implement Memoization Cache in NFA Simulator to Prevent Exponential Blowup #4

Closed
opened 2026-03-11 18:02:30 +03:00 by NiXTheDev · 0 comments
NiXTheDev commented 2026-03-11 18:02:30 +03:00 (Migrated from github.com)

Goal: Improve worst-case performance for patterns with heavy backtracking (e.g., nested quantifiers and backreferences).

The current NFA simulation in engine.rs uses a straightforward backtracking approach for backreferences. This can lead to exponential runtime for certain patterns (ReDoS vulnerability).

Proposed Solution:
Add a memoization cache (often called "thread factoring") to the NfaSimulator. The cache stores results for (state_id, position) pairs, indicating whether that combination can lead to a match. When the same state and position are encountered again via a different path, the cached result can be reused, avoiding redundant work.
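To make the idea concrete, here is a minimal self-contained sketch that assumes nothing about the real `NfaSimulator`: a toy backtracker for the pathological pattern `(a|a)*`, where the outcome of each `(state, position)` pair is memoized so every pair is explored at most once.

```rust
use std::collections::HashMap;

// Toy two-state machine standing in for a compiled NFA (hypothetical,
// for illustration only).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum State {
    Loop,   // the (a|a)* loop: two identical alternatives cause the blowup
    Accept, // end of pattern
}

fn simulate(
    input: &[u8],
    state: State,
    pos: usize,
    memo: &mut HashMap<(State, usize), bool>,
    work: &mut u64, // counts state explorations to show the pruning effect
) -> bool {
    if let Some(&cached) = memo.get(&(state, pos)) {
        return cached; // same (state, pos) reached via another path: reuse
    }
    *work += 1;
    let result = match state {
        State::Accept => pos == input.len(),
        State::Loop => {
            (pos < input.len()
                && input[pos] == b'a'
                && (simulate(input, State::Loop, pos + 1, memo, work)
                    || simulate(input, State::Loop, pos + 1, memo, work)))
                || simulate(input, State::Accept, pos, memo, work)
        }
    };
    memo.insert((state, pos), result);
    result
}

fn main() {
    let mut input = vec![b'a'; 30]; // 30 a's ...
    input.push(b'b');               // ... then a byte that forces failure
    let mut memo = HashMap::new();
    let mut work = 0;
    let matched = simulate(&input, State::Loop, 0, &mut memo, &mut work);
    // Without the memo this input explores exponentially many paths;
    // with it, the number of explorations stays linear in the input length.
    assert!(!matched);
    assert!(work <= 2 * (input.len() as u64 + 1));
    println!("matched={matched}, explorations={work}");
}
```

The same lookup-then-store discipline carries over to the real simulator, though backreferences complicate what the cache key must contain (see below).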

Design Considerations:

  • The cache should be a HashMap<(StateId, usize), Option<...>> where the value indicates match success and possibly the best capture groups.
  • Because capture groups may differ, you might need to store multiple possibilities or use a "width" mechanism (like PCRE). Alternatively, you could store the best (longest) match.
  • This technique is similar to the "PikeVM" with caching, but with backtracking you must ensure correctness with backreferences.
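A possible shape for the cache, sketched with hypothetical types (`StateId`, `MatchResult`, and a `GroupSnapshot` of the capture-group configuration are all assumptions, not existing names in the codebase):

```rust
use std::collections::HashMap;

type StateId = usize;

// Capture-group configuration at the moment of caching: (start, end) per
// group, or None for unset groups. Hypothetical helper type.
#[derive(Clone, PartialEq, Eq, Hash)]
struct GroupSnapshot(Vec<Option<(usize, usize)>>);

// Hypothetical result type: where the match ended, plus the best (longest)
// capture spans, per the design notes above.
#[derive(Clone)]
struct MatchResult {
    end: usize,
    groups: Vec<Option<(usize, usize)>>,
}

struct NfaSimulator {
    // Including the group snapshot in the key means two visits to the same
    // (state, position) with different capture state are cached separately,
    // which sidesteps the correctness problem with backreferences at the
    // cost of a larger key and more cache entries.
    cache: HashMap<(StateId, usize, GroupSnapshot), Option<MatchResult>>,
}
```

Whether the snapshot belongs in the key is the central design trade-off: keying only on `(StateId, usize)` is cheaper but is sound only for states whose future behavior cannot depend on group contents.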

Implementation Steps:

  1. Add a cache: HashMap<(StateId, usize), Option<MatchResult>> to NfaSimulator.
  2. In run() and step_with_backrefs(), check the cache before processing a state.
  3. When a state leads to a match, store the result in the cache.
  4. Ensure cache invalidation if capture groups affect future matches (maybe key by group state too, but that's complex). Start simple: if a state/position pair is visited with the same group configuration, cache the outcome.
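Steps 2 and 3 boil down to a lookup-then-store wrapper around the existing transition logic. A runnable skeleton, assuming the simple `(StateId, usize)` key and using a placeholder `explore` in place of the real `step_with_backrefs` transitions:

```rust
use std::collections::HashMap;

type StateId = usize;

struct NfaSimulator {
    // Cached outcome per (state, position): the match end if that pair can
    // reach an accepting state, None otherwise.
    cache: HashMap<(StateId, usize), Option<usize>>,
}

impl NfaSimulator {
    fn run(&mut self, state: StateId, pos: usize, input: &[u8]) -> Option<usize> {
        // Step 2: consult the cache before processing the state.
        if let Some(cached) = self.cache.get(&(state, pos)) {
            return *cached;
        }
        let result = self.explore(state, pos, input);
        // Step 3: record the outcome so other paths can reuse it.
        self.cache.insert((state, pos), result);
        result
    }

    // Placeholder transition function (hypothetical): each state consumes
    // one 'a' and steps toward state 0, which accepts. The real engine
    // would dispatch on the state's instruction here.
    fn explore(&mut self, state: StateId, pos: usize, input: &[u8]) -> Option<usize> {
        if state == 0 {
            return Some(pos); // accepting state
        }
        if pos < input.len() && input[pos] == b'a' {
            self.run(state - 1, pos + 1, input)
        } else {
            None
        }
    }
}
```

Note that this key is only valid for a single search over a single input: the cache must be cleared (or a fresh simulator used) between searches, and per step 4, entries are only reusable when the group configuration at the visit matches the one cached.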

Testing:

  • Add tests with pathological patterns like (a*)* to ensure runtime remains reasonable.
  • Use benchmarks to measure performance improvements.
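A self-contained sketch of what such a regression test could assert, using a stand-in `find_match` for the engine's public entry point (the real test would call the engine itself): feed a classic ReDoS trigger and require the search to finish quickly.

```rust
use std::collections::HashMap;
use std::time::Instant;

// Stand-in for the engine: a memoized toy matcher for (a|a)* anchored to
// the end of the input. With memoization the run is linear in the input;
// without it, this input would take effectively forever.
fn find_match(input: &[u8]) -> bool {
    fn go(input: &[u8], pos: usize, memo: &mut HashMap<usize, bool>) -> bool {
        if let Some(&r) = memo.get(&pos) {
            return r;
        }
        let r = pos == input.len()
            || (input[pos] == b'a'
                && (go(input, pos + 1, memo) || go(input, pos + 1, memo)));
        memo.insert(pos, r);
        r
    }
    go(input, 0, &mut HashMap::new())
}

fn main() {
    // 10_000 'a's followed by 'b': a pathological non-matching input for
    // patterns like (a|a)* or (a*)*.
    let mut input = vec![b'a'; 10_000];
    input.push(b'b');
    let start = Instant::now();
    let matched = find_match(&input);
    assert!(!matched);
    // Guard against exponential regressions: the search must stay fast.
    assert!(start.elapsed().as_secs() < 1);
}
```

For the benchmark side, a criterion-style harness comparing cached vs. uncached runs on the same pathological inputs would quantify the improvement.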