Implement Memoization Cache in NFA Simulator to Prevent Exponential Blowup #4

Closed
opened 2026-03-11 18:02:30 +03:00 by NiXTheDev · 0 comments
NiXTheDev commented 2026-03-11 18:02:30 +03:00 (Migrated from github.com)

Goal: Improve worst-case performance for patterns with heavy backtracking (e.g., nested quantifiers and backreferences).

The current NFA simulation in engine.rs uses a straightforward backtracking approach for backreferences. This can lead to exponential runtime for certain patterns (ReDoS vulnerability).

Proposed Solution:
Add a memoization cache (often called "thread factoring") to the NfaSimulator. The cache stores results for (state_id, position) pairs, indicating whether that combination can lead to a match. When the same state and position are encountered again via a different path, the cached result can be reused, avoiding redundant work.
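To make the idea concrete, here is a minimal self-contained sketch that assumes nothing about the real `NfaSimulator`: a toy backtracker for the pathological pattern `(a|a)*`, where the outcome of each `(state, position)` pair is memoized so every pair is explored at most once.

```rust
use std::collections::HashMap;

// Toy two-state machine standing in for a compiled NFA (hypothetical,
// for illustration only).
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum State {
    Loop,   // the (a|a)* loop: two identical alternatives cause the blowup
    Accept, // end of pattern
}

fn simulate(
    input: &[u8],
    state: State,
    pos: usize,
    memo: &mut HashMap<(State, usize), bool>,
    work: &mut u64, // counts state explorations to show the pruning effect
) -> bool {
    if let Some(&cached) = memo.get(&(state, pos)) {
        return cached; // same (state, pos) reached via another path: reuse
    }
    *work += 1;
    let result = match state {
        State::Accept => pos == input.len(),
        State::Loop => {
            (pos < input.len()
                && input[pos] == b'a'
                && (simulate(input, State::Loop, pos + 1, memo, work)
                    || simulate(input, State::Loop, pos + 1, memo, work)))
                || simulate(input, State::Accept, pos, memo, work)
        }
    };
    memo.insert((state, pos), result);
    result
}

fn main() {
    let mut input = vec![b'a'; 30]; // 30 a's ...
    input.push(b'b');               // ... then a byte that forces failure
    let mut memo = HashMap::new();
    let mut work = 0;
    let matched = simulate(&input, State::Loop, 0, &mut memo, &mut work);
    // Without the memo this input explores exponentially many paths;
    // with it, the number of explorations stays linear in the input length.
    assert!(!matched);
    assert!(work <= 2 * (input.len() as u64 + 1));
    println!("matched={matched}, explorations={work}");
}
```

The same lookup-then-store discipline carries over to the real simulator, though backreferences complicate what the cache key must contain (see below).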

Design Considerations:

  • The cache should be a HashMap<(StateId, usize), Option<...>> where the value indicates match success and possibly the best capture groups.
  • Because capture groups may differ, you might need to store multiple possibilities or use a "width" mechanism (like PCRE). Alternatively, you could store the best (longest) match.
  • This technique is similar to the "PikeVM" with caching, but with backtracking you must ensure correctness with backreferences.
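A possible shape for the cache, sketched with hypothetical types (`StateId`, `MatchResult`, and a `GroupSnapshot` of the capture-group configuration are all assumptions, not existing names in the codebase):

```rust
use std::collections::HashMap;

type StateId = usize;

// Capture-group configuration at the moment of caching: (start, end) per
// group, or None for unset groups. Hypothetical helper type.
#[derive(Clone, PartialEq, Eq, Hash)]
struct GroupSnapshot(Vec<Option<(usize, usize)>>);

// Hypothetical result type: where the match ended, plus the best (longest)
// capture spans, per the design notes above.
#[derive(Clone)]
struct MatchResult {
    end: usize,
    groups: Vec<Option<(usize, usize)>>,
}

struct NfaSimulator {
    // Including the group snapshot in the key means two visits to the same
    // (state, position) with different capture state are cached separately,
    // which sidesteps the correctness problem with backreferences at the
    // cost of a larger key and more cache entries.
    cache: HashMap<(StateId, usize, GroupSnapshot), Option<MatchResult>>,
}
```

Whether the snapshot belongs in the key is the central design trade-off: keying only on `(StateId, usize)` is cheaper but is sound only for states whose future behavior cannot depend on group contents.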

Implementation Steps:

  1. Add a cache: HashMap<(StateId, usize), Option<MatchResult>> to NfaSimulator.
  2. In run() and step_with_backrefs(), check the cache before processing a state.
  3. When a state leads to a match, store the result in the cache.
  4. Ensure cache invalidation if capture groups affect future matches (maybe key by group state too, but that's complex). Start simple: if a state/position pair is visited with the same group configuration, cache the outcome.
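Steps 2 and 3 boil down to a lookup-then-store wrapper around the existing transition logic. A runnable skeleton, assuming the simple `(StateId, usize)` key and using a placeholder `explore` in place of the real `step_with_backrefs` transitions:

```rust
use std::collections::HashMap;

type StateId = usize;

struct NfaSimulator {
    // Cached outcome per (state, position): the match end if that pair can
    // reach an accepting state, None otherwise.
    cache: HashMap<(StateId, usize), Option<usize>>,
}

impl NfaSimulator {
    fn run(&mut self, state: StateId, pos: usize, input: &[u8]) -> Option<usize> {
        // Step 2: consult the cache before processing the state.
        if let Some(cached) = self.cache.get(&(state, pos)) {
            return *cached;
        }
        let result = self.explore(state, pos, input);
        // Step 3: record the outcome so other paths can reuse it.
        self.cache.insert((state, pos), result);
        result
    }

    // Placeholder transition function (hypothetical): each state consumes
    // one 'a' and steps toward state 0, which accepts. The real engine
    // would dispatch on the state's instruction here.
    fn explore(&mut self, state: StateId, pos: usize, input: &[u8]) -> Option<usize> {
        if state == 0 {
            return Some(pos); // accepting state
        }
        if pos < input.len() && input[pos] == b'a' {
            self.run(state - 1, pos + 1, input)
        } else {
            None
        }
    }
}
```

Note that this key is only valid for a single search over a single input: the cache must be cleared (or a fresh simulator used) between searches, and per step 4, entries are only reusable when the group configuration at the visit matches the one cached.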

Testing:

  • Add tests with pathological patterns like (a*)* to ensure runtime remains reasonable.
  • Use benchmarks to measure performance improvements.
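A self-contained sketch of what such a regression test could assert, using a stand-in `find_match` for the engine's public entry point (the real test would call the engine itself): feed a classic ReDoS trigger and require the search to finish quickly.

```rust
use std::collections::HashMap;
use std::time::Instant;

// Stand-in for the engine: a memoized toy matcher for (a|a)* anchored to
// the end of the input. With memoization the run is linear in the input;
// without it, this input would take effectively forever.
fn find_match(input: &[u8]) -> bool {
    fn go(input: &[u8], pos: usize, memo: &mut HashMap<usize, bool>) -> bool {
        if let Some(&r) = memo.get(&pos) {
            return r;
        }
        let r = pos == input.len()
            || (input[pos] == b'a'
                && (go(input, pos + 1, memo) || go(input, pos + 1, memo)));
        memo.insert(pos, r);
        r
    }
    go(input, 0, &mut HashMap::new())
}

fn main() {
    // 10_000 'a's followed by 'b': a pathological non-matching input for
    // patterns like (a|a)* or (a*)*.
    let mut input = vec![b'a'; 10_000];
    input.push(b'b');
    let start = Instant::now();
    let matched = find_match(&input);
    assert!(!matched);
    // Guard against exponential regressions: the search must stay fast.
    assert!(start.elapsed().as_secs() < 1);
}
```

For the benchmark side, a criterion-style harness comparing cached vs. uncached runs on the same pathological inputs would quantify the improvement.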