Replace HashMap with Dense Vector for Capture Group Storage #5

Closed
opened 2026-03-11 18:02:44 +03:00 by NiXTheDev · 0 comments
NiXTheDev commented 2026-03-11 18:02:44 +03:00 (Migrated from github.com)

Goal: Reduce memory usage and improve performance by optimizing how capture groups are stored in SimState.

Currently, each SimState contains a HashMap<u32, (usize, usize)> for group captures. For patterns with many groups and many simulation states, this is memory-intensive and slows down cloning.

Proposed Alternatives:

  1. Dense Vector: Store groups in a Vec<Option<(usize, usize)>> where index = group ID (1-based). This is cache-friendly and avoids hashing overhead. Group 0 is unused, so allocate length = max_group_id + 1.
  2. Group Snapshots: Instead of copying the entire group map for each new state, maintain a global "group snapshot" system where states reference a snapshot ID and only changes are stored. This is more complex but can drastically reduce memory.

Implementation Plan (Dense Vector):

  • In nfa.rs, compute the maximum group ID during NFA construction (including numbered and named groups).
  • In engine.rs, change SimState.groups to Vec<Option<(usize, usize)>>.
  • When cloning a state, clone the vector (which may still be heavy, but better than HashMap).
  • Update GroupStart and GroupEnd transitions to modify the vector at the appropriate index.
  • For named groups, maintain a separate mapping from name to ID, used in Match finalization.

Benchmark:

  • Compare memory usage and speed before and after on patterns with many groups.
**Goal**: Reduce memory usage and improve performance by optimizing how capture groups are stored in `SimState`. Currently, each `SimState` contains a `HashMap<u32, (usize, usize)>` for group captures. For patterns with many groups and many simulation states, this is memory-intensive and slows down cloning. **Proposed Alternatives**: 1. **Dense Vector**: Store groups in a `Vec<Option<(usize, usize)>>` where index = group ID (1-based). This is cache-friendly and avoids hashing overhead. Group 0 is unused, so allocate length = `max_group_id + 1`. 2. **Group Snapshots**: Instead of copying the entire group map for each new state, maintain a global "group snapshot" system where states reference a snapshot ID and only changes are stored. This is more complex but can drastically reduce memory. **Implementation Plan (Dense Vector)**: - In `nfa.rs`, compute the maximum group ID during NFA construction (including numbered and named groups). - In `engine.rs`, change `SimState.groups` to `Vec<Option<(usize, usize)>>`. - When cloning a state, clone the vector (which may still be heavy, but better than HashMap). - Update `GroupStart` and `GroupEnd` transitions to modify the vector at the appropriate index. - For named groups, maintain a separate mapping from name to ID, used in `Match` finalization. **Benchmark**: - Compare memory usage and speed before and after on patterns with many groups.
Sign in to join this conversation.
No description provided.