I’d appreciate help with ensuring thread safety and improving the efficiency of a global string interning map in Rust, particularly when using key-specific locks to manage concurrent access and prevent race conditions during string deduplication.
The Text module snippet shown below is designed to manage strings in a memory-efficient and performant way by deduplicating them. It uses a global interning map to store unique string instances, leveraging Arc for reference counting and DashMap for thread-safe, concurrent access. Key features are:
- Global Consistency: The interning map (STRINGS) ensures that every active Text corresponds to an entry in the map.
- String Deduplication: Only one instance of a given string exists in memory.
- Pointer-Based Equality and Fast Hashing: As there are no duplicates, string equality and hashing are based on the Arc memory address, providing extremely fast comparisons and hash insertions/lookups.
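To make the pointer-based equality and hashing point concrete, here is a minimal std-only sketch. The `Handle` type and the `hash_of` and `demo` helpers are hypothetical names used only for illustration; `Handle` stands in for `Text`. It shows that two `Arc<String>` values with identical contents are unequal by address unless they share one allocation, which is why the scheme relies on deduplication holding at all times.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Hypothetical stand-in for `Text`: equality and hashing use the Arc address.
#[derive(Clone)]
struct Handle(Arc<String>);

impl PartialEq for Handle {
    fn eq(&self, other: &Self) -> bool {
        // Compares allocations, not string contents.
        Arc::ptr_eq(&self.0, &other.0)
    }
}
impl Eq for Handle {}

impl Hash for Handle {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash the address of the shared allocation.
        state.write_usize(Arc::as_ptr(&self.0) as usize);
    }
}

/// Helper (hypothetical): hash a handle with the standard library's default hasher.
fn hash_of(h: &Handle) -> u64 {
    let mut hasher = DefaultHasher::new();
    h.hash(&mut hasher);
    hasher.finish()
}

/// Small usage walkthrough; call it from `main` to run the checks.
fn demo() {
    let a = Handle(Arc::new(String::from("hello")));
    let b = a.clone(); // same allocation as `a`
    let c = Handle(Arc::new(String::from("hello"))); // same text, new allocation

    assert!(a == b);
    assert_eq!(hash_of(&a), hash_of(&b));
    assert!(a != c); // not deduplicated, so pointer equality fails
    assert_eq!(*a.0, *c.0); // even though the contents match
}
```

Without the interning step, `a` and `c` would be treated as different keys in any hash map keyed by `Handle`, so every constructor has to go through the interner.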
The challenge is keeping the interning map (STRINGS) thread-safe and efficient while preventing duplicate strings in a multi-threaded context. When a Text instance is dropped, its associated string should be removed from the interning map (not necessarily immediately), but only when no other references exist. However, there could be race conditions, since other threads could clone the Arc during the Drop process, creating a reference-count mismatch. I considered using a queue to manage removals but found it insufficient to resolve the race conditions (although it could help with other issues, such as removing a large number of strings).
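For comparison, one way to sidestep the count-then-remove race is sketched below. This is not the DashMap design from this post but a swapped-in, std-only alternative under stated assumptions: a single `Mutex<HashMap>` guards the pool, and the pool stores `Weak` handles so it never contributes to the strong count. The names `POOL`, `Interned`, and `pool_len` are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::{Arc, LazyLock, Mutex, Weak};

// Hypothetical pool: maps string contents to a weak handle on the shared allocation.
static POOL: LazyLock<Mutex<HashMap<String, Weak<String>>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

#[derive(Debug, Clone)]
pub struct Interned(Arc<String>);

impl Interned {
    pub fn new(s: &str) -> Self {
        let mut pool = POOL.lock().unwrap();
        // Upgrading happens under the pool lock, so it cannot interleave
        // with the check-and-remove step in `Drop`.
        if let Some(existing) = pool.get(s).and_then(Weak::upgrade) {
            return Interned(existing);
        }
        // No live allocation: create one and (re)register it,
        // overwriting any stale weak entry.
        let arc = Arc::new(s.to_owned());
        pool.insert(s.to_owned(), Arc::downgrade(&arc));
        Interned(arc)
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }

    pub fn ptr_eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Drop for Interned {
    fn drop(&mut self) {
        let mut pool = POOL.lock().unwrap();
        // A count of 1 means `self` is the only strong handle. New strong
        // handles can only come from cloning an existing `Interned` or from
        // `upgrade` under this lock, so nobody can revive the string here.
        if Arc::strong_count(&self.0) == 1 {
            pool.remove(self.0.as_str());
        }
    }
}

/// Number of entries currently registered (for inspection only).
pub fn pool_len() -> usize {
    POOL.lock().unwrap().len()
}
```

One caveat of this sketch: if the last two handles are dropped concurrently, each may observe a count of 2 and skip the removal, leaving a dead `Weak` entry behind. That entry is harmless for deduplication, because `new` fails to upgrade it and replaces it, but it is not reclaimed until then. The single global mutex also trades the per-key locking of the approach in this post for simplicity.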
My current approach is to use key-specific locks (KEY_LOCKS) to ensure atomic access during Text creation, access and deletion of each string. I'd appreciate any feedback or advice on:
- Correctness: Is this implementation sound from a multi-threaded perspective?
- Efficiency: Can this approach be improved, especially regarding the overhead of key-specific locks?
This is the relevant code from the module:
use dashmap::{DashMap, DashSet};
use std::{
    sync::{Arc, LazyLock, Mutex},
    hash::{Hash, Hasher},
};

static STRINGS: LazyLock<DashSet<Arc<String>>> = LazyLock::new(|| DashSet::new());
static KEY_LOCKS: LazyLock<DashMap<Arc<String>, Arc<Mutex<()>>>> = LazyLock::new(|| DashMap::new());

fn get_key_lock(key: Arc<String>) -> Arc<Mutex<()>> {
    KEY_LOCKS
        .entry(key)
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

#[derive(Debug, Clone)]
pub struct Text(Arc<String>);

impl Text {
    fn create_or_get_text(arc: Arc<String>) -> Self {
        // Serialize creation and removal for this particular string.
        let key_lock = get_key_lock(arc.clone());
        let _lock = key_lock.lock().unwrap();
        if let Some(existing) = STRINGS.get(&arc) {
            // Reuse the already-interned Arc so that pointer equality holds.
            Text(existing.key().clone())
        } else {
            STRINGS.insert(arc.clone());
            Text(arc)
        }
    }
}
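
impl Drop for Text {
    fn drop(&mut self) {
        let key_lock = get_key_lock(self.0.clone());
        let _lock = key_lock.lock().unwrap();
        // Expected count of 3, as explained in the notes: one reference in
        // STRINGS, one in `self`, and one via KEY_LOCKS.
        if Arc::strong_count(&self.0) == 3 {
            STRINGS.remove(&self.0);
        }
    }
}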

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let addr = Arc::as_ptr(&self.0) as usize as u64;
        state.write_u64(addr);
    }
}

impl PartialEq for Text {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for Text {}

Notes:
- STRINGS is used for fast lookups and deduplication of Text instances.
- KEY_LOCKS is intended to ensure thread-safe access when creating, accessing, or dropping Text objects.
- STRINGS and KEY_LOCKS are maintained as separate structures mainly so that I don't have the overhead of a mutex for every interned string.
- Drop semantics: When a Text instance is dropped, it checks whether its corresponding entry should be removed from the global interning map. The Arc reference count (strong_count) is tested against 3: when an entry is ready to be discarded from the interning map, drop is passed a Text with 2 references, one held by the STRINGS interning map and one held by the Text instance being dropped. An additional reference is created during the drop function when using KEY_LOCKS.
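The reference-count arithmetic in the last note can be traced with a small std-only sketch. It uses `HashSet` and `HashMap` as stand-ins for `DashSet` and `DashMap`, on the assumption that DashMap's entry API handles the passed-in key the same way the standard library does; the function name `count_walkthrough` is hypothetical.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};

/// Returns the strong counts observed after each step of the walkthrough.
fn count_walkthrough() -> [usize; 4] {
    // Stand-ins for the global STRINGS and KEY_LOCKS structures.
    let mut strings: HashSet<Arc<String>> = HashSet::new();
    let mut key_locks: HashMap<Arc<String>, Arc<Mutex<()>>> = HashMap::new();

    // The handle that a `Text` would own.
    let text = Arc::new(String::from("hello"));
    let after_create = Arc::strong_count(&text);

    // Interning stores one clone in the set.
    strings.insert(Arc::clone(&text));
    let after_intern = Arc::strong_count(&text);

    // The first key-lock lookup stores the cloned key in the lock map.
    key_locks
        .entry(Arc::clone(&text))
        .or_insert_with(|| Arc::new(Mutex::new(())));
    let after_first_lock = Arc::strong_count(&text);

    // A later lookup finds the entry occupied; the temporary key clone is
    // dropped again, so the count returns to its previous value.
    key_locks
        .entry(Arc::clone(&text))
        .or_insert_with(|| Arc::new(Mutex::new(())));
    let after_second_lock = Arc::strong_count(&text);

    [after_create, after_intern, after_first_lock, after_second_lock]
}
```

In this model the third reference is the key that stays in the lock map after the first lookup, so the count seen by the last handle depends on whether that key is a clone of the same Arc as the interned one.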
arcstr. It doesn't do interning specifically, but it aims to be a "better Arc<str> or Arc<String>".
Arcs, while strings are completely out of equation.
Strings with identical content to this interner does not deduplicate them.