1

I’d appreciate help with ensuring thread safety and improving the efficiency of a global string interning map in Rust, particularly when using key-specific locks to manage concurrent access and prevent race conditions during string deduplication.

The Text module snippet shown below is designed to manage strings in a memory-efficient and performant way by deduplicating them. It uses a global interning map to store unique string instances, leveraging Arc for reference counting and DashMap for thread-safe, concurrent access. Key features are:

  • Global Consistency: The interning map (STRINGS) ensures that every active Text corresponds to an entry in the map.
  • String Deduplication: Only one instance of a given string exists in memory.
  • Pointer-Based Equality and Fast Hashing: As there are no duplicates, string equality and hashing are based on the Arc memory address, providing extremely fast comparisons and hash insertions/lookups.

The challenge is ensuring the interning map (STRINGS) remains thread-safe and efficient while preventing duplicate strings in a multi-threaded context. When a Text instance is dropped, its associated string should be removed from the interning map (not necessarily immediately) only when no other references exist. However, there could be race conditions as other threads could clone the Arc during the Drop process, creating a reference-count mismatch. I considered using a queue to manage removals but found it insufficient to resolve race conditions (although could help with other issues like removing a large number of strings).

My current approach is to use key-specific locks (KEY_LOCKS) to ensure atomic access during Text creation, access and deletion of each string. I'd appreciate any feedback or advice on:

  • Correctness: Is this implementation sound from a multi-threaded perspective?
  • Efficiency: Can this approach be improved, especially regarding the overhead of key-specific locks?

This is the relevant code from the module:

use dashmap::{DashMap, DashSet};
use std::{
    sync::{Arc, LazyLock, Mutex},
    hash::{Hash, Hasher},
};

static STRINGS: LazyLock<DashSet<Arc<String>>> = LazyLock::new(|| DashSet::new());

static KEY_LOCKS: LazyLock<DashMap<Arc<String>, Arc<Mutex<()>>>> = LazyLock::new(|| DashMap::new());

fn get_key_lock(key: Arc<String>) -> Arc<Mutex<()>> {
    KEY_LOCKS
        .entry(key)
        .or_insert_with(|| Arc::new(Mutex::new(())))
        .clone()
}

#[derive(Debug, Clone)]
pub struct Text(Arc<String>);

impl Text {
    fn create_or_get_text(arc: Arc<String>) -> Self {
        let key_lock = get_key_lock(arc.clone());
        let _lock = key_lock.lock().unwrap();

        if let Some(existing) = STRINGS.get(&arc) {
            existing.clone()
        } else {
            STRINGS.insert(arc.clone());
            Text(arc)
        }
    }
}

impl Drop for Text {
    fn drop(&mut self) {
        let key_lock = get_key_lock(self.0.clone());
        let _lock = key_lock.lock().unwrap();
        if Arc::strong_count(&self.0) == 2 {
            STRINGS.remove(&self.0);
        }
    }
}

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let addr = Arc::as_ptr(&self.0) as usize as u64;
        state.write_u64(addr);
    }
}

impl PartialEq for Text {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for Text {}

Notes:

STRINGS is used for fast lookups and deduplication of Text instances.

KEY_LOCKS is intended to ensure thread-safe access for creating, accessing, or dropping Text objects.

STRINGS and KEY_LOCKS are maintained as separate structures mainly so I don't have the overhaad of a mutex for every interned string.

Drop Semantics: When a Text instance is dropped, it check to see if it should be removed its corresponding entry from the global interning map. The Arc reference count (strong_count) is being tested to be 3: When ready to discard from the interning map, drop will be passed a Text with 2 references: one is held by the STRINGS interning map and one reference is held by the Text instance being dropped. An additional reference is created during the drop function when using KEY_LOCKS.

5
  • Consider taking a look at arcstr. It doesn't do interning specifically, but it aims to be a "better Arc<str> or Arc<String>". Commented Dec 20, 2024 at 16:35
  • I don't see how this is interning at all. All you achieve is deduplicating Arcs, while strings are completely out of equation. Commented Dec 20, 2024 at 22:34
  • @IvanC I'm curious what definition of interning you are using, because the approach given here meets every definition I can find. I suppose you could argue that it's not "true" interning if it's not automatic across every string in the application, but even e.g. .NET's interning is (at least sometimes) opt-in just like this approach. Commented Dec 21, 2024 at 1:15
  • @cdhowie what I mean is that passing two Strings with identical content to this interner does not de duplicate them. Commented Dec 21, 2024 at 5:05
  • @IvanC What? Yes it does. Commented Dec 21, 2024 at 6:42

1 Answer 1

2

It seems like you should be able to remove the mutexes altogether if you use remove_if to handle eviction. Under the hood, this takes out a write lock on the shard containing the bucket and holds it until the entire operation completes, which makes concurrent access to the set-owned Arc impossible. Therefore, you should be able to do something like this:

impl Drop for Text {
    fn drop(&mut self) {
        STRINGS.remove_if(&self.0, |s| Arc::strong_count(s) == 2);
    }
}

However, after making this change (and removing the mutexes) there is a race condition on insertion: if two calls to create_or_get_text are made concurrently and both do not find the string present, they will both try to insert it, resulting in two Text values that contain equal but different Strings.

We can correct this by detecting the case where insertion fails and restarting the operation. In other words, we'll try getting the string from the map, then try inserting it, then start over until one of the two succeeds. This is essentially optimistic locking.

fn create_or_get_text(arc: Arc<String>) -> Self {
    Self(loop {
        if let Some(existing) = STRINGS.get(&arc) {
            break existing.clone();
        }
        if STRINGS.insert(arc.clone()) {
            break arc;
        }
    })
}

Side note: you can make your Hash implementation more resilient against different pointer sizes by just hashing the pointer itself.

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        Arc::as_ptr(&self.0).hash(state);
    }
}

Complete code after the above changes:

use dashmap::DashSet;
use std::{
    hash::{Hash, Hasher},
    sync::{Arc, LazyLock},
};

static STRINGS: LazyLock<DashSet<Arc<String>>> = LazyLock::new(DashSet::new);

#[derive(Debug, Clone)]
pub struct Text(Arc<String>);

impl Text {
    fn create_or_get_text(arc: Arc<String>) -> Self {
        Self(loop {
            if let Some(existing) = STRINGS.get(&arc) {
                break existing.clone();
            }
            if STRINGS.insert(arc.clone()) {
                break arc;
            }
        })
    }
}

impl Drop for Text {
    fn drop(&mut self) {
        STRINGS.remove_if(&self.0, |s| Arc::strong_count(s) == 2);
    }
}

impl Hash for Text {
    fn hash<H: Hasher>(&self, state: &mut H) {
        Arc::as_ptr(&self.0).hash(state);
    }
}

impl PartialEq for Text {
    fn eq(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for Text {}
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.