
Writing a HashMap in Rust without unsafe

In my last post, I introduced a new project of mine: Bud (BudLang). Its goal was simple: explore how fast a #![forbid(unsafe_code)] language written in Rust could be. Earlier this week, I was working on it again and thought about how I wanted loops to work.

One problem with having no unsafe code is dealing with the lifetimes of the Value types in the virtual machine. I cannot store borrowed data in the virtual machine, because it would introduce lifetime requirements, and the only way to iterate over Rust's built-in HashMap requires borrowing data. The only way I could think of to create iterators for the Map type in Bud was to collect the HashMap's keys into a Vec. But that requires an extra allocation and cloning of the data.
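
For illustration, here is roughly the pattern I wanted to avoid. This is only a sketch, not Bud's actual code; the `Value` stand-in and the function name are hypothetical:

use std::collections::HashMap;

// A hypothetical stand-in for Bud's value type, purely for illustration.
#[derive(Clone)]
struct Value;

fn keys_for_iteration(map: &HashMap<String, Value>) -> Vec<String> {
    // `map.keys()` yields `&String`s that borrow from the map, so the virtual
    // machine can't hold on to them without a lifetime. The workaround is to
    // clone every key into a new Vec: an extra allocation plus a clone per key.
    map.keys().cloned().collect()
}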

I realized what I wanted was a little different: an indexable hash map. Rust's standard library has a sorted map, BTreeMap, but it has completely different performance and usage implications than a hash map.

I searched and found a popular crate, indexmap, that would solve my need perfectly. I couldn't bring myself to use it, however, as one of the things I loved about the budlang crate is that it has no dependencies. Additionally, since the primary goal of Bud was to not use any unsafe, bringing in indexmap felt against the spirit of what I wanted to accomplish with Bud.

So, I set out to see how fast an indexable hash map implementation that uses no unsafe could be.

But you are still using unsafe because std uses it! Why avoid a single, well-trusted dependency if it uses unsafe?

This is a valid question. When I adopted Rust in 2019, one of the reasons I felt it was well designed was the concept of unsafe. Anyone who has built complex systems knows the benefits of writing a solid foundation and abstracting atop that solid foundation. That's exactly what unsafe allows: encapsulating code where you take over for Rust's safety guarantees. This design decision means that end users of most Rust libraries rarely, if ever, need to use unsafe themselves. Yet, as a Rust user, I know the power is there if I ever need it.

Despite loving the language feature, I see it as a double-edged sword. As I was learning Rust, I saw many newer Rustaceans reach for unsafe in basic situations, for example using slice::get_unchecked instead of slice::get to avoid the extra bounds check. In some cases, that might be a good optimization. But in most of the code I've personally profiled and optimized, I found that removing those checks rarely changed performance.
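
For anyone unfamiliar with that pair of APIs, here is a minimal comparison (the data is made up):

fn main() {
    let values = [1u32, 2, 3];

    // Safe: returns None when the index is out of bounds.
    let checked: Option<&u32> = values.get(1);

    // Unsafe: skips the bounds check entirely. Passing an out-of-bounds index
    // here would be undefined behavior, which is why an `unsafe` block is
    // required.
    let unchecked: &u32 = unsafe { values.get_unchecked(1) };

    assert_eq!(checked, Some(unchecked));
}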

How can removing extra checks not yield faster code? Optimizers are scary good these days, and they keep getting better. In many cases, the extra bounds checks are already being optimized away by the compiler. It's because of the power of optimizers that I hope most Rust users never need to reach for unsafe when optimizing their own code.
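
As a small sketch of what I mean, the indexing in a loop like this looks like it pays a bounds check on every iteration, but the compiler can often prove the index is in range and remove the check (and an iterator-based version sidesteps the question entirely):

// The index is always less than `values.len()`, so the optimizer can often
// elide the per-iteration bounds check on `values[index]`.
fn sum(values: &[u32]) -> u32 {
    let mut total = 0;
    for index in 0..values.len() {
        total += values[index];
    }
    total
}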

To put that hope to the test: Can a completely safe implementation of an indexable hash map compete on performance with indexmap, which is powered by hashbrown?

Surprisingly, the answer is: yes!

How does an indexable hash map work?

In a typical indexable hash map, the entries are stored in a list separate from a hash map that allows retrieving the index of an entry from a given key. A naive approach would look something like this:

pub struct IndexMap<Key, Value> {
  entries: Vec<(Key, Value)>,
  by_key: HashMap<Key, usize>,
}

This approach has one glaring downside: it requires storing each Key twice: once in the entries list and once in the HashMap. This may not sound like a huge deal, but consider the effect when using a String as a key: a separate copy of each heap-allocated String would be needed. Since this would yield extremely bad performance, I decided to write my own hash map from the ground up.

Hash maps are relatively simple data structures to understand. At their core is a list of "bins". When attempting to insert a key/value pair or find a key, the key is hashed to produce an integer value. The integer is converted to an index into the list of bins (traditionally using modulus or bit masking).

The complexity arises when two entries end up needing the same bin. These collisions can happen for two reasons, even with distinct keys: 1) two distinct keys can produce the same hash, and 2) two distinct hashes can convert to the same bin index.
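
As a rough sketch of that hashing-and-masking step (the function and parameter names here are illustrative, not BudMap's internals, and a power-of-two bin count is assumed so that masking is equivalent to a modulus):

use std::hash::{BuildHasher, Hash, Hasher};

// Hashes a key and reduces the 64-bit hash to an index into the list of bins.
// `bin_count` is assumed to be a power of two, making `bin_count - 1` a mask.
fn bin_for_key<K: Hash, S: BuildHasher>(key: &K, hash_builder: &S, bin_count: usize) -> usize {
    let mut hasher = hash_builder.build_hasher();
    key.hash(&mut hasher);
    let hash = hasher.finish();
    // Distinct keys can still land in the same bin: their full hashes can
    // collide, or two different hashes can share the same masked bits.
    (hash as usize) & (bin_count - 1)
}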

How does BudMap work?

I went through many design iterations, and here is my current set of data structures:

pub struct BudMap<Key, Value, HashBuilder = RandomState> {
    /// A dense list of all entries in this map. This is where the actual
    /// key/value data is stored.
    entries: Vec<RawEntry<Key, Value>>,
    /// A list of bins that is at least as big as the bin count
    /// (`bin_mask.into_count() + 1`). All entries beyond the bin count are
    /// colliding entries.
    bins: Vec<Bin>,
    /// A bit mask that can be applied to a hash to produce an index into
    /// `bins`. This means all valid bin counts are powers of 2.
    bin_mask: BinMask,
    /// The index inside of `bins` of the last collision that was freed. This is
    /// the start of a singly-linked list that points to all currently free
    /// collision bins.
    free_collision_head: OptionalIndex,
    /// The [`BuildHasher`] implementor via which keys are hashed.
    hash_builder: HashBuilder,
}

/// A single key/value entry in a [`BudMap`].
struct RawEntry<Key, Value> {
    /// The computed hash of `key`.
    hash: u64,
    /// The entry's key.
    key: Key,
    /// The entry's value.
    value: Value,
}

/// A hash-map bin.
struct Bin {
    /// The index inside of the [`BudMap::entries`] vec, if present.
    entry_index: OptionalIndex,
    /// If present, an index inside of [`BudMap::bins`] for the next bin that
    /// collides with this bin. This forms a singly-linked list of bins.
    collision_index: OptionalIndex,
}

/// A wrapper for a `usize` which reserves `usize::MAX` as a marker indicating
/// None. `std::mem::size_of::<Option<usize>>()` is 2x usize, while
/// `size_of::<OptionalIndex>()` remains 1x usize.
struct OptionalIndex(usize);

/// A bitmask for a 2^n quantity of bins.
struct BinMask(usize);
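
For illustration, the sentinel trick that OptionalIndex's doc comment describes might be implemented along these lines (a sketch, not necessarily the exact code):

impl OptionalIndex {
    /// `usize::MAX` can never be a valid `Vec` index, so it is reserved as the
    /// `None` marker.
    const NONE: Self = Self(usize::MAX);

    fn new(index: usize) -> Self {
        debug_assert!(index != usize::MAX);
        Self(index)
    }

    /// Converts back into a real `Option` for ergonomic use at call sites.
    fn get(&self) -> Option<usize> {
        if self.0 == usize::MAX {
            None
        } else {
            Some(self.0)
        }
    }
}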

In this design, I avoid the Key duplication problem by building my own hash map where Bin::entry_index is used to find the entry as needed.

Collisions cause extra bins to be appended to the end of the bins list. By using the same storage buffer, we avoid a significant number of the extra allocations that other collision strategies ended up requiring. The bin_mask identifies how many entries in the Vec<Bin> are used as normal bins.

Because the entries are simply stored in a Vec, guaranteeing iteration order and accessing entries by index are efficient and simple. To perform a lookup by key, the key is hashed and then masked using bin_mask, and the bin at the resulting index is loaded. If the key of the RawEntry associated with that Bin is the one being searched for, the RawEntry's value can be returned. If it isn't the correct key, the bin at Bin::collision_index is checked. The process repeats until either the correct key is found or Bin::collision_index is None.
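
In code, that lookup might look roughly like the following sketch. It is simplified from the description above and reuses the hypothetical OptionalIndex::get helper from the earlier sketch, so it should not be read as BudMap's exact source:

use std::hash::{BuildHasher, Hash, Hasher};

impl<Key: Hash + Eq, Value, HashBuilder: BuildHasher> BudMap<Key, Value, HashBuilder> {
    fn get_sketch(&self, key: &Key) -> Option<&Value> {
        // Hash the key, then mask the hash down to a bin index.
        let mut hasher = self.hash_builder.build_hasher();
        key.hash(&mut hasher);
        let hash = hasher.finish();
        let mut bin_index = (hash as usize) & self.bin_mask.0;

        loop {
            let bin = &self.bins[bin_index];
            // An empty bin means the key is not present.
            let entry_index = bin.entry_index.get()?;
            let entry = &self.entries[entry_index];
            // Comparing the stored hash first avoids full key comparisons for
            // most mismatches.
            if entry.hash == hash && &entry.key == key {
                return Some(&entry.value);
            }
            // Follow the collision chain; `None` means the chain is exhausted.
            bin_index = bin.collision_index.get()?;
        }
    }
}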

How does BudMap perform?

There are countless ways to measure hash map performance. I wanted to look at three common cases: retrieving a value out of a map of a given size, filling an empty map with x entries, and removing and then reinserting a record in a map of a given size. These measurements should give a decent indicator of overall performance.

These benchmarks are all run with link-time optimization enabled, and every map type uses the fnv crate for hashing to keep hashing consistent across benchmark runs. A seeded random number generator is used to generate random u32 keys and u32 values.
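
As a rough idea of the setup, the data generation could look something like this (a sketch; the seed, sizes, and benchmark harness here are placeholders rather than the actual benchmark code):

use fnv::FnvBuildHasher;
use rand::{RngCore, SeedableRng};
use std::collections::HashMap;

/// Generates deterministic pseudo-random key/value pairs from a fixed seed so
/// that every benchmark run sees the same data.
fn test_data(count: usize) -> Vec<(u32, u32)> {
    let mut rng = rand::rngs::StdRng::seed_from_u64(0);
    (0..count).map(|_| (rng.next_u32(), rng.next_u32())).collect()
}

/// Fills a std HashMap using fnv for hashing; each map type under test is
/// given the same deterministic hasher.
fn fill_std(data: &[(u32, u32)]) -> HashMap<u32, u32, FnvBuildHasher> {
    let mut map = HashMap::with_hasher(FnvBuildHasher::default());
    for &(key, value) in data {
        map.insert(key, value);
    }
    map
}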

Even though HashMap isn't a viable option for my use case, I included it as a reference point.

get by key results

The above benchmark tests retrieving an entry by key at various map sizes. BudMap actually outperforms IndexMap until the map size reaches 5,000 keys. As the map gains more keys, the performance scales roughly in line with IndexMap, although IndexMap is a clear winner with the larger maps.

remove and reinsert results

The second benchmark removes an entry by key and then reinserts it into maps of various sizes. BudMap outperforms IndexMap across the board.

fill results

This final benchmark is designed to test how fast "rebinning" happens. It starts with an empty map with the default capacity and inserts X rows. The reported time is the total time to insert all of the records. Because the time grows with each measurement, I opted to display this graph on a logarithmic scale; otherwise, it would be impossible to glean any useful information from the smaller measurements.

In this benchmark, BudMap stays neck and neck with IndexMap until the 50,000 entry benchmark. After that, BudMap becomes slower than IndexMap -- up to 50.3% slower.

To attempt to summarize my general findings: BudMap appears to be as fast as or faster than IndexMap on maps with fewer than 10,000 keys, while IndexMap generally outperforms BudMap on larger maps. Given my anticipated use cases for Bud, this meets my goals.

Another observation is complexity. BudMap is implemented in a single file containing just 638 lines of code (LOC). It also has an additional 256 LOC of tests in another file, which, to be honest, is not enough. IndexMap contains over 5k LOC and depends on hashbrown, which has another 10k LOC. BudMap doesn't implement a Set type (yet?) and is missing many APIs that the other crates have. For many users of Bud, my choice to roll my own rather than bring in an extra dependency will yield better compile times.

My implementation is able to remain so simple because Vec provides the heavy lifting for me. This is what I love about Rust -- I was able to create a completely memory safe indexed hash map, using only Rust's standard library and no unsafe code, that does a reasonable job of competing against an unsafe-wielding alternative.

My takeaways

There are some problems for which unsafe is unavoidable. Outside of those, I prefer to find approaches that don't require unsafe. In general, I find that I can write code that performs similarly to the unsafe version, without the mental overhead of verifying that all of my unsafe operations avoid undefined behavior.

However, this experiment also shows that as maps grow to larger sizes, the unsafe-utilizing libraries are able to achieve better performance.

I find that the most impactful optimizations are usually algorithmic in nature. It's rare to find that removing a single bounds check will drastically increase the performance of the measured code. You're more likely to find significant performance gains by changing the approach to the problem than by doing micro-optimizations.

If one of Bud's primary goals wasn't to avoid unsafe, I would have used indexmap in a heartbeat. It's wonderful being able to use a highly optimized library that has been battle-hardened and reviewed by many sets of eyes. Because it is a great, well-maintained crate, I consider Bud unique in its "need" of an alternate crate and don't currently plan on releasing BudMap separately from Bud.

How goes BonsaiDb's storage rewrite?

As some people who will be reading this are primarily interested in my main project, BonsaiDb, I wanted to provide a brief update on that front. I uploaded a rewrite of my new write-ahead log two weeks ago. Since then, I have been reworking Sediment to use it.

I ended up starting fresh on that attempt earlier this week because I decided my first approach wasn't ideal. It was during that context switch that I decided to tackle something quick on Bud... "quick" -- you can guess what I spent the rest of the week working on!

Despite getting distracted by this project, I'm excited to finish hooking up the WAL to Sediment and to get new benchmark results. Even after I finish this phase, there will still be one more significant chunk of work to allow Nebari to reap the rewards of the WAL. My current goal is to have the new storage layer hooked up to BonsaiDb by the end of the year.