Introduction
In our exploration of DBMS, we have learned about storage architectures and how pages are stored on disk. We also explored how DBMS manages memory using a buffer pool manager.
Now we are at a crucial point: we will start discussing the connection between pages in the buffer pool and the execution engine, which data structures are used, and how they are used to fit the goals of the DBMS.
We will focus on two important data structures used by DBMS: Hash Tables and B+ Trees, which we can build on top of our tables for table indexes, auxiliary data structures, joins, and more.
Disclaimer: Explaining such data structures and their details in a written format is very hard, as it requires visuals to help illustrate the picture. That's why I encourage you to watch the lecture and do your own research after reading the article to get the most out of the topic.
In this article, we are going to focus on Hash Tables, exploring their components and the various techniques used for implementing hashing schemes.
Data Structures in DBMS
DBMS utilizes various data structures for different parts of its internal operations. Some examples include:
Internal Meta-Data: This type of data keeps track of essential information about the database and the system state. Examples include page tables and page directories, which help manage memory and keep track of data storage.
Core Data Storage: Data structures serve as the foundation for storing tuples (data entries) in the database. These structures ensure efficient storage and retrieval of data.
Temporary Data Structures: DBMS can construct data structures on the fly during query processing to enhance execution speed. For instance, hash tables can be dynamically built to facilitate joins between tables.
Table Indices: Auxiliary data structures are employed to simplify the retrieval of specific tuples from tables. These indices enhance the efficiency of searching and accessing data.
Design Decisions for DBMS Data Structures
When implementing data structures for a DBMS, two primary design considerations are taken into account:
Data Organization: Determining the memory layout and deciding which information to store within the data structure to enable efficient data access.
Concurrency: Ensuring that multiple threads can access the data structure simultaneously without causing conflicts or data inconsistencies.
Hash Table
A hash table implements an unordered associative array abstract data type that maps keys to values. In this article, we don't care about the types of the keys or values, but we assume that all keys and values are of fixed length.
The hash table data structure provides:
Space Complexity: O(n) on paper. In practice, the actual size may be 2n to 4n, because we may allocate 2x to 4x the number of keys we want to store, depending on the hashing scheme we use.
Time Complexity: O(1) in the average case, O(n) in the worst case (if every insert collides, we may have to search the whole table to find an empty slot).
Note that even with O(1) operation complexity on average, there are constant factor optimizations that are important to consider in the real world.
A hash table implementation is comprised of two parts:
Hash Function
Hashing Scheme
Hash Function: This tells us how to map a large key space into a smaller domain. It is used to compute an index into an array of buckets or slots. We need to consider the trade-off between fast execution and collision rate.
On one extreme, we have a hash function that always returns a constant (very fast, but everything is a collision). On the other extreme, we have a “perfect” hashing function where there are no collisions but would take extremely long to compute. The ideal design is somewhere in the middle.
A collision is what occurs when two keys are hashed to the same index in a hash table.
Hashing Scheme: This tells us how to handle key collisions after hashing. Here, we need to consider the trade-off between allocating a large hash table to reduce collisions and having to execute additional instructions when a collision occurs.
Hash Function
The hash function is a computational algorithm that takes any given key as input and produces an integer representation known as the "hash."
In a DBMS, as in any other system, the function’s output must be deterministic (the same key should always generate the same hash output). However, a DBMS does not need a cryptographically secure hash function like SHA-256, because it does not need to worry about protecting the contents of keys; these hash functions are used internally by the DBMS, and thus no information is leaked outside of the system.
There are a number of hash functions out there, such as MurmurHash, Google CityHash (and its newer version, FarmHash), and Facebook XXHash. The current state-of-the-art hash function is Facebook's XXHash3.
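As an illustration, here is a minimal Python sketch of FNV-1a, a simple, deterministic, non-cryptographic hash function; the 32-bit constants are the standard FNV parameters, and treating the offset basis as a "seed" is an illustrative variant (real systems prefer much faster functions like XXHash3).

```python
def fnv1a(data: bytes, seed: int = 0x811C9DC5) -> int:
    """32-bit FNV-1a: fast, deterministic, and non-cryptographic.

    The default seed is the standard FNV offset basis; varying it yields
    different hashes for the same key, which is useful for schemes (like
    cuckoo hashing) that need several hash functions from one algorithm.
    """
    FNV_PRIME = 0x01000193
    h = seed
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the result 32 bits wide
    return h

# Deterministic: the same key always produces the same hash, which is then
# reduced to a slot index (here, a hypothetical table with 16 slots).
assert fnv1a(b"user_42") == fnv1a(b"user_42")
slot = fnv1a(b"user_42") % 16
```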
Static Hashing Schemes
In a static hashing scheme, the size of the hash table is fixed. This means that once the hash table reaches its storage limit, the DBMS needs to rebuild a larger hash table from the beginning, which can be quite costly. Normally, the new hash table is created with a size that is twice as large as the original.
To minimize inefficient comparisons, it is important to avoid collisions between hashed keys. One common approach is to allocate double the number of slots in the hash table compared to the expected number of elements. This provides additional space to accommodate potential collisions and reduces the likelihood of wasted comparisons during key retrieval.
The following assumptions usually do not hold in reality:
The number of elements is known ahead of time.
Keys are unique.
There exists a perfect hash function. Therefore, we need to choose the hash function and hashing scheme appropriately.
Linear Probe Hashing
This is the fastest and most basic hashing scheme, and it is the one most DBMSs implement. It uses a circular buffer of array slots and stores both the key and the value in the slot.
How does it deal with collisions?
The hash function maps keys to slots. When a collision occurs, we linearly search the adjacent slots until an open one is found.
Lookups
For lookups, we check the slot the key hashes to (remember, we store both the key and the value) and search linearly until we find the desired entry. If we reach an empty slot, or we have iterated over every slot in the hash table, then the key is not in the table.
Deletions
Deletions are trickier. We have to be careful about simply removing the entry from its slot, as this may prevent future lookups from finding entries that were placed after the now-empty slot.
There are two solutions to this problem:
The first option is to shift the adjacent data after deleting an entry to fill the now empty slot. However, we must be careful to only move the entries which were originally shifted. This is rarely implemented in practice.
The other, and most common, approach is to use “tombstones”. Instead of deleting the entry, we replace it with a “tombstone” entry that tells future lookups to keep scanning.
Non-unique Keys
In the case where the same key may be associated with multiple different values or tuples, there are two approaches to take:
Separate Linked List: Instead of storing the values with the keys, we store a pointer to a separate storage area that contains a linked list of all the values.
Redundant Keys: The more common approach is to simply store the same key multiple times in the table. Everything with linear probing still works even if we do this.
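Putting the pieces together, below is a minimal Python sketch of a fixed-size linear-probe table with tombstone deletes. The table size, the TOMBSTONE sentinel, and the use of Python's built-in hash are illustrative assumptions, not how any particular DBMS implements it; duplicate keys are simply stored again (the redundant-keys approach), and find/delete act on the first match.

```python
TOMBSTONE = object()  # sentinel left in a slot by a delete

class LinearProbeHashTable:
    """A minimal fixed-size linear-probing sketch with tombstone deletes.
    Assumes the table never fills up completely."""

    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size  # None = empty, TOMBSTONE = deleted, else (key, value)

    def _probe(self, key):
        # Yield slot indices starting at the hashed slot, wrapping around.
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key, value):
        for slot in self._probe(key):
            if self.slots[slot] is None or self.slots[slot] is TOMBSTONE:
                self.slots[slot] = (key, value)
                return
        raise RuntimeError("hash table is full")

    def find(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is None:          # reached an empty slot: the key is absent
                return None
            if entry is not TOMBSTONE and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is None:
                return
            if entry is not TOMBSTONE and entry[0] == key:
                # Leave a tombstone so later lookups keep scanning past here.
                self.slots[slot] = TOMBSTONE
                return
```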
Robin Hood Hashing
This technique is an extension of linear probe hashing that aims to minimize the maximum distance between keys and their ideal positions in the hash table.
This strategy steals slots from “rich” keys and gives them to “poor” keys. In this variant, each entry also records the “distance” it is from its optimal position.
Then, on each insert, if the key being inserted would be farther away from its optimal position at the current slot than the current entry’s distance, we replace the current entry and continue trying to insert the old entry farther down the table.
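A minimal Python sketch of that insert path, assuming a fixed-size table with at least one empty slot; the (key, value, distance) tuple layout and Python's built-in hash are illustrative choices.

```python
def robin_hood_insert(table, key, value):
    """Robin Hood insert into a fixed-size open-addressing table.

    `table` is a list where each slot is None or a (key, value, distance)
    tuple; `distance` is how far the entry sits from its hashed "home" slot.
    """
    size = len(table)
    slot = hash(key) % size
    entry = (key, value, 0)
    while True:
        resident = table[slot]
        if resident is None:
            table[slot] = entry
            return
        # If the entry being inserted is "poorer" (farther from home) than
        # the resident entry, it steals the slot; the displaced "richer"
        # entry continues probing farther down the table.
        if entry[2] > resident[2]:
            table[slot] = entry
            entry = resident
        # Move to the next slot (circularly) and bump the carried entry's
        # distance from its optimal position.
        slot = (slot + 1) % size
        entry = (entry[0], entry[1], entry[2] + 1)

# Example usage with an illustrative table of 8 slots:
table = [None] * 8
robin_hood_insert(table, "key_a", 1)
robin_hood_insert(table, "key_b", 2)
```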
Cuckoo Hashing
Instead of using a single hash table, this approach maintains multiple hash tables with different hash functions.
The hash functions are the same algorithm (e.g., XXHash, CityHash); they generate different hashes for the same key by using different seed values.
When we insert, we check every table and choose one that has a free slot (if multiple have one, we can compare things like load factor, or more commonly, just choose a random table).
If no table has a free slot, we choose one (typically at random) and evict the old entry. We then rehash the old entry into a different table.
In rare cases, we may end up in a cycle. If this happens, we can rebuild all of the hash tables with new hash function seeds (less common) or rebuild the hash tables using larger tables (more common).
Cuckoo hashing guarantees O(1) lookups and deletions, but insertions may be more expensive.
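To make the eviction loop concrete, here is a hedged Python sketch with two tables; the seed constants, the MAX_KICKS limit, and the doubling-on-rebuild policy are arbitrary illustrative choices, not a reference implementation.

```python
import random

class CuckooHashTable:
    """A minimal two-table cuckoo hashing sketch. Both tables use the same
    hash algorithm (Python's built-in hash) mixed with a different seed."""

    MAX_KICKS = 32  # give up and rebuild after this many evictions

    def __init__(self, size=16):
        self.size = size
        self.seeds = (0x9E3779B9, 0x85EBCA6B)  # two arbitrary seed constants
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, which):
        return hash((self.seeds[which], key)) % self.size

    def find(self, key):
        # A key can only live in one of two slots, so lookups are O(1).
        for which in (0, 1):
            entry = self.tables[which][self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        # Prefer a table that already has a free slot for this key.
        for which in (0, 1):
            slot = self._slot(entry[0], which)
            if self.tables[which][slot] is None:
                self.tables[which][slot] = entry
                return
        # Otherwise pick a table at random, evict its entry, and keep
        # re-inserting evicted entries into the other table.
        which = random.randrange(2)
        for _ in range(self.MAX_KICKS):
            slot = self._slot(entry[0], which)
            entry, self.tables[which][slot] = self.tables[which][slot], entry
            if entry is None:
                return
            which = 1 - which
        # Probable cycle: rebuild with larger tables and retry the leftover entry.
        self._rehash(self.size * 2)
        self.insert(*entry)

    def _rehash(self, new_size):
        old_entries = [e for t in self.tables for e in t if e is not None]
        self.size = new_size
        self.tables = [[None] * new_size, [None] * new_size]
        for k, v in old_entries:
            self.insert(k, v)
```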
Dynamic Hashing Schemes
The previous hash tables require the DBMS to know the number of elements it wants to store. Otherwise, it must rebuild the table if it needs to grow/shrink in size.
Dynamic hashing schemes are able to resize the hash table on demand without needing to rebuild the entire table. The schemes perform this resizing in different ways that can either maximize reads or writes.
Chained Hashing
This is the most common dynamic hashing scheme. The DBMS maintains a linked list of buckets for each slot in the hash table. Keys that hash to the same slot are simply inserted into the linked list for that slot.
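A minimal Python sketch of the idea, using plain in-memory lists as the per-slot chains purely for illustration (a real DBMS would typically chain bucket pages):

```python
class ChainedHashTable:
    """Chained hashing: each slot holds a list ("chain") of (key, value)
    pairs that hashed to it."""

    def __init__(self, num_slots=16):
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def insert(self, key, value):
        # Collisions just grow the chain; non-unique keys work for free.
        self._chain(key).append((key, value))

    def find(self, key):
        return [v for k, v in self._chain(key) if k == key]

    def delete(self, key):
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]
```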
Extendible Hashing
An improved variant of chained hashing that splits buckets instead of letting chains grow forever. This approach allows multiple slot locations in the hash table to point to the same bucket chain.
The core idea behind re-balancing the hash table is to move bucket entries on split and increase the number of bits to examine to find entries in the hash table. This means that the DBMS only has to move data within the buckets of the split chain; all other buckets are left untouched.
The DBMS maintains global and local depth bit counts that determine the number of bits needed to find buckets in the slot array.
When a bucket is full, the DBMS splits the bucket and reshuffles its elements. If the local depth of the split bucket is less than the global depth, then the new bucket is just added to the existing slot array. Otherwise, the DBMS doubles the size of the slot array to accommodate the new bucket and increments the global depth counter.
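The following is a simplified Python sketch of that split logic; the bucket capacity, the use of the low global-depth bits of Python's built-in hash as the slot index, and the directory-doubling details are illustrative assumptions rather than a production design.

```python
class Bucket:
    def __init__(self, local_depth, capacity=4):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}  # key -> value

class ExtendibleHashTable:
    """Simplified extendible hashing using the low `global_depth` bits of
    the hash as the slot-array index."""

    def __init__(self, bucket_capacity=4):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        # Multiple slots may point to the same bucket.
        self.directory = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    def _slot(self, key):
        return hash(key) & ((1 << self.global_depth) - 1)

    def find(self, key):
        return self.directory[self._slot(key)].items.get(key)

    def insert(self, key, value):
        bucket = self.directory[self._slot(key)]
        if key in bucket.items or len(bucket.items) < bucket.capacity:
            bucket.items[key] = value
            return
        self._split(bucket)
        self.insert(key, value)  # retry after the split

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the slot array and bump the global depth; the new half
            # initially points at the same buckets as the old half.
            self.directory = self.directory * 2
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, bucket.capacity)
        # Re-point the half of the old bucket's slots whose new
        # distinguishing bit is set.
        high_bit = 1 << (bucket.local_depth - 1)
        for i in range(len(self.directory)):
            if self.directory[i] is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Reshuffle only the split bucket's entries; all other buckets stay untouched.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._slot(k)].items[k] = v
```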
Linear Hashing
Instead of immediately splitting a bucket when it overflows, this scheme maintains a split pointer that keeps track of the next bucket to split. Regardless of whether the bucket at the pointer is the one that overflowed, the DBMS always splits the bucket at the pointer.
The overflow criterion is left up to the implementation.
When any bucket overflows, split the bucket at the pointer location by adding a new slot entry and creating a new hash function.
If the original hash function maps a key to a slot that the split pointer has already passed, apply the new hash function instead.
When the pointer reaches the last slot, delete the original hash function and replace it with a new hash function.
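A simplified Python sketch of this scheme, where the overflow criterion is simply "a bucket exceeds its capacity" and overflowed buckets just keep growing until the split pointer reaches them; the constants and the use of Python's built-in hash are illustrative assumptions.

```python
class LinearHashTable:
    """Simplified linear hashing with a split pointer and two hash
    functions per round: hash % round_size and hash % (2 * round_size)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.round_size = 2     # number of buckets at the start of the current round
        self.split_ptr = 0      # next bucket to be split
        self.buckets = [[] for _ in range(self.round_size)]

    def _index(self, key):
        # Buckets already split this round are addressed with the next hash function.
        i = hash(key) % self.round_size
        if i < self.split_ptr:
            i = hash(key) % (self.round_size * 2)
        return i

    def find(self, key):
        return [v for k, v in self.buckets[self._index(key)] if k == key]

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        bucket.append((key, value))
        if len(bucket) > self.capacity:   # overflow criterion for this sketch
            self._split()

    def _split(self):
        # Always split the bucket at the split pointer, which is not
        # necessarily the bucket that just overflowed.
        self.buckets.append([])
        old_entries = self.buckets[self.split_ptr]
        self.buckets[self.split_ptr] = []
        self.split_ptr += 1
        for k, v in old_entries:
            self.buckets[self._index(k)].append((k, v))
        if self.split_ptr == self.round_size:
            # Every original bucket has been split: start a new round with
            # the new hash function as the base.
            self.round_size *= 2
            self.split_ptr = 0
```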
Conclusion
In conclusion, this article provided an overview of Hash Tables and their importance in DBMS. We discussed different hashing schemes, including static hashing schemes like Linear Probe Hashing, Robin Hood Hashing, and Cuckoo Hashing, as well as dynamic hashing schemes such as Chained Hashing, Extendible Hashing, and Linear Hashing. We also emphasized the role of hash functions in mapping keys to slots in the table. By implementing efficient data structures like Hash Tables, DBMS can enhance data storage, retrieval, and organization, resulting in improved performance and functionality.