Saturday, May 25, 2019

Hash Tables

Hello everyone,

This week we are going to be looking into a lab that investigates the implementation of Hash Tables. This lab is, in my opinion, a little more challenging than the other labs I have touched on. Without further ado, let's start by learning what a hash table is.

What is a Hash Table?

A Hash Table is an ADT that uses a key to determine where in an array to store data. A good example of this is a phone book, where each person's name (the key) corresponds to their phone number (the data). The Hash Table turns the key into a hash, using a hashing function, to determine where in the array the data is located.

Hash Tables are used mainly for quick access to data, with an average case of O(1) (the worst case is O(n), when many keys collide). However, a hashing function will sometimes give an index that is already in use, causing a collision. There are several solutions to the collision problem, with some common ones being separate chaining, linear probing, and quadratic probing.
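To make the collision idea concrete, here is a tiny, purely illustrative example of turning keys into array indices with a hash function and a modulus. The names and toy table size are made up; whether these two particular keys actually collide depends on the standard library's hash.

    #include <functional>
    #include <iostream>
    #include <string>

    int main() {
        std::hash<std::string> hasher;       // turns a key into a large number
        const std::size_t table_size = 10;   // toy table size

        // Reducing the hash modulo the table size gives the slot index.
        // If two keys end up with the same index, that is a collision.
        std::string a = "Alice", b = "Bob";
        std::cout << a << " -> slot " << hasher(a) % table_size << "\n";
        std::cout << b << " -> slot " << hasher(b) % table_size << "\n";
        return 0;
    }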

A Hash Table that uses separate chaining contains an array of linked lists. The idea is that if the hashing function gives a location already occupied by a piece of data, we add a new node to the linked list at that location.
The issue with separate chaining occurs when there is clustering: data starts gathering in one portion of the array. Clustering causes the traversal times of the linked lists to grow and also wastes some memory due to unused slots elsewhere in the array.
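Here is a minimal sketch of a separate-chaining table, just to illustrate the idea; this is not the lab's code, and the class and method names are made up:

    #include <functional>
    #include <list>
    #include <string>
    #include <vector>

    // Each slot of the array holds a linked list ("chain"); colliding keys
    // are simply appended to the chain in their slot.
    class ChainedSet {
    public:
        explicit ChainedSet(std::size_t buckets = 11) : table(buckets) {}

        void insert(const std::string& key) {
            auto& chain = table[std::hash<std::string>{}(key) % table.size()];
            for (const auto& k : chain)
                if (k == key) return;     // already stored
            chain.push_back(key);         // collision? just grow the chain
        }

        bool contains(const std::string& key) const {
            const auto& chain = table[std::hash<std::string>{}(key) % table.size()];
            for (const auto& k : chain)
                if (k == key) return true;
            return false;
        }

    private:
        std::vector<std::list<std::string>> table;
    };

    int main() {
        ChainedSet names;
        names.insert("Alice");
        names.insert("Bob");
        return names.contains("Alice") ? 0 : 1;   // found -> exit code 0
    }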

A better alternative to separate chaining is called open addressing, where we have a single array with no linked lists inside it. This is where linear and quadratic probing come into play. With open addressing, if the spot given by the hash function is taken, the hash table keeps checking other spots according to a probing sequence until it finds a vacant one (keeping everything in one array while still resolving collisions). Linear probing checks the very next spot whenever there is a collision, while quadratic probing checks the (index + k^2)th spot on the kth probe.
[Image: Linear and Quadratic Probing]
Linear probing causes more clustering than quadratic probing, since, as shown in the image above, linear probing has smaller probing intervals than quadratic probing.
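If you would rather see the difference as numbers than as a picture, here is a small standalone demo of the two probe sequences starting from the same home slot (the table size and starting slot are arbitrary choices for the example):

    #include <cstddef>
    #include <iostream>

    int main() {
        const std::size_t size = 11;   // illustrative prime table size
        const std::size_t home = 3;    // pretend the hash function gave us slot 3

        // Linear probing tries home, home+1, home+2, ... (mod size),
        // so colliding keys pile up right next to each other.
        std::cout << "linear:    ";
        for (std::size_t k = 0; k < 5; ++k)
            std::cout << (home + k) % size << ' ';

        // Quadratic probing tries home, home+1, home+4, home+9, ... (mod size),
        // so consecutive probes spread out much faster.
        std::cout << "\nquadratic: ";
        for (std::size_t k = 0; k < 5; ++k)
            std::cout << (home + k * k) % size << ' ';
        std::cout << '\n';
        return 0;
    }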

There are a lot more things behind hash tables, some of which will be mentioned in the implementation section, but this is the general idea of what a hash table is and of its different implementations. Here are two sites that do a better job of describing linear and quadratic probing. Now let's get into the implementation.

Linear probing: http://www.cs.rmit.edu.au/online/blackboard/chapter/05/documents/contribute/chapter/05/linear-probing.html
Quadratic probing: https://courses.cs.washington.edu/courses/cse326/00wi/handouts/lecture16/sld019.htm

The Implementation

In this lab we implemented two hash tables: one that uses linear probing and another that uses quadratic probing. Since the two probing methods use similar functions, our quadratic probing hash table inherits from our linear probing hash table.

Both hash tables have an inner class called Entry and a vector of these Entry objects used to store the data. Each Entry stores the data given to it and the state of the entry: active, vacant, or deleted. States are important when it comes to inserting and removing data from the table.
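Here is a rough sketch of what that inner class and vector can look like. The names Entry, load_size, and the three states come from the lab description above; the exact fields and types are my assumption:

    #include <cstddef>
    #include <string>
    #include <vector>

    class HashTableSketch {
    protected:
        // The three states let probing tell "never used" (VACANT) apart from
        // "used, then removed" (DELETED), which matters for insert and remove.
        enum class State { ACTIVE, VACANT, DELETED };

        struct Entry {
            std::string data;              // the data stored in this slot
            State state = State::VACANT;   // every slot starts out vacant
        };

        std::vector<Entry> elems = std::vector<Entry>(11);  // vector of Entry objects
        std::size_t load_size = 0;         // slots counted toward the load factor

        // load factor = load_size / vector size, as described below
        double load_factor() const {
            return static_cast<double>(load_size) / elems.size();
        }
    };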

Our hash tables calculate a value known as the load factor whenever it is appropriate; the load factor is found by dividing load_size by the size of the vector. The load factor determines when to resize our hash tables to make sure that our linear/quadratic probing can always find a vacant spot.

Our hash function uses the C++ STL "hash" class template (std::hash), so we do not have to spend time figuring out how to make an effective hashing function ourselves.
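For anyone who has not used it before, std::hash is a functor: you instantiate it for the key type and then call it like a function to get a std::size_t back. A quick illustrative example:

    #include <functional>
    #include <iostream>
    #include <string>

    int main() {
        std::hash<std::string> hasher;                        // the STL hash functor
        std::cout << hasher("hash tables are fun") << '\n';   // some large size_t value
        return 0;
    }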

Besides the usual insert() and remove() methods, we have a set of methods that are very important to our hash tables:
hash_and_index() and find_pos(): hash_and_index() calls our hashing function and returns the index that it produces for a given key. This method is called within find_pos(), which is where our linear or quadratic probing occurs (depending on how find_pos() is overridden).
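To make that concrete, here is a sketch of what a linear-probing find_pos() could look like. The function names match the ones above, but the bodies are my own guess at the idea rather than the lab's actual code, and I am assuming the max load factor guarantees at least one VACANT slot so the loop terminates:

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    enum class State { ACTIVE, VACANT, DELETED };
    struct Entry { std::string data; State state = State::VACANT; };

    // hash_and_index(): call the hashing functor and squeeze the result
    // into the vector's range.
    std::size_t hash_and_index(const std::string& key, std::size_t size) {
        return std::hash<std::string>{}(key) % size;
    }

    // find_pos(): start at the home slot and step forward one slot at a time
    // until we find the key or reach a VACANT slot. Callers then check whether
    // the returned slot is ACTIVE. A quadratic version would override this and
    // step by 1, 4, 9, ... instead.
    std::size_t find_pos(const std::vector<Entry>& elems, const std::string& key) {
        std::size_t pos = hash_and_index(key, elems.size());
        while (elems[pos].state != State::VACANT && elems[pos].data != key)
            pos = (pos + 1) % elems.size();   // linear probe: try the next slot
        return pos;
    }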

rehash() and grow_capacity(): rehash() is called when the load factor we mentioned earlier exceeds a set max load factor. Within rehash(), we call grow_capacity() to grow the vector and then move all non-deleted Entry objects from the old vector into the newly resized one.
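A sketch of that rehash step, again using made-up bodies around the names from the post (the real lab code likely differs in details such as how much the table grows):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    enum class State { ACTIVE, VACANT, DELETED };
    struct Entry { std::string data; State state = State::VACANT; };

    // rehash(): build a larger table and re-insert every ACTIVE entry into it.
    // DELETED entries are dropped here, which finally reclaims their space.
    void rehash_sketch(std::vector<Entry>& elems, std::size_t& load_size) {
        std::vector<Entry> old = std::move(elems);
        elems.assign(old.size() * 2, Entry{});   // "grow_capacity": double the vector
        load_size = 0;

        for (const Entry& e : old) {
            if (e.state != State::ACTIVE) continue;   // skip VACANT and DELETED slots
            // Re-insert with linear probing into the new, larger table.
            std::size_t pos = std::hash<std::string>{}(e.data) % elems.size();
            while (elems[pos].state == State::ACTIVE)
                pos = (pos + 1) % elems.size();
            elems[pos] = {e.data, State::ACTIVE};
            ++load_size;
        }
    }

A quadratic-probing table would override the growth step to jump to the next prime size instead of simply doubling.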

default_max_load_factor(): Essentially returns the default max load factor value, which is used whenever the user tries to set a max load factor that could end up breaking our hash table. 0.75 is typically used for linear probing hash tables, while quadratic probing hash tables use 0.5.
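A tiny sketch of how the two classes might report different defaults (the exact signature in the lab may differ):

    // Virtual so the quadratic table can substitute its own, stricter default.
    struct LinearProbingTable {
        virtual ~LinearProbingTable() = default;
        virtual double default_max_load_factor() const { return 0.75; }
    };

    struct QuadraticProbingTable : LinearProbingTable {
        double default_max_load_factor() const override { return 0.5; }
    };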

The only methods we override in our quadratic probing hash table class are default_max_load_factor(), find_pos(), and grow_capacity().

Within our quadratic probing hash table implementation, we added a helper method called next_prime(), which returns the smallest prime number greater than a value we pass in. These prime numbers are used as the size of the vector in a quadratic probing hash table to ensure that empty spots can actually be found by the probing sequence.
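Since next_prime() only runs when the table grows, even a simple trial-division version is fast enough. Here is one possible sketch (not necessarily how the lab implements it):

    #include <cstddef>

    bool is_prime(std::size_t n) {
        if (n < 2) return false;
        for (std::size_t d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    // Smallest prime strictly greater than n.
    std::size_t next_prime(std::size_t n) {
        while (!is_prime(++n)) {}
        return n;
    }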

I believe this is pretty much it when it comes to implementing linear and quadratic probing hash tables. I would suggest watching videos about them, because reading alone can be confusing for some (it certainly was for me the first time I read about them!).

Lab Questions

Q: How can you be sure that the QP probing strategy won't get stuck in a cycle of fruitless examinations (i.e. there exist vacant cells in your table, but you never encounter them)?
A: I would say that the QP probing strategy won't get stuck in a cycle because the table sizes are prime numbers. I forgot to mention that within hash_and_index(), there is a line of code that takes the result of C++ STL's hash functor and applies a modulus of the table size to it (Hash(x) % vector.size()). As mentioned in a set of notes I found, prime table sizes make sure the probing sequence is not confined to a small set of indices (for example, only the even ones), so more spots can actually be reached. In fact, the standard argument is that when the table size is prime, the first half of the quadratic probe sequence lands on distinct slots, so as long as the table stays at most half full (hence the 0.5 max load factor), a vacant spot is always encountered.
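Here is a quick standalone check of that claim for one prime table size (11). It counts how many distinct slots the first size/2 + 1 quadratic probes touch; with a prime size they are all distinct, so a table kept at most half full always has a vacant slot among them:

    #include <cstddef>
    #include <iostream>
    #include <set>

    int main() {
        const std::size_t size = 11, home = 0;   // prime table size, arbitrary home slot
        std::set<std::size_t> seen;
        for (std::size_t k = 0; k <= size / 2; ++k)
            seen.insert((home + k * k) % size);  // quadratic probe sequence
        std::cout << "distinct slots probed: " << seen.size()
                  << " out of " << size / 2 + 1 << " probes\n";   // prints 6 out of 6
        return 0;
    }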

Q: How is the average element access time affected by the probing strategy?
A: I conducted some tests involving inserting up to 1 million elements into a hash table, and based on the results it seems that both linear and quadratic probing have similar insertion times. As I increased the number of elements inserted into the hash table, the average times increased by small amounts. The same thing happened as I decreased the max load factor value. I even tested what happens if I set my max load factor close to 0 (such as 0.00000000001), with the result being that the program hangs (presumably because the table has to grow enormously before even a single element satisfies such a tiny limit).

Q: What happens when you delete an element by setting its VACANT flag instead of the DELETED flag?
A: Deletion in hash tables is essentially the lazy deletion mentioned in the Lazy BST post. The way deletion works in hash tables is that we use find_pos() to find the item we want to delete and mark its associated Entry object as DELETED. If we were to set the state of the entry to VACANT instead, it would produce a gap in our data that causes elements further along the probe sequence to "go missing", since probing stops as soon as it reaches a VACANT slot.
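A small sketch of that lazy-delete behavior (my own illustration, assuming a linear-probing table where the max load factor guarantees a VACANT slot exists):

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    enum class State { ACTIVE, VACANT, DELETED };
    struct Entry { std::string data; State state = State::VACANT; };

    // Lazy deletion: we only flip the state to DELETED. Probing treats DELETED
    // as "keep looking", so any element that was pushed further down the probe
    // sequence by this entry can still be found. Marking the slot VACANT instead
    // would make the search below stop early and report those elements missing.
    bool remove_sketch(std::vector<Entry>& elems, const std::string& key) {
        std::size_t pos = std::hash<std::string>{}(key) % elems.size();
        while (elems[pos].state != State::VACANT) {          // VACANT ends the search
            if (elems[pos].state == State::ACTIVE && elems[pos].data == key) {
                elems[pos].state = State::DELETED;           // lazy delete, data stays
                return true;
            }
            pos = (pos + 1) % elems.size();                  // probe onward
        }
        return false;                                        // key was never stored
    }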

Conclusion

Hash tables are great when it comes to quick data access. However, implementing them can be tricky and confusing, as we saw here. Thankfully, we can overcome this trickiness and confusion with the vast amount of resources available to us. One site I posted above, which belongs to RMIT in Australia, does a pretty good job of explaining what linear probing is and provides the source code for the methods associated with linear probing, with some of the code highlighted in different colors to be more visually friendly. Overall, hash tables are complicated but fast ADTs when it comes to accessing data. I hope you enjoyed reading this post, and, as always, feel free to leave a comment and have a good day!
