Saturday, May 25, 2019

Hash Tables

Hello everyone,

This week we are going to be looking into a lab that investigates the implementation of hash tables. This lab is, in my opinion, a little more challenging than the other labs I have touched on. Without further ado, let's start with learning what a hash table is.

What is a Hash Table?

A Hash Table is an ADT that uses a key to determine where in an array to store data. A good example of this is a phone book, where each person's name (the key) corresponds to their phone number (the data). The Hash Table turns the key into a hash, using a hashing function, to determine where in the array the data is located.

Hash Tables are used mainly for quick access to data, with an average case of O(1). However, a hashing function will sometimes give an index that is already in use, causing a collision. There are several solutions to the collision problem, with some common ones being separate chaining, linear probing, and quadratic probing.

A Hash Table that uses separate chaining contains an array of linked lists. The idea is that if the hashing function gives a location already occupied by a piece of data, it adds a new node to the linked list in that location (as shown in the image below).
The issue with separate chaining occurs when there is clustering: data starts gathering in one portion of the array. Clustering causes the traversal times of the linked lists to grow and also wastes memory due to unused slots in the array.

A better alternative to separate chaining is called open addressing, where we have an array with no linked lists inside it. This is where linear and quadratic probing come into play. With open addressing, if the spot given by the hash function is taken, the hash table checks other spots until it finds a vacant one (addressing both access time and collisions). Linear probing makes our hash table check the next spot after a collision, while quadratic probing checks the (index + k^2)th spot on the kth attempt.
Linear and Quadratic Probing
Linear probing causes more clustering than quadratic probing since, as shown in the image above, linear probing has smaller probing intervals than quadratic probing.
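To make the difference concrete, here is a tiny C++ sketch (my own illustration, not the lab's code) of the slot each strategy examines on the kth attempt after a collision at index home:

    #include <cstddef>

    // Hypothetical illustration: the slot examined on the k-th attempt after a
    // collision at index `home`, for a table with `size` slots.
    std::size_t linear_probe(std::size_t home, std::size_t k, std::size_t size) {
        return (home + k) % size;       // home, home+1, home+2, ...
    }

    std::size_t quadratic_probe(std::size_t home, std::size_t k, std::size_t size) {
        return (home + k * k) % size;   // home, home+1, home+4, home+9, ...
    }

For example, with a table of size 10 and home = 3, linear probing visits 3, 4, 5, 6, ... while quadratic probing visits 3, 4, 7, 2 (12 mod 10), and so on, spreading the collided items further apart.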

There is a lot more behind hash tables, some of which will be mentioned in the implementation section, but this is the general idea of what a hash table is and the different ways to implement one. Here are two sites that do a better job of describing linear and quadratic probing. Now let's get into the implementation.

Linear probing: http://www.cs.rmit.edu.au/online/blackboard/chapter/05/documents/contribute/chapter/05/linear-probing.html
Quadratic probing: https://courses.cs.washington.edu/courses/cse326/00wi/handouts/lecture16/sld019.htm

The Implementation

In this lab we implemented two hash tables: one that uses linear probing and another that uses quadratic probing. Since the two probing methods rely on similar functions, our quadratic probing hash table inherits from our linear probing hash table.

Both hash tables have an inner class called Entry and a vector of these Entry objects used to store data. Each Entry stores the data given to it and the state of the entry: active, vacant, or deleted. States are important when it comes to inserting and removing data from the table.
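As a rough idea of what this could look like (the names here are my own guesses, not necessarily the lab's exact code), the Entry and its state might be sketched like this:

    #include <cstddef>
    #include <vector>

    // Sketch of the inner Entry class; the real lab code may differ.
    template <typename T>
    class HashTableLP {
    protected:
        enum class State { VACANT, ACTIVE, DELETED };

        struct Entry {
            T data{};                       // the payload stored in this slot
            State state = State::VACANT;    // every slot starts out vacant
        };

        std::vector<Entry> table;           // the backing array of slots
        std::size_t load_size = 0;          // number of slots that are not VACANT
        double max_load_factor = 0.75;      // when exceeded, we rehash
    };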

Our hash tables calculate a value known as the load factor whenever appropriate; the load factor is found by dividing load_size by the size of the vector. The load factor determines when to resize our hash tables, to make sure that our linear/quadratic probing can still find a vacant spot.

Our hash function uses the C++ STL "hash" class template (std::hash), so we do not have to spend time figuring out how to make an effective hashing function ourselves.

Besides the usual insert() and remove() methods, we have a set of methods that are very important to our hash tables:
hash_and_index() and find_pos(): hash_and_index() calls our hashing function and returns the index it produces. This method is called within find_pos(), which is where our linear or quadratic probing occurs (depending on how find_pos() is overridden).
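Here is a hedged sketch of what these two methods could look like for the linear probing table, written as if inside the HashTableLP<T> class sketched above (the signatures and details are assumptions on my part):

    // Requires <functional> for std::hash.
    std::size_t hash_and_index(const T& item) const {
        return std::hash<T>{}(item) % table.size();       // map the raw hash into the table
    }

    // Linear probing: step one slot at a time until we reach the item itself
    // or a VACANT slot (DELETED slots are stepped over, not stopped at).
    virtual std::size_t find_pos(const T& item) const {
        std::size_t pos = hash_and_index(item);
        while (table[pos].state != State::VACANT &&
               !(table[pos].state == State::ACTIVE && table[pos].data == item)) {
            pos = (pos + 1) % table.size();                // the linear step
        }
        return pos;
    }

The quadratic probing subclass would override find_pos() so that the step between examined slots grows as 1, 4, 9, ... instead of always being 1.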

rehash() and grow_capacity(): rehash() is called when the load factor we mentioned earlier exceeds a set max load factor. rehash() calls grow_capacity() to grow the vector and then moves all non-deleted Entry objects from the old vector into the newly resized one.
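A minimal sketch of how these two could fit together, again assuming the member names from the earlier sketches:

    // Sketch of rehashing: grow the vector, then re-insert every ACTIVE entry.
    void rehash() {
        std::vector<Entry> old_table = table;        // keep the old entries around
        grow_capacity();                             // replace `table` with a bigger, all-VACANT vector
        load_size = 0;
        for (const Entry& e : old_table)
            if (e.state == State::ACTIVE)            // DELETED and VACANT slots are dropped here
                insert(e.data);                      // re-hash into the new vector
    }

    virtual void grow_capacity() {
        table.assign(2 * table.size(), Entry{});     // linear probing: simply double the slot count
    }

A natural place to check the load factor is at the top of insert(): if adding one more entry would push load_size divided by table.size() past the max load factor, call rehash() before probing for a spot.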

default_max_load_factor(): Essentially returns the default max load factor value whenever the user tries to set a max load factor that could break our hash table. 0.75 is typically used for linear probing hash tables, while quadratic probing hash tables use 0.5.

The only methods we override in our quadratic probing hash table class are default_max_load_factor(), find_pos(), and grow_capacity().

Within our quadratic probing hash table implementation, we added a helper method called next_prime(), which returns the smallest prime number greater than the value we pass into the method. These prime numbers are used as the size of the vector in a quadratic probing hash table to ensure that empty spots are found and used quickly.
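For reference, a simple trial-division version of next_prime() (my own sketch; the lab's helper may be written differently) could look like:

    #include <cstddef>

    // Smallest prime strictly greater than n; trial division is fine at lab scale.
    bool is_prime(std::size_t n) {
        if (n < 2) return false;
        for (std::size_t d = 2; d * d <= n; ++d)
            if (n % d == 0) return false;
        return true;
    }

    std::size_t next_prime(std::size_t n) {
        std::size_t candidate = n + 1;
        while (!is_prime(candidate)) ++candidate;
        return candidate;
    }

The quadratic table's grow_capacity() override can then use something like next_prime(2 * table.size()) as the new vector size.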

I believe that is pretty much it when it comes to implementing linear and quadratic probing hash tables. I would suggest watching videos about them, because reading alone can be confusing (it certainly was for me the first time I read about them!).

Lab Questions

Q: How can you be sure that the QP probing strategy won't get stuck in a cycle of fruitless examinations (i.e. there exist vacant cells in your table, but you never encounter them)?
A: I would say that the QP probing strategy won't get stuck in a cycle because the table sizes are prime numbers. I forgot to mention that within hash_and_index(), there is a line of code that takes the result of the C++ STL hash functor and applies a modulus of the table size to it (Hash(x) % vector.size()). As mentioned in this set of notes, prime table sizes make sure that more than just the even-valued indices get used, so more spots can actually be reached. More specifically, with a prime table size and the load factor kept at or below 0.5 (which is why our quadratic table's default max load factor is 0.5), the first half of the quadratic probe sequence is guaranteed to land on distinct slots, so a vacant spot will always be found before the probing can start cycling.

Q: How is the average element access time affected by the probing strategy?
A: I conducted some tests involving inserting up to 1 million elements into a hash table, and based on the results it seems that both linear and quadratic probing have similar insertion times. As I increased the number of elements inserted into the hash table, the average times increased by small amounts. The same thing occurred when I decreased the max load factor value. I even tested what happens if I set my max load factor close to 0 (such as 0.00000000001), with the result being that the program hangs.

Q: What happens when you delete an element by setting its VACANT flag instead of the DELETED flag?
A: Deletion in hash tables is essentially similar to the lazy deletion mentioned in the Lazy BST post. The way deletion works in a hash table is that you use find_pos() to find the item you want to delete and mark its associated Entry object as DELETED. If we were to set the state of the entry to VACANT instead, it would produce a gap in the middle of our data that can cause elements probed past that location to "go missing", since find_pos() stops as soon as it reaches a VACANT slot.
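A small sketch of what that looks like in code, reusing the assumed names from the earlier hash table sketches:

    // Sketch of lazy removal in the hash table: mark the slot, don't erase it.
    bool remove(const T& item) {
        std::size_t pos = find_pos(item);        // probe to the item or to a VACANT slot
        if (table[pos].state != State::ACTIVE)
            return false;                        // the item was not in the table
        table[pos].state = State::DELETED;       // leave a "tombstone" behind
        return true;
    }

If the slot were marked VACANT here instead, a later find_pos() on some other key would stop at that slot and never reach items that had originally been probed past it.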

Conclusion

Hash tables are great when it comes to quick data access. However, implementing them is fairly tricky and confusing, as we saw here. We can overcome this trickiness and confusion with the vast amount of resources available to us. One site I posted, which belongs to RMIT in Australia, does a pretty good job of explaining what linear probing is and provides the source code for the methods associated with linear probing, with some of the code highlighted in different colors to be more visually friendly. Overall, hash tables are complicated but fast ADTs when it comes to accessing data. I hope you enjoyed reading this post, and, as always, feel free to leave a comment and have a good day!

Friday, May 10, 2019

Top Down Splaying

Hello everyone,

This week we will be talking about Splay Trees, a variant of binary search trees that uses some algorithms usually associated with AVL trees. In this lab, we used the BST class we coded last week to make our Splay_BST class this week. As always, feel free to leave a comment, and let's get into this.

What are AVL and Splay Trees?

Before we begin, we need to know what an AVL tree is. AVL trees are self-balancing binary search trees. An AVL tree is considered unbalanced when the heights of some node's left and right subtrees differ by more than 1. AVL trees are designed to keep the worst case of operations done on the tree (such as searching) close to O(log N), since most operations on binary search trees depend on the height of the tree. AVL trees balance themselves automatically when operations such as insert() and remove() are called, using a technique called "rotation". I suggest doing your own personal study on rotation, as it is a tricky topic to understand, but it is used in our Splay Trees.
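For a flavor of what a rotation actually does, here is a small sketch of a single right rotation (the node and function names are my own, not the lab's):

    // Sketch of a single right rotation around `node`.
    template <typename T>
    struct Node {
        T data;
        Node* left = nullptr;
        Node* right = nullptr;
    };

    template <typename T>
    Node<T>* rotate_right(Node<T>* node) {
        Node<T>* new_root = node->left;   // the left child moves up
        node->left = new_root->right;     // its right subtree is re-attached to the old root
        new_root->right = node;           // the old root becomes the right child
        return new_root;                  // the caller re-links this into the tree
    }

A left rotation is the mirror image, and combining the two gives the double rotations AVL trees (and splaying) rely on.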

Now, Splay Trees are a form of binary search tree that keeps the most recently accessed or inserted element at the root of the tree. The idea behind Splay Trees is to have faster access times for recently used items, since they will be either at or close to the top of the tree. The downside of Splay Trees is that operations like insert and remove will be a bit slower, as the tree needs to be restructured every time it is splayed.

The Implementation

Our Splay_BST class inherits from our BST class, since a Splay Tree IS-A binary search tree. There are many different implementations of a Splay Tree out there. What was different about this lab is that we were given the pseudo code of the splay() algorithm and had to write the code for it.

The algorithm we had to code up utilizes Top-Down Splaying, which essentially is splaying from the top of the tree to the bottom. There is another method called Bottom-Up Splaying, but it is more difficult to code than Top-Down Splaying. 

The splay() algorithm uses the same kind of rotations as AVL trees, except we do not worry about the heights of the nodes in our Splay_BST class. splay() also works with three trees: the main tree, plus a left and a right tree. The left and right trees are used in the restructuring, and everything is reassembled at the end of the splaying.
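To give an idea of how the left and right trees are used, here is a hedged sketch of top-down splaying based on the classic formulation of the algorithm (the lab's pseudo code may differ in its details). It reuses the Node struct from the rotation sketch above and assumes T is default-constructible for the temporary header node:

    template <typename T>
    Node<T>* splay(Node<T>* t, const T& key) {
        if (t == nullptr) return nullptr;
        Node<T> header;                      // holds the roots of the left and right trees
        Node<T>* left_max = &header;         // rightmost node of the left tree
        Node<T>* right_min = &header;        // leftmost node of the right tree

        for (;;) {
            if (key < t->data) {
                if (t->left == nullptr) break;
                if (key < t->left->data) {   // zig-zig case: rotate right first
                    Node<T>* y = t->left;
                    t->left = y->right;
                    y->right = t;
                    t = y;
                    if (t->left == nullptr) break;
                }
                right_min->left = t;         // link t (and its right subtree) onto the right tree
                right_min = t;
                t = t->left;
            } else if (t->data < key) {
                if (t->right == nullptr) break;
                if (t->right->data < key) {  // zig-zig case: rotate left first
                    Node<T>* y = t->right;
                    t->right = y->left;
                    y->left = t;
                    t = y;
                    if (t->right == nullptr) break;
                }
                left_max->right = t;         // link t (and its left subtree) onto the left tree
                left_max = t;
                t = t->right;
            } else {
                break;                       // the key is at the root of the remaining middle tree
            }
        }
        left_max->right = t->left;           // reassemble: left tree, t, right tree
        right_min->left = t->right;
        t->left = header.right;
        t->right = header.left;
        return t;                            // the new root of the whole tree
    }

insert(), remove(), and find() would each splay on the key first and then work with whatever ends up at the root.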

Now, splay() is a private method used in operations such as insert(), remove(), and find(). The reason for this is that the user does not need to know about or use splay(), and the functions I just listed need to call splay() to ensure that our Splay Tree stays a Splay Tree.

Lab Questions

Q: How can you tell if splaying has resulted in any benefits? What kind of experiments can you do to test for it?
A: You can probably tell if splaying has resulted in any benefits by comparing the runtimes of insert(), remove(), and find() in binary search trees, AVL trees, and Splay trees. If you were to test these three trees, AVL trees and Splay trees would be somewhat close to each other, since the balancing in AVL trees and the splaying in Splay trees both keep costs close to O(log N) (amortized, in the Splay tree's case). However, Splay trees will be a little faster in tests that access the same items repeatedly, since recently accessed items are at the top of the tree.

Q: Does splaying preserve tree balance according to the AVL criterion? Or any other criterion?
A: I forgot to mention this above, but Splay Trees keep the two conditions they inherit from binary search trees: a structure condition and an ordering condition. AVL trees add a balancing condition on top of those (the heights of each node's left and right subtrees differ by at most 1), while Splay Trees instead add the splay condition (the most recently inserted or accessed node is at the root of the tree). Since splaying never looks at heights, it does not preserve tree balance according to the AVL criterion; an unlucky sequence of operations can leave a Splay Tree quite unbalanced. What splaying does preserve is its own splay condition, which is enough to give good amortized performance over a sequence of operations.

Q: Since find() and contains() will trigger splaying, they cannot be const methods anymore.
A: This is essentially true: since splaying modifies the tree object itself, and const methods do not allow modification of the object calling the method, marking find() and contains() as const would result in compiler errors.

Conclusion

The lab was focused on discovering how Splay Trees operate, using inheritance with generics, and getting a feel for implementing an algorithm for clients (something that is done in the world of software). In short, Splay Trees are great for fast access to recently used items. The downside of Splay trees is that there will be a little more overhead in some operations due to splaying. Hope you guys enjoyed reading about Splay Trees. Leave a comment if you like and thank you for reading!

Saturday, May 4, 2019

Lazy BST

Hello everyone,

I am bringing you another lab this week, this one working closely with binary search trees. This lab involves the implementation of two types of binary search trees: a general binary search tree and a lazy-deletion binary search tree. As always, feel free to leave a comment at the end, and without further ado, let's get into this.

What is a Binary Search Tree?

A binary search tree is an abstract data type that stores items in a "sorted" format. Binary search trees are made up of nodes, with each node containing two pointers, a left child pointer and a right child pointer, as well as the data.

Source: GeeksforGeeks.com

Each node has an ordering condition imposed on it: everything in the left subtree of a parent node is less than the data in the parent node, while everything in the right subtree is greater than the data in the parent node.
Whichever programming language you implement a binary search tree in, it is assumed that the data being inserted into the tree is comparable.

What makes binary search trees useful is that they allow for fast operations, such as searching, because searching through a binary search tree involves going down either the left subtree or the right subtree over and over again, each time cutting the remaining search space roughly in half (as long as the tree is reasonably balanced).

The Lab

The lab itself is essentially the implementation of a regular binary search tree (class name = BST) and a "lazy" binary search tree (class name = LazyBST). LazyBST is built off of BST, but with a few minor additions.

The LazyBST class utilizes a deletion technique known as "lazy deletion," which is essentially marking a node as deleted instead of going through the process of actually deleting the node and restructuring the tree. The advantage of lazy deletion is that it allows for fast deletion of nodes, but the downside is that it increases search and insert times (since the nodes marked as "deleted" are not actually gone) and wastes memory (again, because the marked nodes are not gone).

The Implementation

Starting off with BST, almost all of BST's methods use recursion to traverse the tree and do things like inserting nodes, deleting nodes, deleting the entire tree, and so on. BST includes its own dedicated node class, which stores the pointers to the left and right children along with the user's data. BST's protected members include a size member, which represents the number of nodes in the tree, and a pointer to the root node. BST also has methods such as insert(), remove(), find(), findMin(), getters, and more, most of which use recursion.
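As a rough sketch of the recursive flavour of these methods (the member names are my own; the lab's BST class is likely organised a bit differently):

    #include <cstddef>

    template <typename T>
    class BST {
    protected:
        struct Node {
            T data;
            Node* left = nullptr;
            Node* right = nullptr;
            Node(const T& d) : data(d) {}
        };
        Node* root = nullptr;
        std::size_t size = 0;

        // Recursive helper: `node` is a pointer passed by reference so that a
        // new node can be attached directly to its parent's child pointer.
        bool insert(const T& item, Node*& node) {
            if (node == nullptr) {
                node = new Node(item);
                ++size;
                return true;
            }
            if (item < node->data) return insert(item, node->left);
            if (node->data < item) return insert(item, node->right);
            return false;                      // duplicate, nothing inserted
        }

    public:
        bool insert(const T& item) { return insert(item, root); }
    };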

LazyBST is essentially BST, but it has a few more things to take care of. LazyBST's nodes now contain an "is_deleted" flag in order to implement lazy deletion. LazyBST itself also has a "real_size" member, which stores the real number of nodes within LazyBST (valid and "deleted" nodes), while "size" stores the number of valid nodes. You can essentially recycle the methods within BST, but you have to modify them to keep lazy deletion in mind.
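A minimal sketch of how the node and a lazily-deleting remove() might change (again an illustration under assumed names, not the lab's exact code):

    #include <cstddef>

    template <typename T>
    class LazyBST {
    protected:
        struct Node {
            T data;
            Node* left = nullptr;
            Node* right = nullptr;
            bool is_deleted = false;             // the lazy-deletion flag
            Node(const T& d) : data(d) {}
        };
        Node* root = nullptr;
        std::size_t size = 0;                    // number of non-deleted nodes
        std::size_t real_size = 0;               // every node, deleted or not

        bool remove(const T& item, Node* node) {
            if (node == nullptr) return false;
            if (item < node->data) return remove(item, node->left);
            if (node->data < item) return remove(item, node->right);
            if (node->is_deleted) return false;  // already "removed"
            node->is_deleted = true;             // mark instead of unlinking
            --size;                              // real_size stays the same
            return true;
        }
    };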

Lab Questions

Q: What changes will you have to make to turn LazyBST into a subclass of BST?
A: In LazyBST, we would have to have its node contain a flag of sorts to allow for lazy deletion and add another "size" member that tells us the actual number of nodes within a LazyBST object. The methods inherited from BST would need to be overridden in LazyBST so that they not only check for null pointers but also check for marked nodes. To deallocate some memory after marking many nodes as "deleted," we would also want a garbage collection routine of sorts.

Q: What is the tradeoff in using LazyBST? What are you trying to minimize?
A: With LazyBST, we are trying to minimize the time it takes to delete nodes from the tree, but we end up increasing the amount of time it takes to find and insert nodes, and we also waste memory.

Q: Does Laziness help change the Big-O of any of the other common operations?
A: I would say laziness does not really change the Big-O of the other common operations, since most operations still involve "splitting" the tree in half at each step, giving O(log N) on a reasonably balanced tree; lazy deletion just adds extra marked nodes for those operations to walk past.

Q: BST<T>::find() returns a reference to the payload. But remember you need to pass it the payload in order to find() it. Why do you need to invoke find() on the tree if you already have the payload in your hands?
A: Looking into this, I believe the reason we need to invoke find() on the tree, even if we already have what we want, is that we want to change something about the element that is stored within the tree itself. The payload we have in our hands may contain only the bare minimum needed to locate the data we want to change inside the tree (I believe this occurs with pairs in the C++ STL, where only the key participates in comparisons).

Conclusion

Overall, this lab is mainly about implementing the binary search tree abstract data type. Most of the methods within both BST and LazyBST involve the use of recursion. The methods themselves usually run in O(log N) on a reasonably balanced tree, since each step cuts the remaining portion of the tree roughly in half. LazyBST is essentially BST, but it requires a "deleted" flag in each of its nodes, and the methods need to be modified to handle the situation where they land on a "deleted" node.

That is pretty much all for this week. As always please leave a comment and thank you for reading!