Friday, June 7, 2019

SpecialHeap

Hello everyone,

This week we are going to be looking into binary min-heaps, and how they can be used to find the k smallest elements within a given array of numbers for a given value of k. As always, feel free to leave a comment, and let's get right into this.

What is a Binary (min) Heap?

A binary heap is a data structure that takes the form of a binary tree. A binary heap has two properties that define what it is: a structure property and a heap-order property.

Structure-wise, a binary heap takes the form of a complete binary tree. The heap-order property depends on the kind of binary heap we want to implement.

Since we are going to be using a binary min-heap in our program, our heap-order property will be as follows: the root of the tree (or subtree) is the smallest node in the tree (or subtree). Basically, the smallest value in the entire heap is at the root of the tree.

We will be able to take advantage of this heap-order property when finding the k smallest elements in a given array. In our implementation, we will be returning an array with the k smallest elements at the k greatest indices (for example, if we have an array [1, 2, 3, 4, 5, 6] and k = 2, then the result array would be [3, 4, 5, 6, 2, 1]).

The Implementation

In this lab, we will be implementing two template classes: a Heap class and a SpecialHeap class. Our Heap class will be our base binary min-heap class, while the SpecialHeap class inherits from our Heap class. SpecialHeap will contain one unique method called get_k_smallest(), which will find the k smallest elements in the array and format them as mentioned in the last paragraph of the "What is a Binary (min) Heap?" section.

Our Heap class will represent a complete binary tree in the form of a vector. This is possible by having our root node at index 1 of the vector, the parent of the node at index k at k/2, its left child at 2k, and its right child at 2k+1.

The reason why we are using a vector in our implementation of a binary min-heap instead of a binary tree is that it is simpler to implement, while still mimicking what a binary tree does. We cannot use a binary search tree either, because the heap-order property would violate the binary search tree's order property.
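The index arithmetic described above can be sketched directly. This is a minimal illustration; the helper names are mine, not from the lab:

```cpp
#include <cstddef>

// 1-based index arithmetic for a complete binary tree stored in a vector,
// as described above: root at index 1, parent of k at k/2, children at
// 2k and 2k + 1. (Helper names are illustrative.)
inline std::size_t parent(std::size_t k)      { return k / 2; }
inline std::size_t left_child(std::size_t k)  { return 2 * k; }
inline std::size_t right_child(std::size_t k) { return 2 * k + 1; }
```

With this layout, moving between a node and its relatives is just integer arithmetic, with no pointers to maintain.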

Our vector in our Heap class will have a size of 2^N, where N is some value, so we have enough room to store a complete binary tree. Our vector will be initialized with the appropriate size in our default and parameterized constructors. Within our Heap class we will also have a "size" data member, which keeps track of the number of items that exist within our heap.

Our Heap class will have two private methods called _heapify() and _percolate_down(). _percolate_down() takes a vector index as input and ensures that the element at the given index is in the correct place in the heap according to the heap-order property. _heapify() will be used in our parameterized constructor, and it ensures that an unordered heap becomes ordered. This is done by calling _percolate_down() on each internal node, starting from index size/2 and working down to the root at index 1.
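As a rough sketch of what these two helpers might look like, here is a non-template version over std::vector&lt;int&gt;, written as free functions. The real lab methods are private members of a template class, so treat this as my own approximation of the bodies:

```cpp
#include <cstddef>
#include <vector>

// Approximate sketch of the two private helpers. Index 0 is the sentinel
// slot, the root is at index 1, and `size` counts the real elements.
void percolate_down(std::vector<int>& heap, std::size_t size, std::size_t k) {
    int value = heap[k];                 // element being sifted down
    while (2 * k <= size) {
        std::size_t child = 2 * k;       // left child
        if (child < size && heap[child + 1] < heap[child])
            ++child;                     // right child is smaller
        if (heap[child] < value) {
            heap[k] = heap[child];       // move the smaller child up
            k = child;
        } else {
            break;
        }
    }
    heap[k] = value;                     // drop the element into place
}

void heapify(std::vector<int>& heap, std::size_t size) {
    // Order an arbitrary array by percolating down every internal node,
    // starting at size / 2 and ending at the root (index 1).
    for (std::size_t i = size / 2; i >= 1; --i)
        percolate_down(heap, size, i);
}
```

Calling heapify() on an arbitrary array this way runs in O(N), which is why the parameterized constructor uses it instead of inserting elements one at a time.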

We have two constructors: one default constructor and one parameterized constructor that takes in a vector as input. The default constructor initializes our internal vector to a default size of 128 (to satisfy the 2^N requirement) and our size to 0. The parameterized constructor will set the size data member to the size of the input vector, set our internal vector to the size of the input vector + 1, copy all the items from the input vector into our internal vector, and then call _heapify() to order our heap.

In both of the constructors, since the root starts at position 1, we fill position 0 with a sentinel value. The sentinel value is the lowest possible value of the data type that our template class is instantiated for.

The rest of the methods are as follows:
  • is_empty(): Tells the user whether the heap is empty
  • insert(): Inserts an element into the heap
  • delete_min(): Deletes the smallest element in the heap (i.e., the root)
  • peek_min(): Returns the smallest element in the heap
  • to_string(): Returns a string representation of the heap
Now for our SpecialHeap. Since our SpecialHeap class inherits from our Heap class, we don't have much to implement. All we have to do is implement our get_k_smallest() method. Within get_k_smallest(), we do the following steps k times:
  1. Get the smallest element through peek_min().
  2. Delete the smallest element through delete_min().
  3. Copy the smallest element into the just-vacated cell of the heap (index size + 1).
Once we are done with this, we set the size to 0 and return our internal heap vector. Now we are able to return an array (or, in this case, a vector) where the k smallest elements are at the k greatest indices.
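Putting the three steps together, here is a condensed, non-template sketch of get_k_smallest() over std::vector&lt;int&gt;, with free functions standing in for the Heap/SpecialHeap methods (the bodies are my own reconstruction):

```cpp
#include <cstddef>
#include <vector>

// Helper: restore heap order below index k (same role as _percolate_down()).
void percolate_down(std::vector<int>& heap, std::size_t size, std::size_t k) {
    int value = heap[k];
    while (2 * k <= size) {
        std::size_t child = 2 * k;
        if (child < size && heap[child + 1] < heap[child]) ++child;
        if (heap[child] < value) { heap[k] = heap[child]; k = child; }
        else break;
    }
    heap[k] = value;
}

// Remove and return the root (the minimum), shrinking the heap by one.
int delete_min(std::vector<int>& heap, std::size_t& size) {
    int min = heap[1];
    heap[1] = heap[size--];          // last element fills the root...
    percolate_down(heap, size, 1);   // ...and sinks to its proper spot
    return min;
}

// k times: peek the min, delete it, then copy it into the cell the deletion
// just vacated (index size + 1). The k smallest values end up at the k
// greatest indices. heap[0] is the sentinel; `size` counts real elements.
std::vector<int> get_k_smallest(std::vector<int> heap, std::size_t size,
                                std::size_t k) {
    for (std::size_t i = 0; i < k; ++i) {
        int smallest = heap[1];      // step 1: peek_min()
        delete_min(heap, size);      // step 2: delete_min()
        heap[size + 1] = smallest;   // step 3: fill the vacated cell
    }
    return heap;                     // the class version also zeroes `size`
}
```

Running this on the earlier example (heap contents [1, 2, 3, 4, 5, 6], k = 2) reproduces the result [3, 4, 5, 6, 2, 1] at indices 1 through 6, with the two smallest values at the two greatest indices.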

Lab Questions

Q: Why would you use Quickselect (or Quicksort) at all instead of simply using a special heap (or Heapsort)?
A: I would say you would use Quickselect (or Quicksort) instead of a special heap (or Heapsort) because it uses less memory while achieving similar run times. Although both our SpecialHeap and our Quicksort method from last week do in-place operations, SpecialHeap uses more memory because it has to store the heap, data members, and class methods, while Quicksort can be implemented with just two functions (as seen in last week's post).

Q: Why not use a balanced binary search tree for your backing store? Which of our original requirements (on page 1) would this implementation satisfy? What is the tradeoff, if any?
A: We do not use a balanced binary search tree because our heap-order property would conflict with the order property of a binary search tree. If it were possible, this kind of implementation would probably satisfy the run-time complexity requirement of achieving O(k log N), but not the space requirement of O(1). The tradeoff would probably be the same speed as the vector implementation, but with more memory used.

Conclusion

As we can see, a binary min-heap allows us to find the k smallest elements in an array, given a value of k. Although we only talked about binary min-heaps, we can modify our Heap class in a way that allows for the creation of binary max-heaps, which have a heap-order property stating that the root of a tree (or subtree) is the biggest node in the tree (or subtree). I hope you enjoyed this week's post. Feel free to leave a comment and I will see you next week!

Saturday, June 1, 2019

K'th Smallest Element

Hello everyone,

This week we will be looking into the K'th Smallest Element problem. This week's lab gives us a break from implementing data types, allowing us to look into some sorting algorithms. The sorting algorithm we will be discussing is the Quick Sort algorithm. As always, feel free to leave a comment below, and now let's get right into it.

The K'th Smallest Element Problem

Given an array and a value of K, we need to find the K'th smallest element within the array. For example, if K = 4, then I need to find the 4th smallest element within the array. One solution to this problem would be to sort the array in ascending order and return the K'th element of that array (since it is our K'th smallest element).
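As a quick sketch of this baseline approach (the function name is my own, for illustration):

```cpp
#include <algorithm>
#include <vector>

// Baseline O(N log N) approach: sort ascending, then the K'th smallest
// element (1-based K) sits at index K - 1. This is the "slow" solution
// the faster Quickselect approach improves on.
int kth_smallest_by_sorting(std::vector<int> v, int k) {
    std::sort(v.begin(), v.end());   // full sort: this is the expensive part
    return v[k - 1];
}
```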

However, the run time of this solution will be slow because we are doing a complete sort before returning the K'th smallest value in the array. We can speed this up by not doing a complete sort, as we will see in the implementation section. For now, let's get into what Quick Sort is, since we are going to use it to solve this problem.

What is Quick Sort?

Quick Sort is a sorting method where we choose a number called a pivot and split the array of elements into two sub arrays: the left sub array containing elements < pivot and the right sub array containing elements > pivot.
Quick Sort Visualized
The act of splitting a single array into two sub arrays in quick sort is called partitioning. The idea behind quick sort is that it is a recursive sorting function. Each quick sort call results in one call to a partition function and two more quick sort calls, one each for the left and right sub arrays we created. With each recursive call, we get smaller and smaller sub arrays, but we also get closer and closer to a sorted array. Once all the recursive calls return back to the very first quick sort call, we have a sorted array.

Now, there are different ways you can go about partitioning the given array. The method we are going to be using is called the Lomuto Partitioning Scheme, and it is considered to be one of the simplest partitioning methods out there for quick sort.

What is the Lomuto Partitioning Scheme?

The Lomuto Partitioning Scheme starts out with a pivot, a cursor, and a scanner. Our cursor and scanner both start on the left side of the given array, with the scanner moving from the left side to the right side of the array. When the scanner finds a value less than the pivot, that value is swapped with the element at the cursor. After the swap, the cursor moves up one index and the scanner keeps moving along the array. After the scanner is done traversing the array, we do a final swap of the element located at the cursor with our pivot.

The idea behind this method is that we create the two sub arrays mentioned before. The final swap at the end makes sure that our pivot ends up in a spot where the elements on its left are less than the pivot and the elements on its right are greater than the pivot.
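The cursor-and-scanner walk described above can be sketched as follows, in the classic Lomuto form where the pivot is taken from the rightmost cell (names are illustrative):

```cpp
#include <utility>
#include <vector>

// Classic Lomuto partition: pivot is the rightmost element. Returns the
// pivot's final index; everything to its left is smaller, everything to
// its right is greater or equal.
int lomuto_partition(std::vector<int>& a, int start, int end) {
    int pivot = a[end];
    int cursor = start;
    for (int scanner = start; scanner < end; ++scanner) {
        if (a[scanner] < pivot) {
            std::swap(a[cursor], a[scanner]);  // move small element left
            ++cursor;
        }
    }
    std::swap(a[cursor], a[end]);              // final swap: pivot into place
    return cursor;
}
```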


The Implementation

In this lab we are going to implement three different functions: one will be our partitioning function (which we will call partition()), while the other two will be used to find the K'th smallest element (both will be called find_kth_least(), but each has different parameters).

Our partition() function will take in three parameters: a vector, a start index, and an end index. We have start and end parameters because of the two sub arrays created with every call to the partition function. The partition() function will be an implementation of the Lomuto Partitioning Scheme. The only difference is that the pivot will be the element in the middle of each sub array (meaning (start + end) / 2) instead of the rightmost element of the sub array, as seen in typical implementations of the Lomuto Partitioning Scheme. However, to keep our partition() in line with the typical Lomuto Partitioning Scheme, we will swap our pivot with the element at the right end of the sub array before scanning.
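Here is my own hedged reconstruction of that variant: pick the middle element as the pivot, swap it to the right end, then run the usual Lomuto scan. partition() is the name used in the lab, but this body is an approximation:

```cpp
#include <utility>
#include <vector>

// Middle-pivot Lomuto partition: move the middle element to the right end,
// then scan as usual. Returns the pivot's final resting index.
int partition(std::vector<int>& a, int start, int end) {
    std::swap(a[(start + end) / 2], a[end]);   // middle pivot to the right end
    int pivot = a[end];
    int cursor = start;
    for (int scanner = start; scanner < end; ++scanner) {
        if (a[scanner] < pivot)
            std::swap(a[cursor++], a[scanner]);
    }
    std::swap(a[cursor], a[end]);              // pivot into its final spot
    return cursor;
}
```

Choosing the middle element rather than the rightmost one helps avoid the worst-case behavior on arrays that are already sorted.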

Now for our two find_kth_least() functions. One of them will be considered our "public API" function, which takes in a vector and a k value. This "public API" function will validate the inputs and then make a call to our second find_kth_least() function. This second find_kth_least() function uses recursion to determine the K'th smallest element in the array.

Our recursive find_kth_least() function takes in four parameters: a vector, a start value, an end value, and a k value. Within our recursive find_kth_least(), we first check that a valid sub array was passed in by checking that our start value is less than or equal to our end value. From there, we call our partition() function, which returns the location of the pivot. If the location of the pivot is equal to k, then we have found our kth smallest element. Otherwise, we check whether the location of the pivot is bigger or smaller than k, which determines which sub array the next recursive call will search.
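The pair of functions described above might look roughly like this. I use a 0-based k internally and a 1-based k in the public overload; the lab's exact conventions and error handling may differ:

```cpp
#include <utility>
#include <vector>

// Middle-pivot Lomuto partition (same scheme described above).
int partition(std::vector<int>& a, int start, int end) {
    std::swap(a[(start + end) / 2], a[end]);
    int pivot = a[end];
    int cursor = start;
    for (int scanner = start; scanner < end; ++scanner)
        if (a[scanner] < pivot)
            std::swap(a[cursor++], a[scanner]);
    std::swap(a[cursor], a[end]);
    return cursor;
}

// Recursive worker: k is a 0-based target index into the sorted order.
int find_kth_least(std::vector<int>& a, int start, int end, int k) {
    if (start > end) return -1;                // invalid sub array
    int p = partition(a, start, end);
    if (p == k) return a[p];                   // pivot landed on index k
    if (p > k)  return find_kth_least(a, start, p - 1, k);  // search left
    return find_kth_least(a, p + 1, end, k);                // search right
}

// "Public API" overload: validate inputs, then recurse over the whole array.
int find_kth_least(std::vector<int> a, int k) {
    if (a.empty() || k < 1 || k > static_cast<int>(a.size())) return -1;
    return find_kth_least(a, 0, static_cast<int>(a.size()) - 1, k - 1);
}
```

Note that each recursive call descends into only one of the two sub arrays, which is what makes this faster than a full Quicksort.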

The reason we want our pivot to be at the kth index of the array is because of the properties of the pivot. All the elements to the left of the pivot are less than the pivot itself. So if the pivot lands at the kth location of the array, we know it is the kth smallest element of the entire array, without sorting the entire array.

Lab Questions

Q: Suppose you want to calculate the median of a given array of integers, how would you use your pivoting strategy to do it?
A: To calculate the median of a given array of integers, I would partition the array until the location of the pivot is in the middle of the given array. The reason is that the median of a list of numbers can be seen as a pivot, since the numbers to the left of the median are smaller than the median and everything to the right of the median is bigger than the median. So, we just need to partition our array until the location of the pivot is in the middle of the array.

Q: Does your strategy yield the k smallest elements or the kth smallest element?
A: This strategy yields both the k smallest elements and the K'th smallest element. The K'th smallest element, as said before, is a pivot whose location is the kth index of the array. Everything to the left of that pivot is smaller than the pivot itself, meaning those elements are the k smallest elements in the array.

Q: We saw that this algorithm is O(N) in the number of elements. Can you do better (meaning sub-linear)? Why or why not?
A: I would say that it is not possible to have an algorithm that can do this with a sub-linear time complexity. The reason is our partition() function, which requires a linear traversal of the entire sub array when partitioning, even though the sub arrays themselves get smaller and smaller with each call.

Q: Since sorting solves the problem in O(N log N) time, your algorithm probably doesn't sort the input (otherwise you can get a sub-log-linear sort algorithm, right)? What is the trade-off? What are you giving up to get the run-time advantage? Looking at this from the other direction, ask what more a sorting algorithm does that goes unused by you (and is thus safe to omit)? What is the time complexity of just this "omitted" operation in a full fledged sort?
A: I would say the trade-off of not doing a complete sort to increase speed is an unsorted array. Our algorithm at the moment will return the K'th smallest element of the array along with a partially sorted array (since our vector parameters are actually references). The drawback is that we could be completing two tasks at the same time: finding a K'th smallest element and sorting the array. If we were to find another K'th smallest element in the same array while it is partially sorted, it would take more time than finding another K'th smallest element in the same array when it is completely sorted.

Q: How can you get the k smallest elements in the array? What is the time complexity of that procedure? What if you wanted these elements in sorted order?
A: With our current implementation, to find the k smallest elements in an array you would find the K'th smallest element in the array (as discussed in the second question of this list). The time complexity of this procedure would be about the same as our current implementation of find_kth_least(). If you wanted these elements in sorted order, the easy way would be to sort just those k elements; however, it could affect the overall run time.

Q: Why is it important to make sure that Quicksort does not degrade to O(n^2) on an array of identical values?
A: It is important to make sure that Quicksort does not degrade to O(n^2) on an array of identical values because we would be wasting time and resources on an array that is technically already sorted. We can figure out whether an array is filled with the same value by doing a quick O(n) traversal check before starting Quicksort.

Conclusion

The solution to the K'th Smallest Element problem in this case is similar to the implementation of Quicksort. The only difference is that we are not looking to sort the array given to us; rather, we are looking for a pivot that is the K'th smallest element. I hope you guys enjoyed reading this post. Feel free to leave a comment and I will see you guys later!