diff --git a/lectures/22_hash_tables_I/README.md b/lectures/22_hash_tables_I/README.md
new file mode 100644
index 0000000..dbda5ae
--- /dev/null
+++ b/lectures/22_hash_tables_I/README.md
@@ -0,0 +1,347 @@
+# Lecture 23 --- Hash Tables
+
+## Today’s Lecture
+
+- Hash Tables, Hash Functions, and Collision Resolution
+- Performance of: Hash Tables vs. Binary Search Trees
+- Collision resolution: separate chaining vs open addressing
+- STL’s unordered_set and unordered_map
+- Using a hash table to implement a set/map
+
+
+## 23.1 Definition: What’s a Hash Table?
+
+- A table implementation with expected constant-time access.
+  - Like a set, we can store elements in a collection. Or like a map, we can store key-value pair associations in the hash table. But it’s even faster to do find, insert, and erase with a hash table! However, hash tables do not store the data in sorted order.
+- A hash table is implemented with an array at the top level.
+- Each element or key is mapped to a slot in the array by a hash function.
+
+## 23.2 Definition: What’s a Hash Function?
+
+- A simple function of one argument (the key) which returns an integer index (a bucket or slot in the array).
+- Ideally the function will “uniformly” distribute the keys throughout the range of legal index values (0 → k-1, for a table with k slots).
+- What’s a collision?
+  - When the hash function maps multiple (different) keys to the same index.
+- How do we deal with collisions?
+  - One way to resolve this is by storing a linked list of values at each slot in the array.
+
+## 23.3 Example: Caller ID
+
+- We are given a phonebook with 50,000 name/number pairings. Each number is a 10-digit number. We need to
+create a data structure to look up the name matching a particular phone number. Ideally, name lookup should
+take expected O(1) time, and the caller ID system should use O(n) memory (n = 50,000).
+- Note: In the toy implementations that follow we use small datasets, but we should evaluate the system scaled
+up to handle the large dataset.
+- The basic interface:
+
+```cpp
+// add several names to the phonebook
+add(phonebook, 1111, "fred");
+add(phonebook, 2222, "sally");
+add(phonebook, 3333, "george");
+// test the phonebook
+std::cout << identify(phonebook, 2222) << " is calling!" << std::endl;
+std::cout << identify(phonebook, 4444) << " is calling!" << std::endl;
+```
+
+
+
+## 23.4 Caller ID with an STL Vector
+
+```cpp
+// create an empty phonebook
+std::vector<std::string> phonebook(10000, "UNKNOWN CALLER");
+
+void add(std::vector<std::string> &phonebook, int number, std::string name) {
+    phonebook[number] = name;
+}
+
+std::string identify(const std::vector<std::string> &phonebook, int number) {
+    return phonebook[number];
+}
+```
+
+Exercise: What’s the memory usage for the vector-based Caller ID system?
+What’s the expected running time for identify, insert, and erase?
+
+## 23.5 Caller ID with an STL Map
+
+```cpp
+// create an empty phonebook
+std::map<int, std::string> phonebook;
+void add(std::map<int, std::string> &phonebook, int number, std::string name) {
+    phonebook[number] = name;
+}
+
+std::string identify(const std::map<int, std::string> &phonebook, int number) {
+    std::map<int, std::string>::const_iterator tmp = phonebook.find(number);
+    if (tmp == phonebook.end()){
+        return "UNKNOWN CALLER";
+    }else{
+        return tmp->second;
+    }
+}
+```
+
+Exercise: What’s the memory usage for the map-based Caller ID system?
+What’s the expected running time for identify, add, and erase?
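The map-based approach from 23.5 can be put together with the interface from 23.3 as one compilable sketch. The `Phonebook` typedef, the template arguments `<int, std::string>`, and the helper `caller_id_demo()` are assumptions introduced here for illustration; they are not part of the lecture code.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Assumed key/value types: phone number -> caller name.
typedef std::map<int, std::string> Phonebook;

// Insert a new entry, or overwrite the name if the number is already present.
void add(Phonebook& phonebook, int number, const std::string& name) {
    phonebook[number] = name;  // O(log n) in a balanced tree
}

// Look up a number; unknown numbers get a fixed placeholder name.
std::string identify(const Phonebook& phonebook, int number) {
    Phonebook::const_iterator it = phonebook.find(number);  // O(log n)
    if (it == phonebook.end()) return "UNKNOWN CALLER";
    return it->second;
}

// Hypothetical helper: exercises the 23.3 interface and returns
// the text that the two std::cout lines would print.
std::string caller_id_demo() {
    Phonebook phonebook;
    add(phonebook, 1111, "fred");
    add(phonebook, 2222, "sally");
    add(phonebook, 3333, "george");
    assert(phonebook.size() == 3);

    std::ostringstream out;
    out << identify(phonebook, 2222) << " is calling!\n";
    out << identify(phonebook, 4444) << " is calling!\n";
    return out.str();
}
```

Calling `caller_id_demo()` from `main` and printing the result should show one known caller and one unknown caller. Memory grows with the number of stored entries rather than with the range of possible phone numbers, which is the main contrast with the vector version.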
+
+## 23.6 Now let’s implement Caller ID with a Hash Table
+
+![alt text](phonebook.png "phonebook")
+
+```cpp
+#define PHONEBOOK_SIZE 10
+
+class Node {
+public:
+    int number;
+    std::string name;
+    Node* next;
+};
+
+// create the phonebook, initially all numbers are unassigned
+Node* phonebook[PHONEBOOK_SIZE];
+for (int i = 0; i < PHONEBOOK_SIZE; i++) {
+    phonebook[i] = NULL;
+}
+
+// maps a phone number to a slot in the array
+int hash_function(int number) {
+
+
+
+
+
+}
+
+// add a number, name pair to the phonebook
+void add(Node* phonebook[PHONEBOOK_SIZE], int number, std::string name) {
+
+
+
+
+
+
+}
+
+// given a phone number, determine who is calling
+std::string identify(Node* phonebook[PHONEBOOK_SIZE], int number) {
+
+
+
+
+
+
+}
+```
+
+## 23.7 Exercise: Choosing a Hash Function
+
+- What’s a good hash function for this application?
+
+- What’s a bad hash function for this application?
+
+## 23.8 Exercise: Hash Table Performance
+
+- What’s the memory usage for the hash-table-based Caller ID system?
+
+- What’s the expected running time for identify, insert, and erase?
+
+## 23.9 What makes a Good Hash Function?
+
+- Goals: fast O(1) computation and a random, uniform distribution of keys throughout the table,
+regardless of the actual distribution of keys that are to be stored.
+- For example, using: f(k) = abs(k)%N as our hash function satisfies the first requirement, but may not
+satisfy the second.
+- Another example of a dangerous hash function on string keys is to add or multiply the ASCII values of each char:
+```cpp
+unsigned int hash(std::string const& k, unsigned int N) {
+    unsigned int value = 0;
+    for (unsigned int i = 0; i < k.size(); ++i) {
+        value += k[i];  // add the ASCII value of each char
+    }
+    return value % N;
+}
+```
+
+  – Quadratic probing: if i is the hash location, then the table locations (i + c1*j + c2*j^2) mod N are tested for j = 1, 2, 3, ..., where c1 and c2 are constants.
+
+  – Secondary hashing: when a collision occurs a second hash function is applied to compute a new table location. This is repeated until an empty location is found.
+
+- For each of these approaches, the find operation follows the same sequence of locations as the insert operation.
+The key value is determined to be absent from the table only when an empty location is found.
+- When using open addressing to resolve collisions, the erase function must mark a location as “formerly occupied”. If a location is instead marked empty, find may stop early at that location and fail to return elements that are stored further along the probe sequence. Formerly occupied locations may (and should) be reused by insert, but only after the find operation has been run to completion to confirm the key is not already in the table.
+- Problems with open addressing:
+
+  – Slows dramatically when the table is nearly full (e.g. about 80% or higher). This is particularly problematic for linear probing.
+
+  – Fails completely when the table is full.
+
+  – Cost of computing new hash values.
+
+## 23.12 Hash Table in STL?
+
+- The standard for, and the implementations of, hash tables in the Standard Template Library have been slowly evolving over
+many years. Unfortunately, the names “hashset” and “hashmap” were spoiled by developers anticipating the
+STL standard, so to avoid breaking or having name clashes with code using these early implementations...
+- STL’s agreed-upon standard for hash tables: unordered_set and unordered_map.
+- You can use std::unordered_set much the same way as you use std::set: the internals of the two are different, but the external interface is largely the same. The main visible difference is that iteration does not visit elements in sorted order.
+- You can use std::unordered_map much the same way as you use std::map: the internals of the two are different, but the external interface is largely the same. The main visible difference is that iteration does not visit keys in sorted order.
+
+
+
+
+## 23.13 Leetcode Exercises
+
+- [Leetcode problem 1: Two Sum](https://leetcode.com/problems/two-sum/). Solution: [p1_twosum_hash_table.cpp](../../leetcode/p1_twosum_hash_table.cpp).
+
+**Note**: make sure you understand this longest consecutive sequence problem and its solution, because you will re-write this function in the lab.
+- [Leetcode problem 128: Longest Consecutive Sequence](https://leetcode.com/problems/longest-consecutive-sequence/). Solution: [p128_longest_consecutive_sequence.cpp](../../leetcode/p128_longest_consecutive_sequence.cpp).
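As a minimal sketch of the point made in 23.12, assuming the same `add`/`identify` interface as the earlier Caller ID examples: only the container type changes when moving from `std::map` to `std::unordered_map`, and the function bodies stay the same. The `Phonebook` typedef and the `unordered_demo()` helper are hypothetical names introduced for illustration, and the code assumes a C++11 (or later) compiler.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Assumed key/value types: phone number -> caller name.
// Swapping this typedef back to std::map<int, std::string> would also compile.
typedef std::unordered_map<int, std::string> Phonebook;

// Insert a new entry, or overwrite the name if the number is already present.
void add(Phonebook& phonebook, int number, const std::string& name) {
    phonebook[number] = name;  // expected O(1)
}

// Look up a number; unknown numbers get a fixed placeholder name.
std::string identify(const Phonebook& phonebook, int number) {
    Phonebook::const_iterator it = phonebook.find(number);  // expected O(1)
    if (it == phonebook.end()) return "UNKNOWN CALLER";
    return it->second;
}

// Hypothetical helper: looks up one stored and one missing number.
std::string unordered_demo() {
    Phonebook phonebook;
    add(phonebook, 1111, "fred");
    add(phonebook, 2222, "sally");
    add(phonebook, 3333, "george");
    assert(phonebook.size() == 3);
    return identify(phonebook, 2222) + " / " + identify(phonebook, 4444);
}
```

Code that depends on visiting keys in sorted order still needs `std::map`, since the iteration order of `std::unordered_map` is unspecified.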
+
diff --git a/lectures/22_hash_tables_I/hash_phonebook_code.cpp b/lectures/22_hash_tables_I/hash_phonebook_code.cpp
new file mode 100644
index 0000000..314fedd
--- /dev/null
+++ b/lectures/22_hash_tables_I/hash_phonebook_code.cpp
@@ -0,0 +1,26 @@
+
+
+int hash_function(int number) {
+    //return 5;                      /// BAD: every number maps to the same slot
+    //return number / 1000000000;    /// BAD: uses only the first digit of a 10-digit number
+    return number % PHONEBOOK_SIZE;  // REASONABLY GOOD: depends on the low-order digit(s)
+}
+
+void add(Node* phonebook[PHONEBOOK_SIZE], int number, const std::string& name) {
+    int index = hash_function(number) % PHONEBOOK_SIZE;
+    Node* tmp = new Node;
+    tmp->name = name;
+    tmp->number = number;
+    tmp->next = phonebook[index];  // push onto the front of this slot's chain
+    phonebook[index] = tmp;
+    // what about duplicate / repeated add?
+}
+
+std::string identify(Node* phonebook[PHONEBOOK_SIZE], int number) {
+    Node* current = phonebook[ hash_function(number) % PHONEBOOK_SIZE ];
+    while ( current != NULL && current->number != number ) {
+        current = current->next;
+    }
+    if (current == NULL) return "UNKNOWN CALLER";
+    return current->name;
+}
diff --git a/lectures/22_hash_tables_I/phonebook.png b/lectures/22_hash_tables_I/phonebook.png
new file mode 100644
index 0000000..cc3b9e8
Binary files /dev/null and b/lectures/22_hash_tables_I/phonebook.png differ