Summary of "Хэш-таблицы за 10 минут" ("Hash Tables in 10 Minutes")

This video provides a concise and clear explanation of hash tables, their purpose, how they work, and common issues and solutions related to them. It also discusses important properties of hash functions and compares collision resolution methods.

Main Ideas and Concepts

What is a Hash Table? A hash table is a data structure that retrieves a value by its key in roughly constant time on average, largely independent of how much data is stored. Hash tables are widely used in programming languages (to implement associative arrays such as dictionaries and maps), database indexing, and compilers.
Basic Problem: Searching by Key. Checking every entry in turn (linear search) is slow for large datasets. Hash tables solve this by calculating an index (position) directly from the key using a hash function, allowing near-instant access.
Hash Function
- Converts a key (e.g., a string like a name) into a numeric index.
- Example method: sum the numeric codes of characters in the string, then take the remainder modulo the table size to get a valid index.
- Acts like a “black box” converting text keys to numbers.
- Different hash functions exist and can be evaluated by specific criteria.
Collision: A collision occurs when two different keys produce the same index. Because there are usually far more possible keys than cells in the table, collisions are inevitable and must be handled.
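
As a minimal sketch of the example method described above (sum of character codes, then modulo the table size), the following Python snippet also shows how easily it collides. The name `simple_hash` and the table size of 10 are assumptions made for illustration, not taken from the video.

```python
def simple_hash(key: str, table_size: int) -> int:
    """Sum the character codes of the key, then reduce modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size


TABLE_SIZE = 10  # illustrative table size

# Anagrams contain the same characters, so their sums (and indexes) are equal.
print(simple_hash("listen", TABLE_SIZE))  # 5
print(simple_hash("silent", TABLE_SIZE))  # 5 -> collision with "listen"
```

Any two keys made of the same characters collide under this scheme, which is one reason a table needs a collision resolution strategy.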

Collision Resolution Methods

Open Addressing (Linear Probing and Variants)
- If a collision occurs, find the next free cell by applying a sequence of transformations (called a probe sequence).
- Variants include linear probing, quadratic probing, double hashing, etc.
- When inserting, if the computed index is occupied, apply the probe sequence until an empty cell is found.
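- Deletion is handled by marking cells as deleted rather than emptying them (lazy deletion), so that probe sequences passing through the cell are not cut short.
- Over time, deleted cells accumulate and slow down operations; rehashing (reinserting the live entries into a new table) is used to clean them up.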
- Pros:
  - All data stored in one array (cache-friendly)
  - Less memory overhead (no pointers)
- Cons:
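  - Performance depends heavily on the probe sequence and on how full the table is
  - The table must be resized as it fills up, since it cannot hold more entries than it has cells and probing slows sharply near capacity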
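
To make the probing variants concrete, here is a rough Python sketch of three probe sequences. The function names and the exact formulas (step `i`, step `i * i`, and a fixed step taken from a second hash) are common textbook choices assumed here for illustration; the video may use different ones.

```python
from typing import Iterator


def linear_probe(start: int, size: int) -> Iterator[int]:
    """Visit start, start+1, start+2, ... wrapping around the table."""
    for i in range(size):
        yield (start + i) % size


def quadratic_probe(start: int, size: int) -> Iterator[int]:
    """Visit start, start+1, start+4, start+9, ... wrapping around the table."""
    for i in range(size):
        yield (start + i * i) % size


def double_hash_probe(start: int, step: int, size: int) -> Iterator[int]:
    """Advance by a key-dependent step (assumed to come from a second hash).

    The step must be nonzero and share no common factor with the table size,
    otherwise some cells are never visited.
    """
    for i in range(size):
        yield (start + i * step) % size


print(list(linear_probe(5, 8)))          # [5, 6, 7, 0, 1, 2, 3, 4]
print(list(quadratic_probe(5, 8)))       # [5, 6, 1, 6, 5, 6, 1, 6]
print(list(double_hash_probe(5, 3, 8)))  # [5, 0, 3, 6, 1, 4, 7, 2]
```

Note that this simple quadratic variant reaches only three distinct cells in a table of size 8, which illustrates why performance depends so heavily on the probe sequence and the table size.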
Chaining (Linked Lists)
- Each cell in the hash table stores a pointer/link to a list of entries that hash to the same index.
- Collisions are resolved by appending conflicting entries to the linked list.
- Searching involves traversing the linked list at the hashed index.
- Pros:
  - Simple to implement
  - Flexible table size: the table can hold more entries than it has cells, because each list can grow
  - No clustering issues of the kind seen in open addressing
- Cons:
  - Extra memory overhead for pointers
  - Less cache-friendly due to scattered memory access
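
The chaining approach can be sketched roughly as follows. This is an illustrative sketch, not the video's implementation: the class name `ChainedHashTable` is hypothetical, the hash is the sum-of-character-codes example, and a Python list stands in for each linked list to keep the code short.

```python
class ChainedHashTable:
    """Hash table with separate chaining (a Python list stands in for each chain)."""

    def __init__(self, size: int = 8) -> None:
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: str) -> int:
        # Sum of character codes modulo the number of buckets.
        return sum(ord(ch) for ch in key) % len(self.buckets)

    def put(self, key: str, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))  # new key: append to the chain

    def get(self, key: str, default=None):
        # Traverse the chain at the hashed index.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

    def remove(self, key: str) -> bool:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(size=4)
table.put("listen", 1)
table.put("silent", 2)  # same index as "listen"; both live in one chain
print(table.get("listen"), table.get("silent"))  # 1 2
```

Deletion needs no special markers here: removing an entry from its chain does not affect lookups for other keys.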

Properties of a Good Hash Function

Determinism: The same key must always produce the same index.
Uniformity: Keys should be distributed evenly across the table to minimize collisions.
Efficiency: The hash function should compute the index quickly to maintain fast access.
Range Limitation: The output index must always be within the bounds of the table size.
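
One informal way to check these properties for a candidate hash function is to count how many keys land in each cell. The sketch below is an assumption-laden illustration: `sum_codes_hash` is the sum-of-character-codes example, `bucket_loads` is a hypothetical helper, and the sample keys are made up.

```python
from collections import Counter
from typing import Callable, Iterable, List


def sum_codes_hash(key: str, table_size: int) -> int:
    """Example hash: sum of character codes modulo the table size."""
    return sum(ord(ch) for ch in key) % table_size


def bucket_loads(
    keys: Iterable[str],
    table_size: int,
    hash_fn: Callable[[str, int], int] = sum_codes_hash,
) -> List[int]:
    """Return how many keys fall into each cell, checking two properties on the way."""
    counts: Counter = Counter()
    for key in keys:
        index = hash_fn(key, table_size)
        assert index == hash_fn(key, table_size)  # determinism
        assert 0 <= index < table_size            # range limitation
        counts[index] += 1
    return [counts[i] for i in range(table_size)]


# Uneven loads hint at poor uniformity for this particular key set.
print(bucket_loads(["ab", "ba", "abc", "cab", "bca", "d"], 4))  # [1, 0, 3, 2]
```

A perfectly uniform function would spread these six keys more evenly over four cells; here the anagrams pile up in the same cells.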

Additional Notes

The video briefly mentions that there are many hash functions and collision resolution strategies beyond those discussed, encouraging further reading.
The video concludes by encouraging viewers to like and comment for feedback.

Detailed Methodology / Instructions

Creating a Hash Table Entry:
1. Take the key (e.g., a name).
2. Convert each character to its numeric code.
3. Sum these codes.
4. Take the remainder of this sum modulo the table size to get the index.
5. Store the value at the calculated index.
Handling Collisions (Open Addressing):
- If the calculated index is occupied:
  - Apply a probing function to calculate a new index (e.g., move one cell forward, skip a growing number of cells, or use a step derived from a second hash function).
  - Repeat until an empty cell is found.
- Insert the value in the free cell.
Handling Collisions (Chaining):
- If the calculated index is occupied:
  - Store the new value in a separate memory area.
  - Add a pointer/reference to this new value in the linked list at the hash table cell.
Deletion in Open Addressing:
- Mark the cell as deleted instead of empty.
- During search, treat deleted cells as occupied and keep probing past them; during insertion, treat them as free so they can be reused.
- Periodically rehash to remove deleted cells and improve performance.