External Sorting
Quiz: Given a file containing 10 million integers. The memory can only hold 1 million integers at a time. Sort this file.
Approach: N-way merge.
-
Split and sort: Split the file into 10 chunks, each of which fits in memory. Sort each chunk in memory.
-
N-way merge: Load the first integer of each chunk into memory and organize them in a heap. Pop the minimum integer from the heap, which is also the smallest integer in the file, and write it to the output file. Load the next integer from the corresponding chunk into the heap. Repeat this process.
To reduce disk accesses, keep a window of each chunk in memory and move to the next window only when the current window is consumed.
External Aggregation
Quiz: Given a file containing 10 million key-value pairs. The memory can only hold 1 million key-value paris at a time. Compute the sum of values for each unique key.
Approach #1: Merge-based aggregation. The steps are almost the same as N-way merge. During heap operations, pop all the pairs with the same key and perform the aggregation. Since the pairs are sorted, no values will be missed.
Approach #2: Hash-based aggregation.
- Split and pre-aggregation: Split the file into smaller chunks that fit into memory. For each chunk, perform pre-aggregation.
- Hash-based partitioning: Write the aggregated results of each chunk to different partitions based on hash values. If a partition is too large to fit in memory, re-partition it.
- Intra-partition aggregation: For each partition, perform post-aggregation. Since pairs are partitioned by their key hash, the pairs with the same key are always in the same partition (i.e., disjoint).
External Algorithms
Both external sorting and external aggregation belong to external algorithms, which extend simple algorithms from the unbounded memory model to the bounded memory model. In practice, these algorithms also consider other factors, such as memory affinity (e.g., NUMA in modern CPUs) and disk affinity (e.g., local versus remote disks in distributed systems).
External algorithms are very important and widely used in databases, as databases are designed to handle large amounts of data, often exceeding the available memory. Additionally, dividing simple algorithms into parallel sub-tasks enables intra-operator parallelism in databases, which significantly improves database performance in multi-core or distributed environments.
- External Memory Algorithms and Data Structures is a comprehensive review of external algorithms.
- External Aggregation in DuckDB and Parallel Grouped Aggregation in DuckDB introduce how DuckDB implements a fully parallel aggregate operator.