- idea of the talk
  - what is AIO
  - what is DIO
  - why it's hard, perhaps apparent via the next slides
- what's already been merged:
  - infrastructure for multiple in-progress buffer IOs
  - bulk relation extension
    - required for DIO, important for AIO
    - very important bottleneck for existing code
    - prerequisite: redesign buffer replacement
    - benchmark: concurrent COPY into one relation
      - slow disk:
        - PG 15:
          - 32: 24.754540
          - 128: 20.213224
        - PG 16:
          - 32: 33.116238
          - 128: 36.723445
        - AIO:
          - 32: 35.249262
          - 128: 45.469138
      - fast cloud disk:
        - PG 15:
          - 32: 78.926325
          - 128: 76.325778
        - PG 16:
          - 32: 99.871601
          - 128: 105.861666
        - AIO:
          - 32: 181.513296
          - 128: 176.029554
      - fast local disk
  - very basic direct IO support (see the O_DIRECT sketch after this outline)
    - benchmark?
  - related:
    - WAL prefetching
- next step: streaming read infrastructure, without AIO
  - explain the task
  - explain the readv-style API (see the preadv() sketch after this outline)
  - explain parallelizing work
  - benchmark?
  - hopefully 17
- next step: use the streaming read infrastructure
  - explained in more detail later
  - sequential scans
  - heap vacuum
  - btree vacuum
  - hopefully 17
- possible next step: streaming write infrastructure
  - checkpointer
  - bgwriter
  - maybe later?
- next step: AIO infrastructure
  - different engines (see the io_uring read sketch after this outline)
  - fallback to worker mode
  - goal: the AIO infrastructure can be used without loss of performance even when no native AIO support is present, to avoid duplicating code
  - plenty of cleanups required
  - maybe 17, possibly not
- next step: change the streaming read infrastructure to use AIO
  - together with the prior step
- next step: sequential scans
  - problem exposure
    - OS readahead gets confused (see the posix_fadvise() sketch after this outline)
      - skipped blocks
      - segment boundaries
    - OS readahead doesn't exist
    - OS readahead doesn't know the workload / is not aggressive enough
    - OS readahead requires buffered IO
      - slower
      - double buffering
  - benchmark: read performance via sequential scan
    - cloud network storage
      - 12GB table
      - clean OS and PG caches
      - master: 58.840
      - AIO: 16.673
      - DIO: 15.766
    - fast SSDs
      - 35GB, medium-wide tuples, SELECT count(*), 4 parallel workers
      - cold OS and PG caches:
        - master:
        - AIO: 6385.097
        - AIO + DIO: 5407.045
      - prewarm:
        - master: 19.760
        - AIO: 13.450
        - AIO + DIO: 5.902
- next step: checkpoints, bgwriter
  - benchmark: checkpointing 35GB of dirty data onto a stripe of 2 PCIe 3.x SSDs
    - PG 16: 1.601 GB/s
    - AIO: 2.069 GB/s
    - AIO + DIO: 4.922 GB/s
- next step: WAL
  - benefit #1: do something else during a WAL write / flush
  - benefit #2: multiple WAL flushes concurrently (see the io_uring flush sketch after this outline)
    - benchmark: fio results for flushes with different IO depths
    - currently: automatic "group flush"
  - hardest piece of the patchset
    - explain
  - problem: walsender / walreceiver; solution: buffering
  - benchmark: vacuum
    - improved read performance
    - improved *write* performance, due to asynchronous WAL flushing
    - pgbench scale 200, run for 1M transactions, autovacuum disabled
    - create a new database using that as a template
    - create a "vacuum" database using the new database
    - time for VACUUM (VERBOSE) pgbench_accounts
      - slow disk
        - master: 94.673 s
        - AIO: 12.349 s
      - fast disk
        - master: 33.370 s
        - AIO: 7.737 s
  - benchmark: pgbench
    - fio benchmark showing the benefit of concurrent WAL flushes
    - small WAL write problem
      - padding of WAL records
- other prototyped steps:
  - SyncDataDirectory()
  - bitmap heap scans
  - WAL replay prefetching
- future:
  - various index scans
    - mention Tomas Vondra's patch
  - much more possible
    - more work around vacuuming
    - temp table support
    - lower-level operations
      - create database
      - vacuum full
    - on-startup cleanups
      - filesystem directory iteration
  - many more
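
For the "very basic direct IO support" item: a minimal sketch, outside of any PostgreSQL code, of what direct IO means at the Linux syscall level. Opening with O_DIRECT bypasses the kernel page cache (no double buffering), but requires aligned buffers, sizes and offsets. The file path, alignment and block size below are illustrative assumptions.

```c
/*
 * Direct IO at the syscall level: O_DIRECT plus an aligned buffer.
 * File path and sizes are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const size_t block_size = 8192;	/* PostgreSQL-style 8kB page */
	void	   *buf;
	int			fd;
	ssize_t		nread;

	/* O_DIRECT bypasses the OS page cache; no double buffering */
	fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* buffer must be suitably aligned, otherwise the read fails with EINVAL */
	if (posix_memalign(&buf, 4096, block_size) != 0)
		return 1;

	nread = pread(fd, buf, block_size, 0);
	if (nread < 0)
		perror("pread");
	else
		printf("read %zd bytes directly, bypassing the page cache\n", nread);

	free(buf);
	close(fd);
	return 0;
}
```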
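
For the streaming read infrastructure's "readv-style API": a sketch of the underlying idea using plain POSIX preadv(), not the PostgreSQL-internal API. A whole range of blocks is read in one vectored call, each block landing in its own buffer, instead of issuing one pread() per 8kB block. Path and block numbers are illustrative.

```c
/*
 * Vectored read of a range of blocks in a single system call, with one
 * separate 8kB buffer per block (as with separate shared_buffers pages).
 */
#define _DEFAULT_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ	8192
#define NBLOCKS	16

int
main(void)
{
	struct iovec iov[NBLOCKS];
	int			fd;
	ssize_t		nread;
	off_t		start_block = 100;	/* illustrative block number */

	fd = open("base/12345/16384", O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* one buffer per block; the buffers need not be contiguous */
	for (int i = 0; i < NBLOCKS; i++)
	{
		iov[i].iov_base = malloc(BLCKSZ);
		iov[i].iov_len = BLCKSZ;
	}

	/* read 16 consecutive blocks with one call */
	nread = preadv(fd, iov, NBLOCKS, start_block * BLCKSZ);
	if (nread < 0)
		perror("preadv");
	else
		printf("read %zd bytes (%zd blocks) in one call\n",
			   nread, nread / BLCKSZ);

	for (int i = 0; i < NBLOCKS; i++)
		free(iov[i].iov_base);
	close(fd);
	return 0;
}
```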
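
For the AIO infrastructure's "different engines": a sketch of the basic submit/complete shape using io_uring via liburing, one of the possible engines; a worker-mode fallback would expose the same pattern with worker processes performing the IO. Build with -luring; the file path is an illustrative assumption.

```c
/*
 * Submit a read asynchronously with io_uring, do other work, then reap
 * the completion when the result is actually needed.
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char	   *buf = malloc(BLCKSZ);
	int			fd;

	fd = open("base/12345/16384", O_RDONLY);
	if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	/* stage a read of block 0 into buf, but don't wait for it */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, BLCKSZ, 0);
	io_uring_submit(&ring);

	/*
	 * ... here the backend could do other work: submit more IOs, process
	 * already-cached buffers, etc. ...
	 */

	/* reap the completion */
	if (io_uring_wait_cqe(&ring, &cqe) == 0)
	{
		printf("read completed with result %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}
```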
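
For "OS readahead gets confused": a sketch of the existing buffered-IO workaround, explicitly hinting the kernel with posix_fadvise(POSIX_FADV_WILLNEED) for blocks that readahead would otherwise miss; this is what PostgreSQL's PrefetchBuffer() boils down to on Linux. It is only a hint, gives no completion notification, and does nothing for direct IO, which is part of why it is no substitute for real AIO. Path and block numbers are illustrative.

```c
/*
 * Prefetch hints for a non-sequential block pattern that would confuse
 * kernel readahead (e.g. blocks picked by a bitmap scan).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	int			fd = open("base/12345/16384", O_RDONLY);
	/* illustrative non-sequential block numbers */
	int			blocks[] = {10, 14, 15, 96, 97, 230};

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* ask the kernel to start reading these blocks in the background */
	for (int i = 0; i < 6; i++)
		(void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
							 POSIX_FADV_WILLNEED);

	/* ... later, normal pread() of those blocks should not stall ... */
	close(fd);
	return 0;
}
```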
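
For "multiple WAL flushes concurrently": a sketch of the idea at the syscall level, queueing two fdatasync-style flushes with io_uring so both can be in flight at once rather than serializing behind a single synchronous fdatasync(). The two-segment setup and file names are assumptions purely for illustration, not how the patchset itself flushes WAL.

```c
/*
 * Two flush requests in flight concurrently via io_uring; completions are
 * reaped in whatever order they finish.
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int			fds[2];

	/* two WAL segment files, illustrative names */
	fds[0] = open("pg_wal/000000010000000000000001", O_WRONLY);
	fds[1] = open("pg_wal/000000010000000000000002", O_WRONLY);
	if (fds[0] < 0 || fds[1] < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* queue both flushes before waiting on either */
	for (int i = 0; i < 2; i++)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_fsync(sqe, fds[i], IORING_FSYNC_DATASYNC);
	}
	io_uring_submit(&ring);

	/* reap both completions */
	for (int i = 0; i < 2; i++)
	{
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
		{
			printf("flush completed: %d\n", cqe->res);
			io_uring_cqe_seen(&ring, cqe);
		}
	}

	io_uring_queue_exit(&ring);
	close(fds[0]);
	close(fds[1]);
	return 0;
}
```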