- idea of the talk
  - what is AIO
  - what is DIO
  - why it's hard, perhaps apparent via the next slides
- what's already been merged:
  - infrastructure for multiple in-progress buffer IOs
  - bulk relation extension
    - required for DIO, important for AIO
    - very important bottleneck for existing code
    - prerequisite: redesign buffer replacement
    - benchmark: concurrent COPY into one relation
      - slow disk:
        - PG 15:
          - 32: 24.754540
          - 128: 20.213224
        - PG 16:
          - 32: 33.116238
          - 128: 36.723445
        - AIO:
          - 32: 35.249262
          - 128: 45.469138
      - fast cloud disk:
        - PG 15:
          - 32: 78.926325
          - 128: 76.325778
        - PG 16:
          - 32: 99.871601
          - 128: 105.861666
        - AIO:
          - 32: 181.513296
          - 128: 176.029554
      - fast local disk
  - very basic direct IO support (see the O_DIRECT sketch after this outline)
    - benchmark?
  - related:
    - WAL prefetching
- next step: streaming read infrastructure, without AIO
  - explain the task
  - explain the readv-style API (see the preadv() sketch after this outline)
  - explain parallelizing work
  - benchmark?
  - hopefully 17
- next step: use the streaming read infrastructure
  - explained in more detail later
  - sequential scans
  - heap vacuum
  - btree vacuum
  - hopefully 17
- possible next step: streaming write infrastructure
  - checkpointer
  - bgwriter
  - maybe later?
- next step: AIO infrastructure
  - different engines (see the io_uring read sketch after this outline)
  - fallback to worker mode
  - goal: the AIO infrastructure can be used without loss of performance even when no native AIO support is present, to avoid duplicating code
  - plenty of cleanups required
  - maybe 17, possibly not
- next step: change the streaming read infrastructure to use AIO
  - together with the prior step
- next step: sequential scans
  - problem exposure
    - OS readahead gets confused (see the posix_fadvise() sketch after this outline)
      - skipped blocks
      - segment boundaries
    - OS readahead doesn't exist
    - OS readahead doesn't know the workload / is not aggressive enough
    - OS readahead requires buffered IO
      - slower
      - double buffering
  - benchmark: read performance via sequential scan
    - cloud network storage
      - 12GB table
      - clean OS and PG caches
      - master: 58.840
      - AIO: 16.673
      - DIO: 15.766
    - fast SSDs
      - 35GB, medium-wide tuples, SELECT count(*), 4 parallel workers
      - cold OS and PG caches:
        - master:
        - AIO: 6385.097
        - AIO + DIO: 5407.045
      - prewarm:
        - master: 19.760
        - AIO: 13.450
        - AIO + DIO: 5.902
- next step: checkpoints, bgwriter
  - benchmark: checkpointing 35GB of dirty data onto a stripe of 2 PCIe 3.x SSDs
    - PG 16: 1.601 GB/s
    - AIO: 2.069 GB/s
    - AIO + DIO: 4.922 GB/s
- next step: WAL
  - benefit #1: do something else during a WAL write / flush
  - benefit #2: multiple WAL flushes concurrently (see the io_uring flush sketch after this outline)
    - benchmark: fio results for flushes with different IO depths
    - currently: automatic "group flush"
  - hardest piece of the patchset
    - explain
  - problem: walsender / walreceiver; solution: buffering
  - benchmark: vacuum
    - improved read performance
    - improved *write* performance, due to asynchronous WAL flushing
    - pgbench scale 200, run for 1M transactions, autovacuum disabled
    - create a new database using that as a template
    - create a "vacuum" database using the new database
    - time for VACUUM (VERBOSE) pgbench_accounts
      - slow disk
        - master: 94.673 s
        - AIO: 12.349 s
      - fast disk
        - master: 33.370 s
        - AIO: 7.737 s
  - benchmark: pgbench
    - fio benchmark showing the benefit of concurrent WAL flushes
    - small WAL write problem
      - padding of WAL records
- other prototyped steps:
  - SyncDataDirectory()
  - bitmap heap scans
  - WAL replay prefetching
- future:
  - various index scans
    - mention Tomas Vondra's patch
  - much more possible
    - more work around vacuuming
    - temp table support
    - lower-level operations
      - create database
      - vacuum full
    - on-startup cleanups
      - filesystem directory iteration
  - many more
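
For the "very basic direct IO support" item: a minimal sketch, outside of any PostgreSQL code, of what direct IO means at the Linux syscall level. Opening with O_DIRECT bypasses the kernel page cache (no double buffering), but requires aligned buffers, sizes and offsets. The file path, alignment and block size below are illustrative assumptions.

```c
/*
 * Direct IO at the syscall level: O_DIRECT plus an aligned buffer.
 * File path and sizes are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const size_t block_size = 8192;	/* PostgreSQL-style 8kB page */
	void	   *buf;
	int			fd;
	ssize_t		nread;

	/* O_DIRECT bypasses the OS page cache; no double buffering */
	fd = open("base/12345/16384", O_RDONLY | O_DIRECT);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* buffer must be suitably aligned, otherwise the read fails with EINVAL */
	if (posix_memalign(&buf, 4096, block_size) != 0)
		return 1;

	nread = pread(fd, buf, block_size, 0);
	if (nread < 0)
		perror("pread");
	else
		printf("read %zd bytes directly, bypassing the page cache\n", nread);

	free(buf);
	close(fd);
	return 0;
}
```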
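
For the streaming read infrastructure's "readv-style API": a sketch of the underlying idea using plain POSIX preadv(), not the PostgreSQL-internal API. A whole range of blocks is read in one vectored call, each block landing in its own buffer, instead of issuing one pread() per 8kB block. Path and block numbers are illustrative.

```c
/*
 * Vectored read of a range of blocks in a single system call, with one
 * separate 8kB buffer per block (as with separate shared_buffers pages).
 */
#define _DEFAULT_SOURCE
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ	8192
#define NBLOCKS	16

int
main(void)
{
	struct iovec iov[NBLOCKS];
	int			fd;
	ssize_t		nread;
	off_t		start_block = 100;	/* illustrative block number */

	fd = open("base/12345/16384", O_RDONLY);
	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* one buffer per block; the buffers need not be contiguous */
	for (int i = 0; i < NBLOCKS; i++)
	{
		iov[i].iov_base = malloc(BLCKSZ);
		iov[i].iov_len = BLCKSZ;
	}

	/* read 16 consecutive blocks with one call */
	nread = preadv(fd, iov, NBLOCKS, start_block * BLCKSZ);
	if (nread < 0)
		perror("preadv");
	else
		printf("read %zd bytes (%zd blocks) in one call\n",
			   nread, nread / BLCKSZ);

	for (int i = 0; i < NBLOCKS; i++)
		free(iov[i].iov_base);
	close(fd);
	return 0;
}
```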
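
For the AIO infrastructure's "different engines": a sketch of the basic submit/complete shape using io_uring via liburing, one of the possible engines; a worker-mode fallback would expose the same pattern with worker processes performing the IO. Build with -luring; the file path is an illustrative assumption.

```c
/*
 * Submit a read asynchronously with io_uring, do other work, then reap
 * the completion when the result is actually needed.
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char	   *buf = malloc(BLCKSZ);
	int			fd;

	fd = open("base/12345/16384", O_RDONLY);
	if (fd < 0 || io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	/* stage a read of block 0 into buf, but don't wait for it */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, BLCKSZ, 0);
	io_uring_submit(&ring);

	/*
	 * ... here the backend could do other work: submit more IOs, process
	 * already-cached buffers, etc. ...
	 */

	/* reap the completion */
	if (io_uring_wait_cqe(&ring, &cqe) == 0)
	{
		printf("read completed with result %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}
```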
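
For "OS readahead gets confused": a sketch of the existing buffered-IO workaround, explicitly hinting the kernel with posix_fadvise(POSIX_FADV_WILLNEED) for blocks that readahead would otherwise miss; this is what PostgreSQL's PrefetchBuffer() boils down to on Linux. It is only a hint, gives no completion notification, and does nothing for direct IO, which is part of why it is no substitute for real AIO. Path and block numbers are illustrative.

```c
/*
 * Prefetch hints for a non-sequential block pattern that would confuse
 * kernel readahead (e.g. blocks picked by a bitmap scan).
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
	int			fd = open("base/12345/16384", O_RDONLY);
	/* illustrative non-sequential block numbers */
	int			blocks[] = {10, 14, 15, 96, 97, 230};

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* ask the kernel to start reading these blocks in the background */
	for (int i = 0; i < 6; i++)
		(void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
							 POSIX_FADV_WILLNEED);

	/* ... later, normal pread() of those blocks should not stall ... */
	close(fd);
	return 0;
}
```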
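
For "multiple WAL flushes concurrently": a sketch of the idea at the syscall level, queueing two fdatasync-style flushes with io_uring so both can be in flight at once rather than serializing behind a single synchronous fdatasync(). The two-segment setup and file names are assumptions purely for illustration, not how the patchset itself flushes WAL.

```c
/*
 * Two flush requests in flight concurrently via io_uring; completions are
 * reaped in whatever order they finish.
 */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int			fds[2];

	/* two WAL segment files, illustrative names */
	fds[0] = open("pg_wal/000000010000000000000001", O_WRONLY);
	fds[1] = open("pg_wal/000000010000000000000002", O_WRONLY);
	if (fds[0] < 0 || fds[1] < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* queue both flushes before waiting on either */
	for (int i = 0; i < 2; i++)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_fsync(sqe, fds[i], IORING_FSYNC_DATASYNC);
	}
	io_uring_submit(&ring);

	/* reap both completions */
	for (int i = 0; i < 2; i++)
	{
		if (io_uring_wait_cqe(&ring, &cqe) == 0)
		{
			printf("flush completed: %d\n", cqe->res);
			io_uring_cqe_seen(&ring, cqe);
		}
	}

	io_uring_queue_exit(&ring);
	close(fds[0]);
	close(fds[1]);
	return 0;
}
```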