- Why AIO?
  - Three main reasons:
    - synchronous blocking IO is a major throughput bottleneck (bgwriter, startup process, checkpointer, bitmap heap scans, ...)
    - IO overhead of buffered IO too high / can't fully utilize CPU (no DMA, kernel pagecache management expensive)
      - DIO, DIO -> AIO
    - buffered kernel IO, especially for writes, is hard to schedule from userspace
      - kernel writes back data at some fairly random time, unless its hand gets forced -> latency spikes
      - kernel readahead doesn't reliably happen when we want it
        - tracks information per fd, but our data files are segmented
      - in some places we issue readahead requests with posix_fadvise, but that doesn't guarantee the page stays around, requires multiple buffercache lookups, and can cause unnecessary blocking
      - => DIO, and synchronous DIO is way too slow -> AIO
- What is AIO
  - pipelined graphic
  - multiple concurrent IOs, without waiting for completion
- What is DIO
  - avoid kernel buffering as much as possible
  - kernel copies data to/from storage directly from userspace buffers, most of the time anyway
  - often can be done using DMA to/from storage device, thereby avoiding expensive copying
  - userspace has much much more control / responsibility for direct IO
- Why doesn't postgres have AIO support yet?
  - linux AIO didn't support buffered IO
  - large project / sanity
  - adds complexity
- Why doesn't postgres support DIO yet?
  - synchronous DIO is very very slow for small IOs (latency)
- io_uring primer
  - new interface, added in linux 5.1
  - graphic with two ringbuffers
  - pretty generic, supports asynchronously executing a significant number of operations
  - allows significantly reducing syscall overhead by submitting several "syscalls" at once
  - things that the kernel can't execute asynchronously today are automatically made asynchronous by executing them in kernel threads
  - https://kernel.dk/io_uring.pdf
- AIO in Postgres, Difficulties
  - no existing infrastructure for efficient streaming reads / writes
  - some weaknesses in our cache replacement logic make kernel page cache very useful
  - postgres is process based, can't share IO threads as easily as a threaded implementation
- Postgres Architecture Points
  - Process Based -> file descriptors, buffers not shared
  - Shared Buffers / Locking crucial for in-progress IOs (at least writes)
  - Deadlocks due to locking multiple buffers and then waiting -> different backends need to be able to complete IOs
  - Platform details need to be abstracted away
  - Need a fallback implementation, with helper processes
- Slightly higher layer:
  - Completion callback based
    - One "shared memory" callback that needs to be able to unlock resources if issuing process is busy. There are limits to how badly these may fail
    - One "local" callback, that triggers only in issuing process (or not at all, if that has since "released" the IO). This is used to perform actions based on the completion. E.g. look for more pages to write back, process the page
  - Important helper infrastructure:
    - "streaming read"
      - converted seqscan, vacuum (1st scan only), pg_prewarm
    - "streaming write"
      - converted checkpoint, background writer
- Graphic
- State:
  - prototype, passes all tests
  - performance considerably better than current for some workloads, worse for some
  - https://github.com/anarazel/postgres/tree/aio
- Open questions
  - feasibility / performance of process based AIO emulation
  - ...
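
A rough sketch of the two-callback completion model from "Slightly higher layer", to make the split concrete. Everything here is an assumption for illustration: the `DemoIo` struct, the `demo_*` functions and the integer backend ids are hypothetical and are not taken from the prototype branch. Real code would keep the handle in shared memory and take proper locks; this only models who runs which callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is not the prototype's API. */
typedef struct DemoIo DemoIo;
typedef void (*DemoIoCallback)(DemoIo *io);

struct DemoIo
{
    int            issuer_id;     /* backend that submitted the IO */
    bool           in_flight;     /* submitted, not yet completed */
    bool           buffer_locked; /* stands in for the buffer's IO lock */
    bool           released;      /* issuer gave up interest in the result */
    bool           local_done;    /* local callback already ran */
    int            result;        /* bytes transferred, or a negative error */
    int            local_runs;    /* counts local callback invocations */
    DemoIoCallback shared_cb;
    DemoIoCallback local_cb;
};

/* "Shared memory" callback: must be safe to run in any backend. */
static void demo_shared_unlock(DemoIo *io)
{
    io->buffer_locked = false;
}

/* "Local" callback: follow-up work that only makes sense in the issuer. */
static void demo_local_followup(DemoIo *io)
{
    io->local_runs++;
}

static void demo_io_start(DemoIo *io, int issuer_id)
{
    io->issuer_id = issuer_id;
    io->in_flight = true;
    io->buffer_locked = true;
    io->released = false;
    io->local_done = false;
    io->result = 0;
    io->local_runs = 0;
    io->shared_cb = demo_shared_unlock;
    io->local_cb = demo_local_followup;
}

/* The issuer no longer wants the local callback to run. */
static void demo_io_release(DemoIo *io)
{
    io->released = true;
}

/* Runs the local callback at most once, only in the issuer,
 * and only if the IO has completed and was not released. */
static void demo_io_issuer_poll(DemoIo *io, int backend_id)
{
    if (backend_id != io->issuer_id || io->in_flight)
        return;
    if (io->released || io->local_done)
        return;
    io->local_done = true;
    io->local_cb(io);
}

/* Called by whichever backend reaps the completion. */
static void demo_io_complete(DemoIo *io, int completer_id, int result)
{
    assert(io->in_flight);
    io->result = result;
    io->in_flight = false;
    io->shared_cb(io);                     /* always, in any backend */
    demo_io_issuer_poll(io, completer_id); /* no-op unless completer is issuer */
}
```

If backend 2 reaps an IO issued by backend 1, the shared callback unlocks the buffer immediately, so nobody waiting on that buffer depends on backend 1 making progress. The local callback then runs only once backend 1 calls `demo_io_issuer_poll`, and never if it called `demo_io_release` first.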
windows tpch scale 100: dio: 27316.717 aio: