- goal of talk
  - explain to others why the topic is important
  - build a shared understanding
  - get other people to work on the problem
  - fixing NUMA issues will often help on non-NUMA systems too
  - approaches implemented in a hacky way, not claiming the topics
- what is NUMA
  - increased latency, decreased bandwidth
  - "official" part
  - "unofficial" part
- why it is more important than it used to be
  - core counts keep increasing
  - core-to-core latency is also increasing
  - chiplet designs
- how NUMA works on Linux
  - default placement: local to the allocating CPU
  - placement determined on first use! -> prewarming (or similar) will drastically change where memory ends up
  - NUMA balancing (costly, but better than nothing)
- problems with NUMA in Postgres
  - no insight -> can't explain performance behaviour
  - buffer-related state is the worst problem
    - buffer replacement is not NUMA aware: everything is interspersed
      - also bad for non-NUMA systems! (TLB misses, mdreadv())
    - examples:
      - N sequential scans on independent tables
      - 1 [parallel] sequential scan on one table
    - solution:
      - #cpus freelists (one per CPU)
      - NUMA-aware clock sweep
  - heavy contention for frequently accessed read-only buffers
    - one NUMA node can be faster than all NUMA nodes!
    - using a different database on each NUMA node scales linearly
    - duplicating data reduces cache efficiency
    - solution: fastpath locking for buffers
      - hard part: deciding when to use it
  - without numactl --interleave=all: often all memory ends up on one node
  - with --interleave=all: local memory is also allocated in interleaved fashion, ~10-20% perf loss
  - other stuff:
    - read-mostly and frequently changing data on the same cacheline
      - example: TransamVariablesData
      - 50% faster concurrent xid assignment with subxids
    - procarray probably "too dense"
      - possible solution: per-NUMA-node freelists?
- how to identify
  - perf c2c can be helpful
  - perf stat -e mem_load_l3_miss_retired.remote_dram,mem_load_l3_miss_retired.remote_fwd,mem_load_l3_miss_retired.remote_hitm,mem_load_l3_miss_retired.local_dram,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/