Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers

| From | Jakub Wartak |
|---|---|
| Subject | Re: Handing off SLRU fsyncs to the checkpointer |
| Date | |
| Msg-id | VI1PR0701MB69603BF4F1EC2AF6E6ECF5C1F6550@VI1PR0701MB6960.eurprd07.prod.outlook.com |
| In response to | Re: Handing off SLRU fsyncs to the checkpointer (Thomas Munro <thomas.munro@gmail.com>) |
| Responses | Re: Handing off SLRU fsyncs to the checkpointer; Re: Handing off SLRU fsyncs to the checkpointer |
| List | pgsql-hackers |
Hi Thomas / hackers,

>> The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS cache (but not in shared buffers, please compare with TEST1 as there was no such bottleneck at all):
>>
>>     29.62%  postgres  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>>             ---copy_user_enhanced_fast_string
>>                |--17.98%--copyin
>> [..]
>>                |          __pwrite_nocancel
>>                |          FileWrite
>>                |          mdwrite
>>                |          FlushBuffer
>>                |          ReadBuffer_common
>>                |          ReadBufferWithoutRelcache
>>                |          XLogReadBufferExtended
>>                |          XLogReadBufferForRedoExtended
>>                |           --17.57%--btree_xlog_insert
>
> To move these writes out of recovery's way, we should probably just
> run the bgwriter process during crash recovery.  I'm going to look
> into that.

Sounds awesome. Also, as this thread is starting to derail the SLRU fsync topic - maybe we should change the subject?

However, to add some data to the separate-bgwriter idea: on 14master (already with lseek() caching, SLRU fsyncs out of the way, better sorting), I've measured the same configuration as last time, with still the same append-only WAL workload on NVMe, and compared various shared_buffers settings (buffer descriptors sizing taken from pg_shmem_allocations, which as You stated is wrongly reported(?) - I'm including it only for reference, just in case):

    shared_buffers=128MB  buffers_desc=1024kB   96.778, 0.438  [a]
    shared_buffers=256MB  buffers_desc=2048kB   62.755, 0.577  [a]
    shared_buffers=512MB  buffers_desc=4096kB   33.167, 0.62   [a]
    shared_buffers=1GB    buffers_desc=8192kB   27.303, 0.929
    shared_buffers=4GB    buffers_desc=32MB     27.185, 1.166
    shared_buffers=8GB    buffers_desc=64MB     27.649, 1.088
    shared_buffers=16GB   buffers_desc=128MB    27.584, 1.201  [b]
    shared_buffers=32GB   buffers_desc=256MB    32.314, 1.171  [b]
    shared_buffers=48GB   buffers_desc=384MB    31.95,  1.217
    shared_buffers=64GB   buffers_desc=512MB    31.276, 1.349
    shared_buffers=72GB   buffers_desc=576MB    31.925, 1.284
    shared_buffers=80GB   buffers_desc=640MB    31.809, 1.413

The amount of WAL to be replayed was ~2.8GB. To me it looks like:

a) flushing dirty buffers from StartupXLog might be a real problem, but please read on;

b) there is only a very low impact from the L2/L3 hypothesis you mention (?) - it's not that big, and it's not degrading linearly as one would expect (??). L1d:L1i:L2:L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the DB size is too small.

I've double-checked that in condition [a] (shared_buffers=128MB) there were a lot of FlushBuffer() invocations per second (perf stat -e probe_postgres:FlushBuffer -I 1000), e.g.:

    #           time             counts unit events
         1.000485217             79,494      probe_postgres:FlushBuffer
         2.000861366             75,305      probe_postgres:FlushBuffer
         3.001263163             79,662      probe_postgres:FlushBuffer
         4.001650795             80,317      probe_postgres:FlushBuffer
         5.002033273             79,951      probe_postgres:FlushBuffer
         6.002418646             79,993      probe_postgres:FlushBuffer

while at 1GB shared_buffers it sits nearly at zero all the time. So there is something like a 3x speed-up possible if StartupXLog wouldn't have to care about dirty buffers in the critical path (bgwriter would instead, as you say?), at least in append-only scenarios.

But ... I've also checked some real systems (even on older versions of PostgreSQL, not doing that much replication), and indeed there are e.g. 8-12k/s of FlushBuffer()s happening there while shared buffers are pretty huge (>100GB, 0.5TB RAM) - but those are *system-wide* numbers: the work is really performed by bgwriter, not by the startup/recovering process as in this redo-bench case, because there the DB is open for reads and replicating -- so it appears that this isn't an optimization for the hot-standby case, but for the DB-closed startup recovery/restart/disaster/PITR scenario.
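Just for reference, a minimal sketch of the probe setup behind those counts (the path and the PID are placeholders here; FlushBuffer is static in bufmgr.c, so the binary needs debug symbols for perf probe to find it):

    # add a uprobe on FlushBuffer; this creates the probe_postgres:FlushBuffer event
    perf probe -x /path/to/bin/postgres FlushBuffer

    # per-second counts for the recovering startup process only ...
    perf stat -e probe_postgres:FlushBuffer -I 1000 -p <startup_pid>

    # ... or system-wide, which is what the 8-12k/s numbers above correspond to
    perf stat -e probe_postgres:FlushBuffer -I 1000 -a

    # drop the probe afterwards
    perf probe --del probe_postgres:FlushBuffer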
As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at all given the top profile (output trimmed), again as expected:

    13.41%  postgres  postgres  [.] hash_search_with_hash_value
            |--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
             --5.11%--smgropen
                       |--2.77%--XLogReadBufferExtended
                        --2.34%--ReadBufferWithoutRelcache
     7.88%  postgres  postgres  [.] MarkBufferDirty

I've tried to get the cache-miss ratio via PMCs, but apparently on EC2 they are (even on bigger instances) reported as not-supported or zeros. Interestingly, however, the workload has an IPC of 1.40 (instruction bound), which to me is strange, as I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some different hardware one day.

>>              __pread_nocancel
>>              --11.44%--FileRead
>>                        mdread
>>                        ReadBuffer_common
>>                        ReadBufferWithoutRelcache
>>                        XLogReadBufferExtended
>>                        XLogReadBufferForRedoExtended
>
> For these reads, the solution should be WAL prefetching, (..)  But...
> when combined with Andres's work-in-progress AIO stuff (..)

Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements of those two together (AIO + WAL prefetch) sooner or later.

-Jakub Wartak.
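PS. In case anyone wants to cross-check the IPC observation above on other hardware, a minimal sketch of the kind of counters involved (generic perf events only; <startup_pid> is a placeholder for whatever ps reports for the startup/recovering process, and on many EC2 instance types the cache events just show up as <not supported>):

    # attach to the recovery process for ~10s; with cycles+instructions counted,
    # perf stat also prints the derived "insn per cycle" (IPC) line
    perf stat -e cycles,instructions,cache-references,cache-misses -p <startup_pid> -- sleep 10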