Re: Handing off SLRU fsyncs to the checkpointer - Mailing list pgsql-hackers

From	Jakub Wartak
Subject	Re: Handing off SLRU fsyncs to the checkpointer
Date	August 31, 2020 08:49:54
Msg-id	VI1PR0701MB69600D6FBF72A92CD6E18AF8F6510@VI1PR0701MB6960.eurprd07.prod.outlook.com
In response to	Re: Handing off SLRU fsyncs to the checkpointer (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: Handing off SLRU fsyncs to the checkpointer
List	pgsql-hackers

Hi Thomas, hackers,

>> ... %CPU ... COMMAND
>> ... 97.4 ... postgres: startup recovering 000000010000000000000089
> So, what else is pushing this thing off CPU, anyway?  For one thing, I
> guess it might be stalling while reading the WAL itself, because (1)
> we only read it 8KB at a time, relying on kernel read-ahead, which
> typically defaults to 128KB I/Os unless you cranked it up, but for
> example we know that's not enough to saturate a sequential scan on
> NVME system, so maybe it hurts here too (2) we keep having to switch
> segment files every 16MB.  Increasing WAL segment size and kernel
> readahead size presumably help with that, if indeed it is a problem,
> but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
> future segment every time we cross a boundary, and also maybe increase
> the size of our reads.

All of the above (1, 2) would make sense, and the effects are IMHO partially achievable via ./configure compile
options, but from previous correspondence [1], in this particular workload it looked like the problem was not WAL
reading but reading random DB blocks into shared buffers. In that case I suppose it was the price of too many syscalls
into the OS/VFS cache itself, as the DB was small and fully cached there. So problem (3):
copy_user_enhanced_fast_string <- 17.79%--copyout(!) <- __pread_nocancel <- 16.56%--FileRead / mdread /
ReadBuffer_common (!). Without some micro-optimization or some form of vectorized [batching] I/O in recovery, it's a
dead end when it comes to small changes.
Things that come to my mind for enhancing recovery:
- preadv() - works only for 1 fd, while the WAL stream might require reading a lot of random pages into s_b (many
relations/fds; even btree inserts into a single relation might put data into many 1GB [default] segment files). This
would only micro-optimize the INSERT (pk) SELECT nextval(seq) kind of processing on the recovery side, I suppose. Of
course, that is provided StartupXLOG worked in a more batched way: (a) reading a lot of blocks from WAL at once, (b)
then issuing preadv() to get all the DB blocks from the same rel/fd into s_b, (c) applying WAL. Sounds like a major
refactor just to save syscalls :(
- mmap() - even more unrealistic
- IO_URING - shows a lot of promise here I think; is it even planned to be shown in the PgSQL14 cycle? Or is it more
like PgSQL15?
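
To make the preadv() limitation concrete, here is a minimal sketch (not PostgreSQL code) of what one batched call can do: it fills several separate 8KB buffers from one contiguous range of one file descriptor. The helper name read_block_run, the MAX_BATCH cap and the local BLCKSZ define are assumptions made for illustration. Since a single call takes one fd and one starting offset, blocks scattered across a file or across relations would still need one call per contiguous run.

```c
#define _GNU_SOURCE             /* glibc declares preadv() under _DEFAULT_SOURCE/_GNU_SOURCE */
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ    8192          /* assumed block size, defined locally for the sketch */
#define MAX_BATCH 16            /* hypothetical cap on blocks per call */

/*
 * Hypothetical helper: read nblocks consecutive blocks, starting at block
 * number first_block, from fd into nblocks separate buffers with a single
 * syscall.  Returns the number of bytes read (which may be short), or -1
 * on error with errno set.
 */
static ssize_t
read_block_run(int fd, off_t first_block, char *bufs[], int nblocks)
{
    struct iovec iov[MAX_BATCH];

    assert(nblocks > 0 && nblocks <= MAX_BATCH);

    for (int i = 0; i < nblocks; i++)
    {
        iov[i].iov_base = bufs[i];      /* destination buffers need not be adjacent */
        iov[i].iov_len = BLCKSZ;
    }

    /* one fd, one starting offset: the file range itself must be contiguous */
    return preadv(fd, iov, nblocks, first_block * (off_t) BLCKSZ);
}
```

A caller would still have to sort the blocks referenced by a batch of WAL records by relation and block number and then find adjacent runs before this saves any syscalls, which is the refactoring cost described in (a)-(c).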

-Jakub Wartak

[1] -
https://www.postgresql.org/message-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520%40VI1PR0701MB6960.eurprd07.prod.outlook.com
please see profile after "0001+0002 s_b(at)128MB: 85.392"