Re: Add a log message on recovery startup before syncing datadir - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Add a log message on recovery startup before syncing datadir |
Date | |
Msg-id | CA+hUKG+aiKFcyH4H-zd2JSOTgd10PoPwk=g4Jyg1bQvoMO_aSw@mail.gmail.com |
In response to | Re: Add a log message on recovery startup before syncing datadir (Michael Banck <michael.banck@credativ.de>) |
List | pgsql-hackers |
On Wed, Oct 7, 2020 at 10:18 PM Michael Banck <michael.banck@credativ.de> wrote:
> On Wednesday, 07.10.2020, at 21:11 +1300, Thomas Munro wrote:
> > On Wed, Oct 7, 2020 at 8:58 PM Michael Banck <michael.banck@credativ.de> wrote:
> > > "starting archive recovery" appears) took 20 minutes and during the
> >
> > Nice data point.
>
> In the thread you pointed to below, Robert also mentions "tens of
> minutes".

I've also heard anecdata about much worse cases running for hours,
leading people to go and comment that code out in a special build.

> > https://www.postgresql.org/message-id/flat/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
>
> Thanks, I've re-read that now. As a summary, Tom said that the syncs
> can't go away on the relations files/tablespaces. Robert suggested some
> periodic status update that we're still doing work. You proposed using
> syncfs. There was a discussion about omitting some files/directories,
> but I think those don't matter much for people who see extreme delays
> here because those are very likely due to relation files.

Actually I proposed two different alternatives, and am hoping to get
thoughts, support, objections to help me figure out which way to go.
In summary:

1. syncfs() is simple, and doesn't require much analysis to see that
it is about as good/bad at the job as the current coding while being
vastly more efficient. It only works on Linux.

2. Precise fsync() based on WAL contents is a more complicated rethink
from first principles, and requires careful analysis of the arguments
I've made about its safety. It works on all systems, and is more
efficient than the big-syncfs-hammer, because it delays writing dirty
data that'll be redirtied by redo, so it's possible that it only goes
out to disk once.
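To make the contrast with alternative 1 concrete, here is a minimal Python sketch, not PostgreSQL code. The function names fsync_each_file and syncfs_once are hypothetical. The first roughly mimics the open-and-fsync-every-file pattern, the second issues a single syncfs() through ctypes, assuming a Linux libc that exports it. Alternative 2 (WAL-driven fsync) is not shown, since it depends on redo internals.

```python
import ctypes
import os


def fsync_each_file(root):
    """Open and fsync every regular file under root; return how many were synced.

    The cost grows with the number of files even when nothing is dirty,
    because every file has to be opened.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            count += 1
    return count


def syncfs_once(root):
    """Flush the whole filesystem containing root with one syncfs() call.

    Returns True on success and False where libc has no syncfs()
    (assumption: only Linux provides it). Raises OSError if the call fails.
    """
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        syncfs = libc.syncfs
    except (OSError, AttributeError, TypeError):
        return False
    fd = os.open(root, os.O_RDONLY)
    try:
        if syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
    return True
```

Note that syncfs() flushes everything on that filesystem, not just the data directory, and a data directory spread over several filesystems (tablespaces, a separate WAL volume) would need one call per filesystem.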
> I had a quick look yesterday and it seems to me the code that actually
> does the syncing has no notion of how many files we synced already or
> will sync in total, and adding that along with book-keeping would
> complicate the calling pattern considerably. So I don't think a status
> update like "finished syncing 10000 out of 23054 files" is going to (i)
> be back-patchable or (ii) would not be even slower than now due to
> having to figure out how much work is to be done first.

Actually I think you could find out how many files will be synced
quite quickly! Reading directory contents is fast, because directory
entries hopefully live in sequential blocks and are potentially
prefetchable by the kernel (well, details vary from filesystem to
filesystem, but the point is that you don't need to examine the
potentially randomly scattered inodes just to count the files). For
example, ResetUnloggedRelations() already does a full scan of relation
directory entries on crash startup, and I know from experiments with 1
million relations that that is fast, whereas SyncDataDirectory() is
very very slow even with no dirty data, only because it has to open
every file.

But this will be a moot question if I can get rid of the problem
completely, so I'm not personally going to investigate making a
progress counter.

> So I would suggest to go with a minimal message before starting to sync
> as proposed, which I think should be back-patchable, and try to (also)
> make it faster (e.g. via syncfs) for v14.

Yeah, I am only proposing to change the behaviour in PG14.
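For what it's worth, the count-first idea discussed above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not how SyncDataDirectory() works: count_files and sync_with_progress are hypothetical names, the message format is borrowed from the quoted example, and symlinked directories (such as tablespace links) are deliberately not followed.

```python
import os


def count_files(root):
    """Count regular files under root by reading directory entries only.

    os.scandir() can usually answer is_dir()/is_file() from the directory
    entry itself, so no file is opened here (details vary by filesystem).
    """
    total = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += 1
    return total


def sync_with_progress(root, report_every=1000, log=print):
    """fsync every regular file under root, reporting progress as it goes."""
    total = count_files(root)  # cheap first pass: directory entries only
    done = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # Expensive second pass: each file must be opened.
                    fd = os.open(entry.path, os.O_RDONLY)
                    try:
                        os.fsync(fd)
                    finally:
                        os.close(fd)
                    done += 1
                    if done % report_every == 0 or done == total:
                        log(f"finished syncing {done} out of {total} files")
    return done
```

The counting pass touches only directories, which is why it should stay cheap relative to the pass that opens every file.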