Re: Weird failure with latches in curculio on v15 - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Weird failure with latches in curculio on v15 |
Date | |
Msg-id | CA+TgmobQ5iWVCoPokW1jBrRb3V_x0PUXD05vvgS3FqQecvXHMw@mail.gmail.com |
In response to | Re: Weird failure with latches in curculio on v15 (Andres Freund <andres@anarazel.de>) |
Responses |
Re: Weird failure with latches in curculio on v15
|
List | pgsql-hackers |
On Fri, Feb 10, 2023 at 12:59 AM Andres Freund <andres@anarazel.de> wrote:
> I'm somewhat concerned about that too, but perhaps from a different
> angle. First, I think we don't do our users a service by defaulting the
> in-core implementation to something that doesn't scale to even a moderately
> busy server.

+1.

> Second, I doubt we'll get the API for any of this right, without
> an acutual user that does something more complicated than restoring one-by-one
> in a blocking manner.

Fair.

> I don't think it's that hard to imagine problems. To be reasonably fast, a
> decent restore implementation will have to 'restore ahead'. Which also
> provides ample things to go wrong. E.g.
>
> - WAL source is switched, restore module needs to react to that, but doesn't,
>   we end up lots of wasted work, or worse, filename conflicts
> - recovery follows a timeline, restore module doesn't catch on quickly enough
> - end of recovery happens, restore just continues on

I don't see how you can prevent those things from happening. If the
restore process is working in some way that requires an event loop,
and I think that will be typical for any kind of remote archiving,
then either it has control most of the time, so the event loop can be
run inside the restore process, or, as Nathan proposes, we don't let
the archiver have control and it needs to run that restore process in
a separate background worker. The hazards that you mention here exist
either way. If the event loop is running inside the restore process,
it can decide not to call the functions that we provide in a timely
fashion and thus fail to react as it should. If the event loop runs
inside a separate background worker, then that process can fail to be
responsive in precisely the same way. Fundamentally, if the author of
a restore module writes code to have multiple I/Os in flight at the
same time and does not write code to cancel those I/Os if something
changes, then such cancellation will not occur. That remains true no
matter which process is performing the I/O.

> > I don't quite see how you can make asynchronous and parallel archiving
> > work if the archiver process only calls into the archive module at
> > times that it chooses. That would mean that the module has to return
> > control to the archiver when it's in the middle of archiving one or
> > more files -- and then I don't see how it can get control back at the
> > appropriate time. Do you have a thought about that?
>
> I don't think archiver is the hard part, that already has a dedicated
> process, and it also has something of a queuing system already. The startup
> process imo is the complicated one...
>
> If we had a 'restorer' process, startup fed some sort of a queue with things
> to restore in the near future, it might be more realistic to do something you
> describe?

Some kind of queueing system might be a useful part of the interface,
and a dedicated restorer process does sound like a good idea. But the
archiver doesn't have this solved, precisely because you have to
archive a single file, return control, and wait to be invoked again
for the next file. That does not scale.

--
Robert Haas
EDB: http://www.enterprisedb.com
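
For illustration only, here is a minimal sketch of the "restore ahead"
cancellation problem discussed in this message. It is written in Python
with asyncio rather than PostgreSQL's C, and every name in it
(RestoreAhead, fetch, depth, demo) is hypothetical; none of it is a
PostgreSQL API or a proposed interface. It assumes a module that keeps
several segment fetches in flight and has to cancel them itself when the
timeline changes or recovery ends.

```python
import asyncio


class RestoreAhead:
    """Hypothetical restore-ahead helper; not a PostgreSQL interface."""

    def __init__(self, fetch, depth=3):
        self.fetch = fetch      # coroutine function: fetch(timeline, segno)
        self.depth = depth      # number of segments to keep in flight
        self.timeline = None
        self.inflight = {}      # segno -> asyncio.Task

    async def _cancel_all(self):
        tasks = list(self.inflight.values())
        self.inflight.clear()
        for task in tasks:
            task.cancel()
        # Wait until every fetch has actually observed the cancellation.
        await asyncio.gather(*tasks, return_exceptions=True)

    async def get(self, timeline, segno):
        # A timeline switch makes all prefetched work useless. Nothing
        # cancels it unless the module author writes this branch.
        if timeline != self.timeline:
            await self._cancel_all()
            self.timeline = timeline
        # Top up the window of prefetches starting at the requested segment.
        for s in range(segno, segno + self.depth):
            if s not in self.inflight:
                self.inflight[s] = asyncio.create_task(self.fetch(timeline, s))
        return await self.inflight.pop(segno)

    async def stop(self):
        # End of recovery: the remaining prefetches must be cancelled too.
        await self._cancel_all()


async def demo():
    started, cancelled = [], []
    fast = {(1, 10), (2, 11)}   # segments that "arrive" immediately

    async def fake_fetch(timeline, segno):
        started.append((timeline, segno))
        try:
            await asyncio.sleep(0 if (timeline, segno) in fast else 3600)
        except asyncio.CancelledError:
            cancelled.append((timeline, segno))
            raise
        return f"{timeline}/{segno}"

    restorer = RestoreAhead(fake_fetch, depth=3)
    first = await restorer.get(1, 10)    # prefetches segments 11 and 12
    second = await restorer.get(2, 11)   # timeline switch cancels them
    await restorer.stop()                # end of recovery
    return first, second, started, cancelled
```

Running asyncio.run(demo()) drives the sketch through a timeline switch
followed by the end of recovery. Removing the timeline check in get() or
the call to stop() leaves the stale fetches running, whichever process
hosts the event loop.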