Thread: Force update_process_title=on in crash recovery?
Hi, Based on a couple of independent reports from users with no idea how to judge the progress of a system recovering from a crash, Christoph and I wondered if we should override update_process_title for the "recovering ..." message, at least until connections are allowed. We already do that to set the initial titles. Crash recovery is a rare case where important information is reported through the process title that isn't readily available anywhere else, since you can't log in. If you want to gauge progress on a system that happened to crash with update_process_title set to off, your best hope is probably to trace the process or spy on the files it has open, to see which WAL segment it's accessing, but that's not very nice.
Thomas Munro <thomas.munro@gmail.com> writes: > Based on a couple of independent reports from users with no idea how > to judge the progress of a system recovering from a crash, Christoph > and I wondered if we should override update_process_title for the > "recovering ..." message, at least until connections are allowed. We > already do that to set the initial titles. > Crash recovery is a rare case where important information is reported > through the process title that isn't readily available anywhere else, > since you can't log in. If you want to gauge progress on a system > that happened to crash with update_process_title set to off, your best > hope is probably to trace the process or spy on the files it has open, > to see which WAL segment it's accessing, but that's not very nice. Seems like a good argument, but you'd have to be careful about the final state when you stop overriding update_process_title --- it can't be left looking like it's still-in-progress on some random WAL file. (Compare my nearby gripes about walsenders being sloppy about their pg_stat_activity and process title presentations.) regards, tom lane
On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Based on a couple of independent reports from users with no idea how > > to judge the progress of a system recovering from a crash, Christoph > > and I wondered if we should override update_process_title for the > > "recovering ..." message, at least until connections are allowed. We > > already do that to set the initial titles. > > > Crash recovery is a rare case where important information is reported > > through the process title that isn't readily available anywhere else, > > since you can't log in. If you want to gauge progress on a system > > that happened to crash with update_process_title set to off, your best > > hope is probably to trace the process or spy on the files it has open, > > to see which WAL segment it's accessing, but that's not very nice. > > Seems like a good argument, but you'd have to be careful about the > final state when you stop overriding update_process_title --- it can't > be left looking like it's still-in-progress on some random WAL file. > (Compare my nearby gripes about walsenders being sloppy about their > pg_stat_activity and process title presentations.) Related: https://commitfest.postgresql.org/29/2688/ I'm not sure I understood Michael's recent message, but I think it may refer to promotion of a standby. -- Justin
On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote: > Seems like a good argument, but you'd have to be careful about the > final state when you stop overriding update_process_title --- it can't > be left looking like it's still-in-progress on some random WAL file. > (Compare my nearby gripes about walsenders being sloppy about their > pg_stat_activity and process title presentations.) Another thing to be careful here is WIN32, see 0921554. And slowing down recovery is never a good idea. -- Michael
On Wed, Sep 16, 2020 at 2:30 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Sep 15, 2020 at 10:01:18AM -0400, Tom Lane wrote: > > Seems like a good argument, but you'd have to be careful about the > > final state when you stop overriding update_process_title --- it can't > > be left looking like it's still-in-progress on some random WAL file. > > (Compare my nearby gripes about walsenders being sloppy about their > > pg_stat_activity and process title presentations.) > > Another thing to be careful here is WIN32, see 0921554. And slowing > down recovery is never a good idea. Right, that commit makes a lot of sense because it suppresses many system calls that happen for each query. The same problem existed on older FreeBSD versions and I saw that costing ~10% of TPS on read-only pgbench. In other commits I've been removing system calls that happen for every WAL record. But in this thread I'm talking about an update per 16MB WAL file, which seems like an acceptable ratio to me.
Thomas Munro <thomas.munro@gmail.com> writes: > On Wed, Sep 16, 2020 at 2:30 PM Michael Paquier <michael@paquier.xyz> wrote: >> Another thing to be careful here is WIN32, see 0921554. And slowing >> down recovery is never a good idea. > Right, that commit makes a lot of sense because it suppresses many > system calls that happen for each query. The same problem existed on > older FreeBSD versions and I saw that costing ~10% of TPS on read-only > pgbench. In other commits I've been removing system calls that happen > for every WAL record. But in this thread I'm talking about an update > per 16MB WAL file, which seems like an acceptable ratio to me. Hmm ... the thread leading up to 0921554 indicates that the performance penalty of update_process_title=on is just ridiculously large on Windows. Maybe those numbers are not relevant to crash recovery WAL-application, but it might be smart to actually measure that not just assume it. In any case, I'd recommend setting up any patch you create for this to be easily "ifndef WIN32"'d in case we change our minds on the point later. regards, tom lane
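The proposal so far (override the setting until connections are allowed, leave a sane final title, and keep the override easy to compile out on Windows) could be sketched roughly as below. This is a standalone toy, not PostgreSQL source: `connections_allowed`, `ps_title`, `title_update_wanted()`, `report_recovering()` and `end_title_override()` are hypothetical names invented for illustration, and the real server reports titles through its own ps-status code.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for server state; not PostgreSQL's real symbols. */
static bool update_process_title = false;  /* the user's setting: off */
static bool connections_allowed = false;   /* false while crash recovery runs */
static char ps_title[128] = "";            /* what "ps" would display */

/* Decide whether a title update should happen at all. */
static bool
title_update_wanted(void)
{
#ifndef WIN32
    /* Assumed policy: ignore the setting until connections are allowed. */
    if (!connections_allowed)
        return true;
#endif
    return update_process_title;
}

/* Intended to be called once per WAL segment, not once per WAL record. */
static void
report_recovering(const char *walfile)
{
    if (title_update_wanted())
        snprintf(ps_title, sizeof(ps_title), "recovering %s", walfile);
}

/*
 * End the override.  If the user's setting is off, clear the title so it
 * is not left naming whichever WAL file happened to be replayed last.
 */
static void
end_title_override(void)
{
    connections_allowed = true;
    if (!update_process_title)
        ps_title[0] = '\0';
}
```

With the setting off, calling `report_recovering()` during recovery still updates `ps_title` on non-Windows builds; after `end_title_override()` the title is cleared and later calls leave it alone. Defining `WIN32` removes the override entirely, which is the escape hatch suggested above.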
On Wed, 16 Sep 2020 at 17:43, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Hmm ... the thread leading up to 0921554 indicates that the performance > penalty of update_process_title=on is just ridiculously large on Windows. > Maybe those numbers are not relevant to crash recovery WAL-application, > but it might be smart to actually measure that not just assume it. I had a go at measuring this on Windows and couldn't really detect any slowdown from running update_process_title on vs off. Average over 3 runs with update_process_title = off was 94.38 s, switched on the average was 93.81 s. (Some noise there) Adding a bit of logging shows that the process title was set 225 times: once to set it to an empty string, then once for each of the 224 segments replayed. Also, from a pgbench -s test with update_process_title on and again with off I see 9343 tps vs 11969 tps. The process title is changed twice for each query, once to set it to the query and once to set it to "idle". Doing a bit of maths there, it seems that setting the process title takes about 15 microseconds per call. So it would have taken about 3.38 milliseconds to set the process title 225 times for recovery, or if you prefer, 0.003609% additional overhead. I don't think we'll notice. David
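The overhead estimate above can be reproduced with a few lines of arithmetic. The sketch below takes the figures from the message (225 title updates, about 15 microseconds each, a recovery run of about 93.81 s); the function names `title_cost_ms()` and `overhead_percent()` are made up for illustration.

```c
#include <stdio.h>

/* Total time spent on title updates, in milliseconds. */
static double
title_cost_ms(int updates, double usec_per_update)
{
    return updates * usec_per_update / 1000.0;
}

/* That cost expressed as a percentage of the whole recovery run. */
static double
overhead_percent(double cost_ms, double recovery_seconds)
{
    return cost_ms / (recovery_seconds * 1000.0) * 100.0;
}
```

`title_cost_ms(225, 15.0)` comes to roughly 3.4 ms, and `overhead_percent()` of that against a 93.81 s run is roughly 0.0036%, in line with the figures quoted above; the last digits depend on rounding and on which run time is used as the denominator.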