Re: Fix checkpoint skip logic on idle systems by tracking LSN progress - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Fix checkpoint skip logic on idle systems by tracking LSN progress |
Date | |
Msg-id | CAB7nPqT2m1aq2QRODVzWeatYyESCV0pKuEpzD3QOOowc8gH+EA@mail.gmail.com Whole thread Raw |
In response to | Re: Fix checkpoint skip logic on idle systems by tracking LSN progress (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
List | pgsql-hackers |
On Tue, Sep 27, 2016 at 7:16 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I apologize in advance that the comments in this message might > one of the ideas discarded in the past thread.. I might not grasp > the discussion completely X( No problem. > At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com> > >> A couple of months back is has been reported to pgsql-bugs that WAL >> segments were always switched with a low value of archive_timeout even >> if a system is completely idle: >> http://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org >> In short, a closer look at the problem has showed up that the logic in >> charge of checking if a checkpoint should be skipped or not is >> currently broken, because it completely ignores standby snapshots in >> its calculation of the WAL activity. So a checkpoint always occurs >> after checkpoint_timeout on an idle system since hot_standby has been >> introduced as wal_level. This did not get better from 9.4, since >> standby snapshots are logged every 15s by the background writer >> process. In 9.6, since wal_level = 'archive' and 'hot_standby' >> actually has the same meaning, the skip logic that worked with >> wal_level = 'archive' does not do its job anymore. >> >> One solution that has been discussed is to track the progress of WAL >> activity when doing record insertion by being able to mark some >> records as not updating the progress of WAL. Standby snapshot records >> enter in this category, making the checkpoint skip logic more robust. >> Attached is a patch implementing a solution for it, by adding in > > If I understand the old thread correctly, this still doesn't > solve the main issue, excessive checkpoint and segment > switching. The reason is the interaction between XLOG_SWITCH and > checkpoint as mentioned there. I think we may treat XLOG_SWITCH > as NO_PROGRESS, since we can recover to the lastest state without > it. If it is not wrong, the second patch attached (v12-2) inserts > XLOG_SWITCH as NO_PROGRESS and skips segment switching when no > progress took place for the round. Possibly. That's a second problem I did not want to tackle now. I was going to study that more precisely after this set of patches gets done. There is already enough complication in them, and they solve a large portion of the problem. >> WALInsertLock a new field that gets updated for each record to track >> the LSN progress. This allows to reliably skip the generation of >> standby snapshots in the bgwriter or checkpoints on an idle system. > > WALInsertLock doesn't seem to me to be a good place for > progressAt even considering the discussion about adding few > instructions (containing a branch) in the > hot-section. BackgroundWriterMain blocks all running > XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for > replica, though). If this is correct, the Amit's suggestion below > will have more significance, and I rather agree with it. XLogCtl > is more suitable place for progressAt for the case. Based on my past look at the problem and memories, having a variable in WALInsertLock allows use to not have to touch the hottest spinlock code path in WAL insertion and PG: ReserveXLogInsertLocation(). I'd rather still avoid that. >> > Also, I think it is worth to once take the performance data for >> > write tests (using pgbench 30 minute run or some other way) with >> > minimum checkpoint_timeout (i.e 30s) to see if the additional locking >> > has any impact on performance. I think taking locks at intervals >> > of 15s or 30s should not matter much, but it is better to be safe. >> >> I don't think performance will be impacted, but there are no reasons >> to not do any measurements either. I'll try to get some graphs >> tomorrow with runs on my laptop, mainly looking for any effects of >> this patch on TPS when checkpoints show up. > > I don't think the impact is measurable on a laptop, where 2 to 4 > cores available at most. Yeah, I couldn't either.. Still I would like to keep the hot spinlock section as small as possible if we can. >> Per discussion with Andres at PGcon, we decided that this is an >> optimization, only for 9.7~ because this has been broken for a long >> time. I have also changed XLogIncludeOrigin() to use a more generic >> routine to set of status flags for a record being inserted: >> XLogSetFlags(). This routine can use two flags: >> - INCLUDE_ORIGIN to decide if the origin should be logged or not >> - NO_PROGRESS to decide at insertion if a record should update the LSN >> progress or not. >> Andres mentioned me that we'd want to have something similar to >> XLogIncludeOrigin, but while hacking I noticed that grouping both >> things under the same umbrella made more sense. > > This looks reasonable. Thanks. That would be fine as a first, independent patch, but that would mean that XLogSetFlags has only only value, which is a bit pointless so I grouped them. And this makes the exiting interface cleaner as well for replication origins. -- Michael
pgsql-hackers by date: