Thread: BUG #14416: checkpoints never completed

BUG #14416: checkpoints never completed

From
jdnelson@dyn.com
Date:
VGhlIGZvbGxvd2luZyBidWcgaGFzIGJlZW4gbG9nZ2VkIG9uIHRoZSB3ZWJz
aXRlOgoKQnVnIHJlZmVyZW5jZTogICAgICAxNDQxNgpMb2dnZWQgYnk6ICAg
ICAgICAgIEpvbiBOZWxzb24KRW1haWwgYWRkcmVzczogICAgICBqZG5lbHNv
bkBkeW4uY29tClBvc3RncmVTUUwgdmVyc2lvbjogOS40LjkKT3BlcmF0aW5n
IHN5c3RlbTogICBMaW51eApEZXNjcmlwdGlvbjogICAgICAgIAoKUG9zdGdy
ZVNRTCA5LjQuOSBvbiBDZW50T1MtNywgeDg2XzY0LCBvbiBleHQ0Lg0KDQoN
CldlIGVuY291bnRlcmVkIGEgcHJvYmxlbSB3aGVyZSBjaGVja3BvaW50cyBh
cHBlYXJlZCB0byBxdWV1ZSB1cCBhbmQgbmV2ZXIKZmluaXNoLg0KDQoNClRo
ZSBsb2dzIHNob3cgdGhhdCBjaGVja3BvaW50cyB3ZXJlIGhhcHBlbmluZyBu
b3JtYWxseSwgb25jZSBwZXIgbWludXRlCih3aGljaCBpcyBvdXIgZGVzaXJl
ZCBjb25maWd1cmF0aW9uIGluIHRoaXMgY2FzZSkuIA0KDQoNCjAxOjA4OjI3
IHRoZSBzeXN0ZW0gY2xvY2sgd2FzIHN5bmNocm9uaXplZC4gSXQgd2FzIGJl
aGluZCBieSBhYm91dCBvbmUKaG91ci4NCjAxOjIzOjMxIGEgYmFja3VwIChw
Z19zdGFydF9iYWNrdXApIGlzIHN0YXJ0ZWQNCjAzOjQwOjE0IHdlIHNhdyBv
dXIgZmlyc3Qg4oCcb3V0IG9mIGRpc2sgc3BhY2XigJ0gbWVzc2FnZQ0KMDM6
NDc6NTEgd2Ugc2F3IOKAnGNoZWNrcG9pbnQgc3RhcnRpbmc6IHRpbWXigJ0g
KHdpdGggbm8gY29ycmVzcG9uZGluZwrigJxjaGVja3BvaW50IGNvbXBsZXRl
4oCdIG1lc3NhZ2VzKSAqb25jZSBwZXIgc2Vjb25kKi4gV2UgYXJlIGFsc28g
b3V0IG9mIGRpc2sKc3BhY2UgYXQgdGhpcyB0aW1lLg0KMDQ6MDA6MDEgd2Ug
c2VlIGEg4oCYY2hlY2twb2ludCBjb21wbGV0ZeKAmSBtZXNzYWdlIChkZXNw
aXRlIHN0aWxsIGJlaW5nIG91dCBvZgpkaXNrIHNwYWNlISkNCg0KDQpDaGVj
a3BvaW50cyBhcHBlYXIgbm9ybWFsIHVudGlsIDQ6MDc6MDEgYXQgd2hpY2gg
cG9pbnQgdGhlIHRoZSDigJxjaGVja3BvaW50CnN0YXJ0OiB0aW1l4oCdIG1l
c3NhZ2Ugb2NjdXJzIG9uY2UgcGVyICpzZWNvbmQqLiAgU3RpbGwgb3V0IG9m
IGRpc2sgc3BhY2UuDQoNCg0KMDU6NTE6MTMgd2Ugc2VlIHRoZSBmaXJzdCDi
gJxjaGVja3BvaW50IGNvbXBsZXRl4oCdLg0KMDU6NTE6MjIgKDcgc2Vjb25k
cyBsYXRlcikgd2Ugc2VlIGEg4oCcY2hlY2twb2ludCBzdGFydGluZzogdGlt
ZeKAnSBtZXNzYWdlLiAgTm8KbWVzc2FnZXMgY29udGFpbmluZyAiY2hlY2tw
b2ludCIgYXBwZWFyIGFmdGVyIHRoaXMuDQoNCg0KMDc6NDM6NTQgYW4gZXJy
b3IgbWVzc2FnZSBpbmRpY2F0aW5nIHRoYXQg4oCccGdfc3RvcF9iYWNrdXAo
KeKAnSBzYXlzIGEgYmFja3VwCmlzbuKAmXQgaW4gcHJvZ3Jlc3MuIE5PVEU6
IHdlIGRvICpub3QqIHNlZSBhIHN1Y2Nlc3NmdWwgcGdfc3RvcF9iYWNrdXAo
KSwgYnV0Cml04oCZcyBwb3NzaWJsZSB0aGF0IGl0IHRvb2sgbGVzcyB0aGFu
IDUwMG1zIHdoaWNoIGlzIG91cgpsb2dfbWluX2R1cmF0aW9uX3N0YXRlbWVu
dCB0aHJlc2hvbGQuDQoNCg0KMjM6MTk6MzIgb3VyIGF0dGVtcHRzIGF0IGEg
bWFudWFsIENIRUNLUE9JTlQgYWxsIGhhbmcsIGFuZCB0aGUgb25seSBtZXNz
YWdlcwp3aXRoIOKAmGNoZWNrcG9pbnTigJkgaW4gdGhlbSBhcmUgbm90aWNl
cyBvZiBvdXIgY2FuY2VsbGF0aW9ucy4NCg0KDQpXZSBjaG9zZSB0byBzaHV0
IHRoZSBpbnN0YW5jZSBkb3duLiAgT3ZlciA0MDBHQiBvZiBXQUwgZmlsZXMg
aGFkIGFjY3J1ZWQgZHVlCnRvIG5vIGNoZWNrcG9pbnQgY29tcGxldGluZy4g
VGhlIGluc3RhbmNlIGNhbWUgdXAgd2l0aG91dCBpc3N1ZSBhbmQgcmVzdW1l
ZApvcGVyYXRpb25zLg0KCgo=

Re: BUG #14416: checkpoints never completed

From
Tom Lane
Date:
jdnelson@dyn.com writes:
> We encountered a problem where checkpoints appeared to queue up and never
> finish.
> ...
> 01:08:27 the system clock was synchronized. It was behind by about one
> hour.

I wonder if this is a problem similar to the autovacuum issue we fixed
in da1a9d0f5, ie perhaps moving the system clock setting confuses the
checkpoint timing logic.

What checkpoint-related settings are you using?  Can you be more specific
about the system clock change?

            regards, tom lane

Re: BUG #14416: checkpoints never completed

From
Jonathon Nelson
Date:
On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> jdnelson@dyn.com writes:
> > We encountered a problem where checkpoints appeared to queue up and never
> > finish.
> > ...
> > 01:08:27 the system clock was synchronized. It was behind by about one
> > hour.
>
> I wonder if this is a problem similar to the autovacuum issue we fixed
> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
> checkpoint timing logic.
>

That is more or less what we were thinking as well.


> What checkpoint-related settings are you using?  Can you be more specific
> about the system clock change?
>

checkpoint_completion_target = 0.85
checkpoint_segments = 2048
checkpoint_timeout = 60

We have noticed that the system clock has not been the most reliable. In
this case, though, there is only evidence of a single clock correction at
the noted time. No other corrections or drift during the actual event were
detected, but that doesn't mean that there weren't any.

--
Jon Nelson

Re: BUG #14416: checkpoints never completed

From
Tom Lane
Date:
Jonathon Nelson <jdnelson@dyn.com> writes:
> On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if this is a problem similar to the autovacuum issue we fixed
>> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
>> checkpoint timing logic.

> That is more or less what we were thinking as well.

Looking at the logic around this in checkpointer.c, it's pretty obvious
that it would not behave nicely if system time goes backwards after a
checkpoint starts; it would think it was ahead of schedule and would
just loaf, basically, until the clock catches up to where it had been.
There's no sanity check to notice a negative elapsed-time reading.
But if system time goes forwards, it would think it was very far behind
schedule and would do a burst of work, which doesn't seem to match your
symptom.

Please confirm the sign of the system clock correction that happened
on your machine?

            regards, tom lane

Re: BUG #14416: checkpoints never completed

From
Jonathon Nelson
Date:
On Mon, Nov 7, 2016 at 4:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Jonathon Nelson <jdnelson@dyn.com> writes:
> > On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I wonder if this is a problem similar to the autovacuum issue we fixed
> >> in da1a9d0f5, ie perhaps moving the system clock setting confuses the
> >> checkpoint timing logic.
>
> > That is more or less what we were thinking as well.
>
> Looking at the logic around this in checkpointer.c, it's pretty obvious
> that it would not behave nicely if system time goes backwards after a
> checkpoint starts; it would think it was ahead of schedule and would
> just loaf, basically, until the clock catches up to where it had been.
> There's no sanity check to notice a negative elapsed-time reading.
> But if system time goes forwards, it would think it was very far behind
> schedule and would do a burst of work, which doesn't seem to match your
> symptom.
>
> Please confirm the sign of the system clock correction that happened
> on your machine?
>

Before responding, I triple checked everything I have. I did make a
mistake, but it's one of scale:  the time went forward 1d, 57m, and 1.7s
(earlier I said it was about an hour). Prior to the event, the system clock
was all over the place, however I cannot find evidence of any further time
corrections. This is a busy system and easily logs more than once a second,
so I chose to identify time jumps by taking the logs (in the order they
appeared in!) and subtracting the previous log's timestamp. If the
difference was either negative or greater than 2 seconds, I set it aside. I
did not find any such events during this timeframe.



--
Jon Nelson