Thread: BUG #14416: checkpoints never completed
VGhlIGZvbGxvd2luZyBidWcgaGFzIGJlZW4gbG9nZ2VkIG9uIHRoZSB3ZWJz aXRlOgoKQnVnIHJlZmVyZW5jZTogICAgICAxNDQxNgpMb2dnZWQgYnk6ICAg ICAgICAgIEpvbiBOZWxzb24KRW1haWwgYWRkcmVzczogICAgICBqZG5lbHNv bkBkeW4uY29tClBvc3RncmVTUUwgdmVyc2lvbjogOS40LjkKT3BlcmF0aW5n IHN5c3RlbTogICBMaW51eApEZXNjcmlwdGlvbjogICAgICAgIAoKUG9zdGdy ZVNRTCA5LjQuOSBvbiBDZW50T1MtNywgeDg2XzY0LCBvbiBleHQ0Lg0KDQoN CldlIGVuY291bnRlcmVkIGEgcHJvYmxlbSB3aGVyZSBjaGVja3BvaW50cyBh cHBlYXJlZCB0byBxdWV1ZSB1cCBhbmQgbmV2ZXIKZmluaXNoLg0KDQoNClRo ZSBsb2dzIHNob3cgdGhhdCBjaGVja3BvaW50cyB3ZXJlIGhhcHBlbmluZyBu b3JtYWxseSwgb25jZSBwZXIgbWludXRlCih3aGljaCBpcyBvdXIgZGVzaXJl ZCBjb25maWd1cmF0aW9uIGluIHRoaXMgY2FzZSkuIA0KDQoNCjAxOjA4OjI3 IHRoZSBzeXN0ZW0gY2xvY2sgd2FzIHN5bmNocm9uaXplZC4gSXQgd2FzIGJl aGluZCBieSBhYm91dCBvbmUKaG91ci4NCjAxOjIzOjMxIGEgYmFja3VwIChw Z19zdGFydF9iYWNrdXApIGlzIHN0YXJ0ZWQNCjAzOjQwOjE0IHdlIHNhdyBv dXIgZmlyc3Qg4oCcb3V0IG9mIGRpc2sgc3BhY2XigJ0gbWVzc2FnZQ0KMDM6 NDc6NTEgd2Ugc2F3IOKAnGNoZWNrcG9pbnQgc3RhcnRpbmc6IHRpbWXigJ0g KHdpdGggbm8gY29ycmVzcG9uZGluZwrigJxjaGVja3BvaW50IGNvbXBsZXRl 4oCdIG1lc3NhZ2VzKSAqb25jZSBwZXIgc2Vjb25kKi4gV2UgYXJlIGFsc28g b3V0IG9mIGRpc2sKc3BhY2UgYXQgdGhpcyB0aW1lLg0KMDQ6MDA6MDEgd2Ug c2VlIGEg4oCYY2hlY2twb2ludCBjb21wbGV0ZeKAmSBtZXNzYWdlIChkZXNw aXRlIHN0aWxsIGJlaW5nIG91dCBvZgpkaXNrIHNwYWNlISkNCg0KDQpDaGVj a3BvaW50cyBhcHBlYXIgbm9ybWFsIHVudGlsIDQ6MDc6MDEgYXQgd2hpY2gg cG9pbnQgdGhlIHRoZSDigJxjaGVja3BvaW50CnN0YXJ0OiB0aW1l4oCdIG1l c3NhZ2Ugb2NjdXJzIG9uY2UgcGVyICpzZWNvbmQqLiAgU3RpbGwgb3V0IG9m IGRpc2sgc3BhY2UuDQoNCg0KMDU6NTE6MTMgd2Ugc2VlIHRoZSBmaXJzdCDi gJxjaGVja3BvaW50IGNvbXBsZXRl4oCdLg0KMDU6NTE6MjIgKDcgc2Vjb25k cyBsYXRlcikgd2Ugc2VlIGEg4oCcY2hlY2twb2ludCBzdGFydGluZzogdGlt ZeKAnSBtZXNzYWdlLiAgTm8KbWVzc2FnZXMgY29udGFpbmluZyAiY2hlY2tw b2ludCIgYXBwZWFyIGFmdGVyIHRoaXMuDQoNCg0KMDc6NDM6NTQgYW4gZXJy b3IgbWVzc2FnZSBpbmRpY2F0aW5nIHRoYXQg4oCccGdfc3RvcF9iYWNrdXAo KeKAnSBzYXlzIGEgYmFja3VwCmlzbuKAmXQgaW4gcHJvZ3Jlc3MuIE5PVEU6 IHdlIGRvICpub3QqIHNlZSBhIHN1Y2Nlc3NmdWwgcGdfc3RvcF9iYWNrdXAo KSwgYnV0Cml04oCZcyBwb3NzaWJsZSB0aGF0IGl0IHRvb2sgbGVzcyB0aGFu IDUwMG1zIHdoaWNoIGlzIG91cgpsb2dfbWluX2R1cmF0aW9uX3N0YXRlbWVu dCB0aHJlc2hvbGQuDQoNCg0KMjM6MTk6MzIgb3VyIGF0dGVtcHRzIGF0IGEg bWFudWFsIENIRUNLUE9JTlQgYWxsIGhhbmcsIGFuZCB0aGUgb25seSBtZXNz YWdlcwp3aXRoIOKAmGNoZWNrcG9pbnTigJkgaW4gdGhlbSBhcmUgbm90aWNl cyBvZiBvdXIgY2FuY2VsbGF0aW9ucy4NCg0KDQpXZSBjaG9zZSB0byBzaHV0 IHRoZSBpbnN0YW5jZSBkb3duLiAgT3ZlciA0MDBHQiBvZiBXQUwgZmlsZXMg aGFkIGFjY3J1ZWQgZHVlCnRvIG5vIGNoZWNrcG9pbnQgY29tcGxldGluZy4g VGhlIGluc3RhbmNlIGNhbWUgdXAgd2l0aG91dCBpc3N1ZSBhbmQgcmVzdW1l ZApvcGVyYXRpb25zLg0KCgo=
jdnelson@dyn.com writes: > We encountered a problem where checkpoints appeared to queue up and never > finish. > ... > 01:08:27 the system clock was synchronized. It was behind by about one > hour. I wonder if this is a problem similar to the autovacuum issue we fixed in da1a9d0f5, ie perhaps moving the system clock setting confuses the checkpoint timing logic. What checkpoint-related settings are you using? Can you be more specific about the system clock change? regards, tom lane
On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > jdnelson@dyn.com writes: > > We encountered a problem where checkpoints appeared to queue up and never > > finish. > > ... > > 01:08:27 the system clock was synchronized. It was behind by about one > > hour. > > I wonder if this is a problem similar to the autovacuum issue we fixed > in da1a9d0f5, ie perhaps moving the system clock setting confuses the > checkpoint timing logic. > That is more or less what we were thinking as well. > What checkpoint-related settings are you using? Can you be more specific > about the system clock change? > checkpoint_completion_target = 0.85 checkpoint_segments = 2048 checkpoint_timeout = 60 We have noticed that the system clock has not been the most reliable. In this case, though, there is only evidence of a single clock correction at the noted time. No other corrections or drift during the actual event were detected, but that doesn't mean that there weren't any. -- Jon Nelson
Jonathon Nelson <jdnelson@dyn.com> writes: > On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder if this is a problem similar to the autovacuum issue we fixed >> in da1a9d0f5, ie perhaps moving the system clock setting confuses the >> checkpoint timing logic. > That is more or less what we were thinking as well. Looking at the logic around this in checkpointer.c, it's pretty obvious that it would not behave nicely if system time goes backwards after a checkpoint starts; it would think it was ahead of schedule and would just loaf, basically, until the clock catches up to where it had been. There's no sanity check to notice a negative elapsed-time reading. But if system time goes forwards, it would think it was very far behind schedule and would do a burst of work, which doesn't seem to match your symptom. Please confirm the sign of the system clock correction that happened on your machine? regards, tom lane
On Mon, Nov 7, 2016 at 4:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Jonathon Nelson <jdnelson@dyn.com> writes: > > On Mon, Nov 7, 2016 at 1:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I wonder if this is a problem similar to the autovacuum issue we fixed > >> in da1a9d0f5, ie perhaps moving the system clock setting confuses the > >> checkpoint timing logic. > > > That is more or less what we were thinking as well. > > Looking at the logic around this in checkpointer.c, it's pretty obvious > that it would not behave nicely if system time goes backwards after a > checkpoint starts; it would think it was ahead of schedule and would > just loaf, basically, until the clock catches up to where it had been. > There's no sanity check to notice a negative elapsed-time reading. > But if system time goes forwards, it would think it was very far behind > schedule and would do a burst of work, which doesn't seem to match your > symptom. > > Please confirm the sign of the system clock correction that happened > on your machine? > Before responding, I triple checked everything I have. I did make a mistake, but it's one of scale: the time went forward 1d, 57m, and 1.7s (earlier I said it was about an hour). Prior to the event, the system clock was all over the place, however I cannot find evidence of any further time corrections. This is a busy system and easily logs more than once a second, so I chose to identify time jumps by taking the logs (in the order they appeared in!) and subtracting the previous log's timestamp. If the difference was either negative or greater than 2 seconds, I set it aside. I did not find any such events during this timeframe. -- Jon Nelson