Thread: WAL Shipping + checkpoint
Hi All,

I have a cluster (PostgreSQL 8.3.7) with WAL shipping, and a few hours ago the master had to restart. I use walmgr from Skytools, which works very well.

I restarted the master without any problem, but since then the slave is not behaving as I expect. The "Time of latest checkpoint" field reported by pg_controldata on the slave keeps the same value, even though the WAL files are being processed correctly.

I tried restarting the slave, but after it has replayed all the WAL since the "Time of latest checkpoint", nothing further happens: the latest checkpoint stays at the same value.

I don't know how important this is (I think it matters), and I can't fix it. Does anybody have an idea?

Thanks,

--
Sébastien Lardière
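For reference, the value Sébastien is watching can be read directly from the standby's control file; a minimal check, with the data directory path as a placeholder:

    pg_controldata /var/lib/postgresql/8.3/main | grep "Time of latest checkpoint"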
Sébastien Lardière wrote:
> The "Time of latest checkpoint" field reported by pg_controldata on the
> slave keeps the same value, even though the WAL files are being processed
> correctly.
> [...]
> I don't know how important this is (I think it matters), and I can't fix it.

It is normal for it to lag behind somewhat on the slave (depending on your checkpoint_timeout and related settings).

However, I've noticed what you are seeing as well - particularly when there are no actual data changes coming through in the logs, the slave's checkpoint time does not change even though there have been checkpoints on the master. (I may have a look at the code to see what the story really is... if I have time.)

Cheers

Mark
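How far the standby's checkpoint time normally lags depends on the master's checkpoint settings, which Mark mentions; a quick way to inspect them on the master (connection options omitted):

    psql -c "SHOW checkpoint_timeout"
    psql -c "SHOW checkpoint_segments"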
On 26/08/2009 04:46, Mark Kirkwood wrote:
> It is normal for it to lag behind somewhat on the slave (depending on your
> checkpoint_timeout and related settings).
>
> However, I've noticed what you are seeing as well - particularly when
> there are no actual data changes coming through in the logs, the slave's
> checkpoint time does not change even though there have been checkpoints on
> the master.

Yes, but the delay between the last checkpoint on the master and the one on the slave is now very high (about 100,000 seconds), because the last checkpoint on the slave was yesterday (assuming pg_controldata is right).

Here is a graph from our munin plugin:
http://seb.ouvaton.org/tmp/bdd-pg_walmgr-week.png

The blue line represents the average interval between two WAL files processed on the slave, and the green line the delay between the last checkpoint on the master and on the slave.

Maybe it is not a good indicator, but the green line makes me think there is a problem.

Thanks,

--
Sébastien Lardière
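The delay shown by the green line can also be computed by hand from the two control files. A rough sketch, assuming the master is reachable over ssh as "master", the data directory paths below, GNU date and an English locale:

    master_ckpt=$(ssh master pg_controldata /var/lib/postgresql/8.3/main | sed -n 's/^Time of latest checkpoint: *//p')
    slave_ckpt=$(pg_controldata /var/lib/postgresql/8.3/main | sed -n 's/^Time of latest checkpoint: *//p')
    # difference in seconds between the two checkpoint timestamps
    echo $(( $(date -d "$master_ckpt" +%s) - $(date -d "$slave_ckpt" +%s) ))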
Sébastien Lardière wrote:
> Yes, but the delay between the last checkpoint on the master and the one
> on the slave is now very high (about 100,000 seconds), because the last
> checkpoint on the slave was yesterday (assuming pg_controldata is right).
>
> Here is a graph from our munin plugin:
> http://seb.ouvaton.org/tmp/bdd-pg_walmgr-week.png
>
> The blue line represents the average interval between two WAL files
> processed on the slave, and the green line the delay between the last
> checkpoint on the master and on the slave.
>
> Maybe it is not a good indicator, but the green line makes me think there
> is a problem.

Do you have archive_timeout set? If so, then what *could* be happening is this: there are actually no "real" data changes being made on your master for some reason, so every time archive_timeout is reached a log full of no changes is shipped to your slave and applied - and no checkpoint times are changed, for the reasons I mentioned above.

A way to test this would be to do something that makes real data changes on the master. A good thing to try would be to:

- create a new database
- create tables and add a reasonable amount of data (e.g. an initialized pgbench database at scale 100; see the sketch below)

Then see if your checkpoint time gets updated a few minutes or so later.
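A rough sketch of that test, run against the master (the database name is a placeholder; pgbench ships in contrib for 8.3):

    createdb pgbench_test
    pgbench -i -s 100 pgbench_test   # initialize pgbench tables at scale 100, roughly 1.5 GB of real data changes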
On 27/08/2009 00:18, Mark Kirkwood wrote:
> Do you have archive_timeout set? If so, then what *could* be happening is
> this: there are actually no "real" data changes being made on your master
> for some reason, so every time archive_timeout is reached a log full of no
> changes is shipped to your slave and applied - and no checkpoint times are
> changed, for the reasons I mentioned above.

Thanks, but we have not set archive_timeout, and we have a lot of real data changes.

That's why I don't understand why checkpoints never happen on the slave.

--
Sébastien Lardière
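archive_timeout is easy to double-check on the master; a reported value of 0 means it is not set:

    psql -c "SHOW archive_timeout"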
>> There are actually no "real" data changes being made on your master for
>> some reason, so every time archive_timeout is reached a log full of no
>> changes is shipped to your slave and applied.
>
> Thanks, but we have not set archive_timeout, and we have a lot of real
> data changes.
>
> That's why I don't understand why checkpoints never happen on the slave.

What about the other values in the pg_controldata output, do they change at all? If they remain constant, maybe the logs are just not being applied. Also, are there any "restored log file xxx from archive" entries in the postgres log?

regards,
Martin
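The other control-file fields Martin mentions can be watched with something like the following (data directory path is a placeholder); comparing successive runs shows whether anything at all is advancing on the standby:

    pg_controldata /var/lib/postgresql/8.3/main | egrep "checkpoint location|REDO location|Minimum recovery ending location"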
On 27/08/2009 18:11, Martin Pihlak wrote:
> What about the other values in the pg_controldata output, do they change
> at all? If they remain constant, maybe the logs are just not being
> applied. Also, are there any "restored log file xxx from archive" entries
> in the postgres log?

No, I don't see any change in pg_controldata, but there are messages in the logfile:

2009-08-28 10:02:51,129 26717 INFO 00000001000003F700000088: Found
2009-08-28 10:02:51,169 26717 INFO {count: 1}
2009-08-28 10:02:51 CEST [18439]: [1862-1] user=,db= LOG:  restored log file "00000001000003F700000088" from archive

postgres@bdd2:/data/postgresql/8.3/main$ grep "restored log file" /var/log/postgresql/postgresql-2009-08-28.log | wc -l
164

And I can see that the cluster is being updated, at least inside the pg_xlog and base directories. So it looks like it is working, but when I restart the slave cluster it processes again all the WAL since the last checkpoint (for nothing, I think). So when we actually need this slave it will take more and more time to come up, and that worries me a bit.

--
Sébastien Lardière
Sébastien Lardière wrote:
> No, I don't see any change in pg_controldata, but there are messages in
> the logfile:
>
> 2009-08-28 10:02:51,129 26717 INFO 00000001000003F700000088: Found
> 2009-08-28 10:02:51,169 26717 INFO {count: 1}
> 2009-08-28 10:02:51 CEST [18439]: [1862-1] user=,db= LOG:  restored log
> file "00000001000003F700000088" from archive

This is weird; it does indeed seem that for some reason the recovery restartpoints are not being created.

Looking quickly at RecoveryRestartPoint() in xlog.c, there are two cases in which it doesn't do a checkpoint. For one thing, it checks whether enough time has elapsed since the last control file modification; if not, the checkpoint is skipped. I'm wondering if it does the correct thing when the clocks of the two machines are too far apart.

The other check is "is it safe to do a checkpoint". This is logged at DEBUG2, so it should be visible if you set the logging level accordingly. Alternatively, you could attach a debugger to the recovery process and see whether the RecoveryRestartPoint() and CheckPointGuts() functions are called at all.

regards,
Martin
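Both suggestions are straightforward to try on the standby. A sketch, with the PID taken from the log excerpt above and all paths assumed:

    # in the standby's postgresql.conf (set before starting the standby), to make DEBUG2 messages visible:
    log_min_messages = debug2

    # or attach gdb to the recovery (startup) process and watch the two functions Martin names:
    gdb -p 18439
    (gdb) break RecoveryRestartPoint
    (gdb) break CheckPointGuts
    (gdb) continue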
On Fri, 2009-08-28 at 17:55 +0300, Martin Pihlak wrote:
> Looking quickly at RecoveryRestartPoint() in xlog.c, there are two cases
> in which it doesn't do a checkpoint. For one thing, it checks whether
> enough time has elapsed since the last control file modification; if not,
> the checkpoint is skipped. I'm wondering if it does the correct thing when
> the clocks of the two machines are too far apart.

Probably not that.

> The other check is "is it safe to do a checkpoint". This is logged at
> DEBUG2, so it should be visible if you set the logging level accordingly.

This seems like the most likely cause. I would guess that one of your GiST indexes is corrupted and is preventing a restartpoint from taking place. That has happened before.

> Alternatively, you could attach a debugger to the recovery process and see
> whether the RecoveryRestartPoint() and CheckPointGuts() functions are
> called at all.

I would say there is no need for that, but you can check the GiST pending actions table.

--
Simon Riggs
www.2ndQuadrant.com
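As a first step it can help to know which GiST indexes exist on the master, so there is a list of suspects to look at. A hedged example query (the database name is a placeholder; this simply enumerates GiST indexes, it is not the pending-actions structure Simon refers to):

    psql -d mydb -c "SELECT c.relname AS index_name, t.relname AS table_name FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid JOIN pg_class t ON t.oid = i.indrelid JOIN pg_am a ON a.oid = c.relam WHERE a.amname = 'gist'"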
On 28/08/2009 17:13, Simon Riggs wrote:
>> The other check is "is it safe to do a checkpoint". This is logged at
>> DEBUG2, so it should be visible if you set the logging level accordingly.
>
> This seems like the most likely cause. I would guess that one of your GiST
> indexes is corrupted and is preventing a restartpoint from taking place.
> That has happened before.

You're right. On the master, we had an error about a GiST index on an ltree column:

Aug 25 09:07:02 eurovox-bdd postgres[5575]: [18-1] user=eurovox,db=eurovox_db PANIC:  failed to add item to index page in "appels_entrants_arbo_key"

The cluster tried to restart (it failed) and I had to force it to stop. Since that moment, the slave has not made any checkpoint.

Now we know why. Thanks a lot!

But how can I fix it?

--
Sébastien Lardière
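If the table data on the master is intact, the usual way to get rid of a corrupted index is to rebuild it. A hedged sketch using the index and database names from the log above (this repairs the master's copy only; it does not by itself fix the standby):

    psql -d eurovox_db -c "REINDEX INDEX appels_entrants_arbo_key"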
On Fri, 2009-08-28 at 17:54 +0200, Sébastien Lardière wrote:
> Since that moment, the slave has not made any checkpoint.
>
> Now we know why. Thanks a lot!
>
> But how can I fix it?

Current issue: rebuild the standby from a base backup.

Cause: we need to locate the bug in the index AM that causes this.

Symptom: currently index AMs don't detect the situation where an index is corrupt, so they keep trying to recover the index forever, even though attempting to do so permanently prevents restartpoints. I would also like them to have a mechanism for marking an index as corrupt and then auto-rebuilding it after recovery.

--
Simon Riggs
www.2ndQuadrant.com
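In this thread walmgr automates the base backup; for reference, the manual 8.3 equivalent of "rebuild the standby from a base backup" looks roughly like this (host name and paths are placeholders, and a suitable recovery.conf on the standby is assumed):

    # on the master
    psql -c "SELECT pg_start_backup('standby_rebuild')"
    rsync -a --exclude=pg_xlog /var/lib/postgresql/8.3/main/ standby:/var/lib/postgresql/8.3/main/
    psql -c "SELECT pg_stop_backup()"
    # then start the standby in recovery so it resumes replaying shipped WAL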
On 28/08/2009 18:14, Simon Riggs wrote:
> Current issue: rebuild the standby from a base backup.

Thanks, I restored a backup on the slave this morning, and it works!

Thanks a lot,

--
Sébastien Lardière
On 31/08/2009, at 6:16 PM, Sébastien Lardière wrote:
> Thanks, I restored a backup on the slave this morning, and it works!

Could you detail your solution, please, if any? I've seen frozen pg_controldata output on my standby server for ages and attributed it to the ancient version of PostgreSQL (8.0.x) I'm stuck with. There have been no index-related error messages in my log, though.

Thank you!

Yar
On Tue, 2009-09-01 at 13:14 +1000, Yaroslav Tykhiy wrote:
> Could you detail your solution, please, if any? I've seen frozen
> pg_controldata output on my standby server for ages and attributed it to
> the ancient version of PostgreSQL (8.0.x) I'm stuck with.

It won't ever work before 8.2.

--
Simon Riggs
www.2ndQuadrant.com
On 01/09/2009 05:14, Yaroslav Tykhiy wrote:
>> Thanks, I restored a backup on the slave this morning, and it works!
>
> Could you detail your solution, please, if any? I've seen frozen
> pg_controldata output on my standby server for ages and attributed it to
> the ancient version of PostgreSQL (8.0.x) I'm stuck with. There have been
> no index-related error messages in my log, though.

Hi, sorry for my late reply.

We use walmgr.py from Skytools (http://pgfoundry.org/projects/skytools), a great tool which makes backup, restore and WAL shipping easy.

In this case, I used the 'restore' sub-command on the slave, which takes the last nightly backup we made and starts a slave cluster doing WAL shipping, and it works - but we use PostgreSQL 8.3.

--
Sébastien
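The exact walmgr invocation depends on the local configuration files, so the following is only an assumed sketch (config file paths are placeholders; the command names follow the description above):

    # on the master: take a fresh base backup and ship it
    walmgr.py /etc/walmgr/master.ini backup
    # on the slave: restore that backup and resume WAL replay
    walmgr.py /etc/walmgr/slave.ini restore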