Thread: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two cases they were for the same client (9.0.14 and 9.3.1), in one for a different client (9.2.5). The details of each incident are similar, but not identical:

INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using rsync off of an existing, correct primary (P1) for the base backup, and using WAL-E for WAL segment shipping. Both the primary and secondary were running 9.0.14. S1 properly connected to the primary once it was caught up on WAL segments, and S1 was then promoted as a primary using the trigger file. No errors in the log files on either system.

After promotion, it was discovered that there was significant data loss on S1. Rows that were present on P1 were missing on S1, and some rows were duplicated (including duplicates that violated primary key and other unique constraints). The indexes were corrupt, in that they seemed to think that the duplicates were not duplicated, and that the missing rows were still present.

Because the client's schema included a "last_updated" field, we were able to determine that all of the rows that were either missing or duplicated had been updated on P1 shortly (3-5 minutes) before S1 was promoted. It's possible, but not confirmed, that there were active queries (including updates) running on P1 at the moment of S1's promotion.

As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

P1 and S1 were both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume. P1 and S1 have been destroyed at this point.

INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1, after having dropped the primary and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, now a new primary, was then put back into production, and ran without incident.

A new secondary, S2, was created off of the primary. This secondary was created using pg_basebackup with --xlog-method=stream, although the WAL-E archiving was still present.

S2 attached to P2 without incident and with no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows). At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2. No duplicated rows were observed.

P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume. No errors in the log files on either system. P2 and S2 are still operational.

INCIDENT #3: 9.2.5 -- A client was migrating a large database from a 9.2.2 system (P3) to a new 9.2.5 system (S3) using streaming replication. As I personally didn't do the steps on this one, I don't have quite as much information, but the basics are close to incident #2: When S3 was promoted using the trigger file, no errors were observed and the database came up normally, but rows were missing from S3 that were present on P3.

P3 is running CentOS 6.3 with ext4 as the file system. S3 is running CentOS 6.4 with ext3 as the file system. Log shipping in this case was done via rsync. P3 and S3 are still operational.
No errors in the log files on either system.

Obviously, we're very concerned that a bug was introduced in the latest minor release. We're happy to gather data as required to assist in diagnosing this. -- -- Christophe Pettus xof@thebuild.com
On 11/18/2013 10:58 AM, Christophe Pettus wrote:
> Three times in the last two weeks, we have experienced data corruption on secondary servers using streaming replication on client systems. The versions involved are 9.0.14, 9.2.5, and 9.3.1. Each incident was separate; in two cases they were for the same client (9.0.14 and 9.3.1), in one for a different client (9.2.5).

To emphasize a salient point: we have not seen data corruption with these exact symptoms prior to the recent patch release. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 10:58 AM, Christophe Pettus <xof@thebuild.com> wrote:
> As a note, P1 was created from another system (let's call it P0) using just WAL shipping (no streaming replication), and no data corruption was observed.

As another data point, P0 was running 9.0.13, rather than 9.0.14. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
Hi,

On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote:
> INCIDENT #1: 9.0.14 -- A new secondary (S1) was initialized using
> rsync off of an existing, correct primary (P1) for the base backup,
> and using WAL-E for WAL segment shipping. Both the primary and
> secondary were running 9.0.14. S1 properly connected to the primary
> once it was caught up on WAL segments, and S1 was then promoted as
> a primary using the trigger file.

Could you detail how exactly the base backup was created? Including the *exact* logic for copying?

> No errors in the log files on either system.

Do you have the log entries for the startup after the base backup?

> Because the client's schema included a "last_updated" field, we were
> able to determine that all of the rows that were either missing or
> duplicated had been updated on P1 shortly (3-5 minutes) before S1 was
> promoted. It's possible, but not confirmed, that there were active
> queries (including updates) running on P1 at the moment of S1's
> promotion.

Any chance you have log_checkpoints enabled? If so, could you check whether there was a checkpoint around the time of the base backup?

This server is gone, right? If not, could you do: SELECT ctid, xmin, xmax, * FROM whatever WHERE duplicate_row?

> INCIDENT #2: 9.3.1 -- In order to repair the database, a pg_dump was taken of S1, after having dropped the primary and unique constraints, and restored into a new 9.3.1 server, P2. Duplicate rows were purged, and missing rows were added again. The database, now a new primary, was then put back into production, and ran without incident.
>
> A new secondary, S2, was created off of the primary. This secondary was created using pg_basebackup with --xlog-method=stream, although the WAL-E archiving was still present.
>
> S2 attached to P2 without incident and with no errors in the logs, but nearly-identical corruption was discovered (although this time without the duplicated rows, just missing rows). At this point, it's not clear if there was some clustering in the "last_updated" timestamp for the rows that are missing from S2. No duplicated rows were observed.
>
> P2 and S2 are both AWS instances running Ubuntu 12.04, using EBS (with xfs as the file system) as the data volume.
>
> No errors in the log files on either system.

Could you list the *exact* steps you did to start up the cluster?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
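For reference, the duplicate-hunting query Andres sketches could take a form like the following; "orders" and its key column "id" are placeholders, since the thread never names the affected table:

    SELECT ctid, xmin, xmax, o.*
    FROM orders AS o
    WHERE id IN (SELECT id
                 FROM orders
                 GROUP BY id
                 HAVING count(*) > 1)
    ORDER BY id, ctid;

The ctid shows each tuple's physical location, and xmin/xmax show which transactions created and deleted it, which is the metadata needed to work out how the duplicates arose.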
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 11:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Could you detail how exactly the base backup was created? Including the
> *exact* logic for copying?

0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
1. pg_start_backup('', true) on P1.
2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S1.
3. Once the rsync was complete, pg_stop_backup() on P1.
4. Create appropriate recovery.conf on S1.
5. Bring up PostgreSQL on S1.
6. PostgreSQL recovers normally (pulling WAL segments from WAL-E), and eventually connects to P1.

> Do you have the log entries for the startup after the base backup?

Sadly, not anymore.

> This server is gone, right?

Correct.

> Could you list the *exact* steps you did to start up the cluster?

0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
1. Initial (empty) data directory deleted on S2.
2. New data directory created with:

/usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main

3. Once the pg_basebackup completed, create appropriate recovery.conf on S2.
4. Bring up PostgreSQL on S2.
5. PostgreSQL recovers normally (pulling a small number of WAL segments from WAL-E), and eventually connects to P2.

-- -- Christophe Pettus xof@thebuild.com
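To make the first procedure concrete, the backup bracket around the rsync is just two SQL calls on P1; the label string below is an arbitrary placeholder (the steps above show an empty label being used):

    -- On P1, before the copy; the second argument requests an immediate checkpoint:
    SELECT pg_start_backup('s1_base_backup', true);

    -- ... rsync -av of $PGDATA from P1 to S1 runs outside the database ...

    -- On P1, once the rsync has finished:
    SELECT pg_stop_backup();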
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-18 11:38:43 -0800, Christophe Pettus wrote:
> On Nov 18, 2013, at 11:28 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Could you detail how exactly the base backup was created? Including the
> > *exact* logic for copying?
>
> 0. Before any of this began, P1 was archiving WAL segments to AWS-S3.
> 1. pg_start_backup('', true) on P1.
> 2. Using rsync -av on P1, the entire $PGDATA directory was pushed from P1 to S1.

Without deleting any data, including pg_xlog/, backup.label, anything?

Did you have hot_standby enabled on all of those machines? Even on the 9.0.13 cluster?

> > Could you list the *exact* steps you did to start up the cluster?
>
> 0. Before any of this began, P2 was archiving WAL segments to AWS-S3.
> 1. Initial (empty) data directory deleted on S2.
> 2. New data directory created with:
>
> /usr/lib/postgresql/9.3/bin/pg_basebackup --verbose --progress --xlog-method=stream --host=<ip> --user=repluser --pgdata=/data/9.3/main
>
> 3. Once the pg_basebackup completed, create appropriate recovery.conf on S2.

That was just recovery command and primary conninfo?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Without deleting any data, including pg_xlog/, backup.label, anything? Correct. > Did you have hot_standby enabled on all of those machines? Even on the > 9.0.13 cluster? Yes. > That was just recovery command and primary conninfo? Correct. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Did you have hot_standby enabled on all of those machines? Even on the
> 9.0.13 cluster?

Actually, it's a bit more complex than this:

1. We don't know about P0, the 9.0.13 machine. I assume it was, but it was managed elsewhere.
2. P1 never had hot_standby = 'on', as it was never intended to be a hot standby.
3. S1 did have hot_standby = 'on'.

-- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 11:47 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Without deleting any data, including pg_xlog/, backup.label, anything? One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 12:00 PM, Christophe Pettus <xof@thebuild.com> wrote: > One more correction: After rsync finished and the pg_base_backup() was issued, the contents of pg_xlog/ on S1 were deleted. pg_stop_backup(), sorry. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-18 10:58:26 -0800, Christophe Pettus wrote: > After promotion, it was discovered that there was significant data > loss on S1. Rows that were present on P1 were missing on S1, and some > rows were duplicated (including duplicates that violated primary key > and other unique constraints). The indexes were corrupt, in that they > seemed to think that the duplicates were not duplicated, and that the > missing rows were still present. Were there any kind of patterns in the lost data? What kind of workload are they running? I have an idea what the issue might be... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 12:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Were there any kind of patterns in the lost data? What kind of workload
> are they running? I have an idea what the issue might be...

In the P1 -> S1 case, the corrupted data was data modified in the last few minutes before the switchover. I don't want to over-analyze, but it was within the checkpoint_timeout value for that server.

In the P2 -> S2 case, it's less obvious what the pattern is, since there was no cutover.

Insufficient information on the P3 -> S3 case.

Each of them is a reasonably high-volume OLTP-style workload. The P1/P2 client has a very high level of writes; the P3 client is more read-heavy, but still a fair number of writes. -- -- Christophe Pettus xof@thebuild.com
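For what it's worth, the time clustering described here could be checked with a query along these lines; "orders" and "last_updated" stand in for the client's table and column, and the 30-minute window is an arbitrary choice:

    SELECT date_trunc('minute', last_updated) AS updated_minute,
           count(*) AS rows_updated
    FROM orders
    WHERE last_updated > now() - interval '30 minutes'
    GROUP BY 1
    ORDER BY 1;

Run against the missing or duplicated rows on the old primary, a spike in the minutes just before promotion would match the pattern described above.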
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
Hi, Afaics it's likely a combination/interaction of bugs and fixes between: * the initial HS code * 5a031a5556ff83b8a9646892715d7fef415b83c3 * f44eedc3f0f347a856eea8590730769125964597 But that'd mean nobody noticed it during 9.3's beta... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
Great! If there's any further data I can supply to help, let me know. On Nov 18, 2013, at 2:15 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > Afaics it's likely a combination/interaction of bugs and fixes between: > * the initial HS code > * 5a031a5556ff83b8a9646892715d7fef415b83c3 > * f44eedc3f0f347a856eea8590730769125964597 > > But that'd mean nobody noticed it during 9.3's beta... > > Greetings, > > Andres Freund > > -- > Andres Freund http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-18 14:25:58 -0800, Christophe Pettus wrote: > Great! If there's any further data I can supply to help, let me know. Trying to reproduce the issue with and without hot_standby=on would be very helpful, but I guess that's time consuming? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 18, 2013, at 2:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Trying to reproduce the issue with and without hot_standby=on would be > very helpful, but I guess that's time consuming? I've been working on it, but I haven't gotten it to fail yet. I'll keep at it. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
Hi,

On 2013-11-18 23:15:59 +0100, Andres Freund wrote:
> Afaics it's likely a combination/interaction of bugs and fixes between:
> * the initial HS code
> * 5a031a5556ff83b8a9646892715d7fef415b83c3
> * f44eedc3f0f347a856eea8590730769125964597

Yes, the combination of those is guilty. Man, this is (to a good part my) bad.

> But that'd mean nobody noticed it during 9.3's beta...

It's fairly hard to reproduce artificially since a) there have to be enough transactions starting and committing from the start of the checkpoint the standby is starting from to the point it does LogStandbySnapshot() to cross a 32768 boundary and b) hint bits often save the game by not accessing clog at all anymore and thus not noticing the corruption.

I've reproduced the issue by having an INSERT ONLY table that's never read from. It's helpful to disable autovacuum.

Imo something like the attached patch should be done. The description I came up with is:

    Fix Hot-Standby initialization of clog and subtrans.

    These bugs can cause data loss on standbys started with hot_standby=on at the moment they start to accept read only queries, by marking committed transactions as uncommitted. The likelihood of such corruptions is small unless the primary has a high transaction rate.

    5a031a5556ff83b8a9646892715d7fef415b83c3 fixed bugs in HS's startup logic by maintaining less state until at least STANDBY_SNAPSHOT_PENDING state was reached, missing the fact that both clog and subtrans are written to before that. This only failed to fail in common cases because the usage of ExtendCLOG in procarray.c was superfluous, since clog extensions are actually WAL logged.

    f44eedc3f0f347a856eea8590730769125964597 then tried to fix the missing extensions of pg_subtrans due to the former commit's changes -- which are not WAL logged -- by performing the extensions when switching to a state > STANDBY_INITIALIZED and not performing xid assignments before that -- again missing the fact that ExtendCLOG is unnecessary -- but screwed up twice: once because latestObservedXid wasn't updated anymore in that state due to the earlier commit, and once by having an off-by-one error in the loop performing extensions. This means that whenever a CLOG_XACTS_PER_PAGE (32768 with default settings) boundary was crossed between the start of the checkpoint recovery started from and the first xl_running_xact record, old transactions' commit bits in pg_clog could be overwritten if they started and committed in that window.

    Fix this mess by not performing ExtendCLOG() in HS at all anymore, since it's unneeded and evidently dangerous, and by performing subtrans extensions even before reaching STANDBY_SNAPSHOT_PENDING.

Imo this warrants an expedited point release :(

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
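As a back-of-the-envelope illustration of the trigger condition (the two xids below are made-up values, not from any of the reported incidents): with default settings each pg_clog page covers 32768 transaction ids, so the dangerous window is crossed exactly when the xid at the checkpoint's redo point and the xid at the first xl_running_xacts record land on different clog pages:

    SELECT 1000123 / 32768 AS redo_xid_clog_page,
           1033000 / 32768 AS running_xacts_xid_clog_page,
           (1000123 / 32768) <> (1033000 / 32768) AS boundary_crossed;

Integer division gives the clog page number here; for these two values the pages are 30 and 31, so the boundary is crossed and commit bits of transactions in that window could be clobbered.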
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 15:20:01 +0100, Andres Freund wrote: > Imo something like the attached patch should be done. The description I came > up with is: > > Fix Hot-Standby initialization of clog and subtrans. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andrew Dunstan
Date:
On 11/19/2013 09:20 AM, Andres Freund wrote: > Imo this warrants an expedited point release :( I presume anyone who is vulnerable to it would need to recreate their secondary servers to get rid of potential problems? cheers andrew
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote: > > On 11/19/2013 09:20 AM, Andres Freund wrote: > >Imo this warrants an expedited point release :( > > > I presume anyone who is vulnerable to it would need to recreate their > secondary servers to get rid of potential problems? Yes. There are less expensive ways to do it, but those seem too complicated to suggest. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 19, 2013, at 6:59 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Yes. There are less expensive ways to do it, but those seem too complicated > to suggest. If this is something that could be built into a tool, acknowledging the complexity, I'd be happy to see about building it. -- -- Christophe Pettus xof@thebuild.com
On 11/19/2013 06:20 AM, Andres Freund wrote: > Hi, > > On 2013-11-18 23:15:59 +0100, Andres Freund wrote: >> Afaics it's likely a combination/interaction of bugs and fixes between: >> * the initial HS code >> * 5a031a5556ff83b8a9646892715d7fef415b83c3 >> * f44eedc3f0f347a856eea8590730769125964597 > > Yes, the combination of those is guilty. > > Man, this is (to a good part my) bad. > >> But that'd mean nobody noticed it during 9.3's beta... Ah, so this affected 9.3.0 as well? Maybe it's worth it now to devise some automated replication testing? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 09:51:28 -0800, Josh Berkus wrote: > On 11/19/2013 06:20 AM, Andres Freund wrote: > > Hi, > > > > On 2013-11-18 23:15:59 +0100, Andres Freund wrote: > >> Afaics it's likely a combination/interaction of bugs and fixes between: > >> * the initial HS code > >> * 5a031a5556ff83b8a9646892715d7fef415b83c3 > >> * f44eedc3f0f347a856eea8590730769125964597 > > > > Yes, the combination of those is guilty. > > > > Man, this is (to a good part my) bad. > > > >> But that'd mean nobody noticed it during 9.3's beta... > > Ah, so this affected 9.3.0 as well? Yes. > Maybe it's worth it now to devise some automated replication testing? It'd be a good idea, but I am not sure where to get resources for it from. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 11/19/2013 09:58 AM, Andres Freund wrote: > On 2013-11-19 09:51:28 -0800, Josh Berkus wrote: >> Maybe it's worth it now to devise some automated replication testing? > > It'd be a good idea, but I am not sure where to get resources for it > from. Well, servers aren't hard ... various cloud companies will donate them. What would be hard is writing the tests. I know how they should work, and the tools to use, but there's about 1,000 to 2,000 lines of code between here and there. (This is why we should have encouraged the folks from SFDC and Huawei who wanted to do testing, instead of blowing them off.) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 10:16:26 -0800, Josh Berkus wrote: > On 11/19/2013 09:58 AM, Andres Freund wrote: > > On 2013-11-19 09:51:28 -0800, Josh Berkus wrote: > >> Maybe it's worth it now to devise some automated replication testing? > > > > It'd be a good idea, but I am not sure where to get resources for it > > from. > > Well, servers isn't hard ... various cloud companies will donate them. Yea, I was talking about human/time/financial ones. > What would be hard is writing the tests. I know how they should work, > and the tools to use, but there's about 1,000 to 2,000 lines of code > between here and there. I am not sure how much code it's going to take (I'd hope somewhat less), but it certainly will take some time to agree how it should be built and then building and integrating it. > (This is why we should have encouraged the folks from SFDC and Huawei > who wanted to do testing, instead of blowing them off.) Haven't noticed/followed anything around that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 19, 2013, at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I am not sure how much code it's going to take (I'd hope somewhat less), > but it certainly will take some time to agree how it should be built and > then building and integrating it. Given that the situation we're in right now is that we have an unknown number of silently corrupt secondaries out there which will only be discovered when someone promotes them to being a primary (possibly because the current primary died without a backup), I'd say that this is something pretty urgent. -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 10:27:30 -0800, Christophe Pettus wrote: > > On Nov 19, 2013, at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I am not sure how much code it's going to take (I'd hope somewhat less), > > but it certainly will take some time to agree how it should be built and > > then building and integrating it. > > Given that the situation we're in right now is that we have an unknown > number of silently corrupt secondaries out there which will only be > discovered when someone promotes them to being a primary (possibly > because the current primary died without a backup), I'd say that this > is something pretty urgent. It's pretty unlikely that any automated testing would have caught this, the required conditions are too unlikely for that. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 19, 2013, at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote: > It's pretty unlikely that any automated testing would have caught this, > the required conditions are too unlikely for that. I would expect that "promote secondary while primary is under heavy load" is a clear-cut test case. What concerns me more is that we don't seem to have a framework to put in a regression test on the bug you just found (and thank you for finding it so quickly!). -- -- Christophe Pettus xof@thebuild.com
On 11/19/2013 10:29 AM, Andres Freund wrote: > It's pretty unlikely that any automated testing would have caught this, > the required conditions are too unlikely for that. Given our almost total lack of automated testing for replication, how would you (or anyone else) possibly know that? We certainly do know that we won't catch these kinds of bugs *without* testing. Heck, even MySQL AB had automated testing for replication. They largely ignored the results, but they had it. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote: > > On Nov 19, 2013, at 10:29 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > > It's pretty unlikely that any automated testing would have caught this, > > the required conditions are too unlikely for that. > > I would expect that "promote secondary while primary is under heavy > load" is a clear-cut test case. That's not sufficient though. It's e.g. very hard to reproduce the issue using the standard pgbench workload (not enough xids generated, too many hint bits). Note that the bug isn't caused by promotion, the problem occurs during the initial startup of a Hot-Standby standby. If the bug wasn't hit there, it won't be a problem at promotion. > What concerns me more is that we don't seem to have a framework to put > in a regression test on the bug you just found (and thank you for > finding it so quickly!). Agreed. But regarding it as a bad situation isn't fixing it, unfortunately. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 10:34:09 -0800, Josh Berkus wrote: > On 11/19/2013 10:29 AM, Andres Freund wrote: > > It's pretty unlikely that any automated testing would have caught this, > > the required conditions are too unlikely for that. > > Given our almost total lack of automated testing for replication, how > would you (or anyone else) possibly know that? I've tried reproducing it, and it's not easy. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 11/19/2013 10:40 AM, Andres Freund wrote: > On 2013-11-19 10:32:10 -0800, Christophe Pettus wrote: >> What concerns me more is that we don't seem to have a framework to put >> in a regression test on the bug you just found (and thank you for >> finding it so quickly!). > > Agreed. But regarding it as a bad situation isn't fixing it > unfortunately. Well, I happen to have some pieces of such a framework: the parts which can automate spinning up arbitrarily complex groups of replicas and doing failover between them. What we'd still need is: a) a slightly better workload than pgbench b) a way to compare and test databases for data corruption of several kinds Can someone else kick in to help with this? I think this last issue shows that it's critical as a community to have such a testing framework in place, otherwise we really need to halt all work on replication until we have such a thing. I can't see how you expect to complete streaming logical replication without a replication testing framework. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 10:43:14 -0800, Josh Berkus wrote: > I think this last issue shows that it's critical as a community to have > such a testing framework in place, otherwise we really need to halt all > work on replication until we have such a thing. > I can't see how you expect to complete streaming logical replication > without a replication testing framework. That's actually easier to test since you can relatively easily integrate it into pg_regress and isolationtester - thus the patchset actually contains tests. You seem to imply that I/we should do that work? That's a bit onesided, isn't it? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 11/19/2013 10:51 AM, Andres Freund wrote: > That's actually easier to test since you can relatively easily integrate > it into pg_regress and isolationtester - thus the patchset actually > contains tests. > You seem to imply that I/we should do that work? That's a bit onesided, > isn't it? Nope, just that you should be worried about it, and maybe shouting for help. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 19, 2013, at 10:51 AM, Andres Freund <andres@2ndquadrant.com> wrote: > You seem to imply that I/we should do that work? No, just that it be done. Of course, the more support from the professional PG community that is given to it, the better. -- -- Christophe Pettus xof@thebuild.com
On 11/19/2013 06:20 AM, Andres Freund wrote: > Imo this warrants an expedited point release :( Yes, I agree. I'd go further and say it's worth releasing source as soon as we have it on this one. What amount of testing were you able to give your patch? We're trying to arrange some testing on our side. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
Hi, Andres, From my understanding, the problem only occurs over streaming replication; if the secondary was never a hot standby, and only used the archived WAL segments, that would be safe. Is that correct? -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
Hi, On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote: > From my understanding, the problem only occurs over streaming > replication; if the secondary was never a hot standby, and only used > the archived WAL segments, that would be safe. Is that correct? Not entirely. It's related to a standby running with hot_standby=on. Both archive based and streaming replication can be used with hot_standby=on or off. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-19 15:41:58 -0800, Josh Berkus wrote: > On 11/19/2013 06:20 AM, Andres Freund wrote: > > Imo this warrants an expedited point release :( > > Yes, I agree. I'd go further and say it's worth releasing source as > soon as we have it on this one. > > What amount of testing were you able to give your patch? We're trying > to arrange some testing on our side. So far I've only tested whether it fixes the corruption I could reproduce with some effort and that pgbench tables are consistent on the primary and the standby. This certainly warrants another look from somebody that knows that code. Which afair is only Simon and Heikki? I'll also have another look after sleeping on it. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Nov 19, 2013, at 4:05 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi, > > On 2013-11-19 15:53:36 -0800, Christophe Pettus wrote: >> From my understanding, the problem only occurs over streaming >> replication; if the secondary was never a hot standby, and only used >> the archived WAL segments, that would be safe. Is that correct? > > Not entirely. > > It's related to a standby running with hot_standby=on. Both archive > based and streaming replication can be used with hot_standby=on or off. > So, does that mean that restoring from PITR based backups from tools like barman and wal-e could exhibit the same issue if hot_standby=on was in the postgresql.conf?
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 19.11.2013 16:22, Andres Freund wrote: > On 2013-11-19 15:20:01 +0100, Andres Freund wrote: >> Imo something the attached patch should be done. The description I came >> up with is: >> >> Fix Hot-Standby initialization of clog and subtrans. Looks ok for a back-patchable fix. It's a bit bizarre that the ExtendSUBTRANS loop in ProcArrayApplyRecoveryInfo looks different from the one in RecordKnownAssignedTransactionIds, but both look correct to me. In master, it'd be nice to do some further cleanup. Some gripes: In ProcArrayApplyXidAssignment, I wonder if it would be best to just remove the (standbyState == STANDBY_INITIALIZED) check altogether. The KnownAssignedXidsRemoveTree() that follows is harmless if there is nothing in the known-assigned-xids array (xact_redo_commit does it in STANDBY_INITIALIZED state too). The other thing that's done after that check is updating lastOverflowedXid, and AFAICS it would be harmless to update that, too. It will be overwritten by the ProcArrayApplyRecoveryInfo() call that comes later. Clog, subtrans and multixact are all handled differently. Extensions of clog and multixact are WAL-logged, but extensions of subtrans are not. They all have a Startup function, but it has a slightly different purpose. StartupCLOG only sets latest_page_number, but StartupSUBTRANS and StartupMultiXact zero out the current page. For CLOG, the TrimCLOG() function does that. Why is clog handled differently from multixact? StartupCLOG() and StartupMultiXact set latest_page_number, but StartupSUBTRANS does not. Is that a problem for subtrans? StartupCLOG() and StartupMultiXact() are called at different stages in hot standby - StartupCLOG() is called at the beginning of recovery, but StartupMultiXact() is only called at end of recovery. Why the discrepancy, does latest_page_number need to be set during hot standby or not? I think we should bite the bullet, and WAL-log the extension of subtrans, too. Then make the startup and extension procedure for all the SLRUs the same. - Heikki
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote:
> On 19.11.2013 16:22, Andres Freund wrote:
> >On 2013-11-19 15:20:01 +0100, Andres Freund wrote:
> >>Imo something like the attached patch should be done. The description I came
> >>up with is:
> >>
> >> Fix Hot-Standby initialization of clog and subtrans.
>
> Looks ok for a back-patchable fix.
>
> It's a bit bizarre that the ExtendSUBTRANS loop in
> ProcArrayApplyRecoveryInfo looks different from the one in
> RecordKnownAssignedTransactionIds, but both look correct to me.

That's because of the different upper bounds (nextXid vs. xid), but yea, I wondered about that as well.

> In master, it'd be nice to do some further cleanup. Some gripes:
>
> In ProcArrayApplyXidAssignment, I wonder if it would be best to just remove
> the (standbyState == STANDBY_INITIALIZED) check altogether. The
> KnownAssignedXidsRemoveTree() that follows is harmless if there is nothing
> in the known-assigned-xids array (xact_redo_commit does it in
> STANDBY_INITIALIZED state too). The other thing that's done after that check
> is updating lastOverflowedXid, and AFAICS it would be harmless to update
> that, too. It will be overwritten by the ProcArrayApplyRecoveryInfo() call
> that comes later.

I was thinking about removing it entirely in the patch, but chose not to do so. I don't really care which way we go.

> Clog, subtrans and multixact are all handled differently. Extensions of clog
> and multixact are WAL-logged, but extensions of subtrans are not. They all
> have a Startup function, but it has a slightly different purpose.
> StartupCLOG only sets latest_page_number, but StartupSUBTRANS and
> StartupMultiXact zero out the current page. For CLOG, the TrimCLOG()
> function does that. Why is clog handled differently from multixact?

I'd guess clog and multixact are handled differently because multixact supposedly is never queried during recovery. But I don't think that's actually still true, thinking of 9.3's changes around fkey locks and HeapTupleGetUpdateXid(). So it's probably time to split StartupMultiXact similar to clog's routines.

> StartupCLOG() and StartupMultiXact set latest_page_number, but
> StartupSUBTRANS does not. Is that a problem for subtrans?

I don't think it is, the difference is that StartupSUBTRANS() zeroes out the current subtrans page which will set latest_page_number, the others access the pages normally, which doesn't set it.

> StartupCLOG() and
> StartupMultiXact() are called at different stages in hot standby -
> StartupCLOG() is called at the beginning of recovery, but StartupMultiXact()
> is only called at end of recovery. Why the discrepancy, does
> latest_page_number need to be set during hot standby or not?

latest_page_number primarily is an optimization, isn't it? Except for a safeguard check in SimpleLruTruncate() it should only influence victim buffer initialization.

But: slru.c explicitly doesn't initialize ->latest_page_number, which means it's zeroed from a memset slightly above. Which seems like we might cause problems when performing truncations during recovery, before the first page is zeroed (which'd set latest_page_number again).

... Hm. Do we actually *ever* truncate the multixact slru during recovery? clog.c's truncations are WAL logged, TruncateSUBTRANS() is performed by restartpoints, but there's no callers to TruncateMultiXact but vac_truncate_clog and it's not logged? That doesn't seem right.

> I think we should bite the bullet, and WAL-log the extension of subtrans, too.
> Then make the startup and extension procedure for all the SLRUs the same.

Hm. I don't really see a big advantage in that? I am all for trying to bring more symmetry to the startup routines, but I don't think forcing WAL logging for something scrapped every restart is necessary for that.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-19 09:33:34 -0500, Andrew Dunstan wrote: >> >> On 11/19/2013 09:20 AM, Andres Freund wrote: >> >Imo this warrants an expedited point release :( +1 >> I presume anyone who is vulnerable to it would need to recreate >> their secondary servers to get rid of potential problems? > > Yes. There are less expensive ways to do it, but those seem too > complicated to suggest. Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing of everything that doesn't yet have hint bits set? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote: > > Yes. There are less expensive ways to do it, but those seem too > > complicated to suggest. > > Wouldn't a database VACUUM FREEZE fix it, with WAL-logged writing > of everything that doesn't yet have hint bits set? Besides also being pretty expensive it still wouldn't correct the clog - and we don't always rely on hint bits. I was thinking about just copying over the clog from the primary, but it's not trivial if the standby isn't caught up, since the primary's clog could have been truncated ahead of what the standby needs. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote: >> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged >> writing of everything that doesn't yet have hint bits set? > > Besides also being pretty expensive it still wouldn't correct the > clog - and we don't always rely on hint bits. I'm talking about after a fix is deployed, fixing up the possible corruption. Can you explain where VACUUM FREEZE would not suffice? I don't know of anywhere that we have hint bits set for a tuple and we go fetch the clog bits in spite of that. I don't understand where that would make sense; especially since I thought that a database FREEZE followed by a checkpoint releases old clog space anyway. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote: > Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-11-20 05:30:39 -0800, Kevin Grittner wrote: > > >> Wouldn't a database VACUUM FREEZE fix it, with WAL-logged > >> writing of everything that doesn't yet have hint bits set? > > > > Besides also being pretty expensive it still wouldn't correct the > > clog - and we don't always rely on hint bits. > > I'm talking about after a fix is deployed, fixing up the possible > corruption. Can you explain where VACUUM FREEZE would not suffice? > I don't know of anywhere that we have hint bits set for a tuple and > we go fetch the clog bits in spite of that. There's several places. Grep for TransactionIdDidCommit() and ignore the bits in tqual.c. Many of the remaining ones do not look at hint bits. > I don't understand > where that would make sense; especially since I thought that a > database FREEZE followed by a checkpoint releases old clog space > anyway. It only releases them up to the (cluster wide) xmin horizon. So if there are older snapshots or prepared xacts around... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote: >> I don't understand where that would make sense; especially since >> I thought that a database FREEZE followed by a checkpoint >> releases old clog space anyway. > > It only releases them up to the (cluster wide) xmin horizon. So > if there are older snapshots or prepared xacts around... So as long as there are no open transactions or prepared transactions on the master which started before the release with the fix is applied, VACUUM FREEZE would be guaranteed to work? Since I don't see how a non-prepared transaction would be running from before a minor release upgrade, that just means we have to make sure there are no prepared transactions from before the upgrade? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
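The prepared-transaction condition Kevin raises can be checked directly on the master; any row whose "prepared" timestamp predates the minor-version restart would still hold back the xmin horizon:

    SELECT gid, prepared, owner, database
    FROM pg_prepared_xacts
    ORDER BY prepared;

An empty result, or only rows prepared after the restart, is what you want to see before relying on the VACUUM FREEZE approach.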
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-11-20 05:59:58 -0800, Kevin Grittner wrote:
>
> >> I don't understand where that would make sense; especially since
> >> I thought that a database FREEZE followed by a checkpoint
> >> releases old clog space anyway.
> >
> > It only releases them up to the (cluster wide) xmin horizon. So
> > if there are older snapshots or prepared xacts around...
>
> So as long as there are no open transactions or prepared
> transactions on the master which started before the release with
> the fix is applied, VACUUM FREEZE would be guaranteed to work?
> Since I don't see how a non-prepared transaction would be running
> from before a minor release upgrade, that just means we have to
> make sure there are no prepared transactions from before the
> upgrade?

That's not a bad point. So the way to fix it would be:

1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Acquire enough new xids to make sure we cross a clog page (?)
4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
5) vacuumdb -z -a
6) Ensure that there are no prepared xacts older than 4) around:
   SELECT *
   FROM pg_prepared_xacts
   ORDER BY age(transaction) DESC LIMIT 1;
7) Ensure the xmin horizon is above the one from 4):
   SELECT datname, datfrozenxid
   FROM pg_database
   WHERE datname != 'template0'
   ORDER BY age(datfrozenxid) DESC LIMIT 1;
8) Get the current lsn: SELECT pg_current_xlog_location();
9) Verify on each standby that SELECT pg_last_xlog_receive_location() is past 8)
10) Be happy

I am not sure how we can easily compute that 6) and 7) are past 4) in the presence of xid wraparounds.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
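For step 3, one possible way to burn through enough xids from SQL alone (a sketch, not something from the thread): each exception block below opens a subtransaction, and the INSERT forces a transaction id to be assigned to it, so the loop consumes roughly 40,000 xids -- comfortably more than one 32768-xid clog page:

    DO $$
    BEGIN
        CREATE TEMP TABLE xid_burner (i int);
        FOR n IN 1..40000 LOOP
            BEGIN
                -- each BEGIN/EXCEPTION block is a subtransaction; the INSERT
                -- makes it acquire its own transaction id
                INSERT INTO xid_burner VALUES (n);
            EXCEPTION WHEN OTHERS THEN
                NULL;
            END;
        END LOOP;
        DROP TABLE xid_burner;
    END
    $$;

Running a batch of ordinary write transactions from the application would obviously do the same job; the point is only that more than one clog page's worth of xids has to be consumed.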
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Kevin Grittner
Date:
Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
>> So as long as there are no open transactions or prepared
>> transactions on the master which started before the release with
>> the fix is applied, VACUUM FREEZE would be guaranteed to work?
>> Since I don't see how a non-prepared transaction would be running
>> from before a minor release upgrade, that just means we have to
>> make sure there are no prepared transactions from before the
>> upgrade?
>
> That's not a bad point. So the way to fix it would be:
>
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Acquire enough new xids to make sure we cross a clog page (?)
> 4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
> 5) vacuumdb -z -a
> 6) Ensure that there are no prepared xacts older than 4) around:
>    SELECT *
>    FROM pg_prepared_xacts
>    ORDER BY age(transaction) DESC LIMIT 1;
> 7) Ensure the xmin horizon is above the one from 4):
>    SELECT datname, datfrozenxid
>    FROM pg_database
>    WHERE datname != 'template0'
>    ORDER BY age(datfrozenxid) DESC LIMIT 1;
> 8) Get the current lsn: SELECT pg_current_xlog_location();
> 9) Verify on each standby that SELECT pg_last_xlog_receive_location() is past 8)
> 10) Be happy
>
> I am not sure how we can easily compute that 6) and 7) are past 4) in
> the presence of xid wraparounds.

I may well be missing something here, but wouldn't it be sufficient to?:

1) Restart the standby to the new minor release, wait for catchup
2) Restart the primary (fast or smart) to the new minor release
3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
4) Run CHECKPOINT command on primary, or just wait for one to run
5) Wait for standby to process to the checkpoint
6) Be happy

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
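A sketch of steps 3-5 in SQL: run the VACUUM in every database on the primary (`vacuumdb --all --freeze --analyze` from the shell does the same), then checkpoint once and note the primary's WAL position:

    -- In each database on the primary:
    VACUUM (FREEZE, ANALYZE);

    -- Once, on the primary:
    CHECKPOINT;
    SELECT pg_current_xlog_location();      -- note the LSN returned

    -- On each standby, confirm it has received WAL past the noted LSN:
    SELECT pg_last_xlog_receive_location();

Nothing here goes beyond what the numbered steps already say; it is just the same procedure written out as statements.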
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 07:06:04 -0800, Kevin Grittner wrote:
> > That's not a bad point. So the way to fix it would be:
> >
> > 1) Restart the standby to the new minor release, wait for catchup
> > 2) Restart the primary (fast or smart) to the new minor release
> > 3) Acquire enough new xids to make sure we cross a clog page (?)
> > 4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
> > 5) vacuumdb -z -a
> > 6) Ensure that there are no prepared xacts older than 4) around:
> >    SELECT *
> >    FROM pg_prepared_xacts
> >    ORDER BY age(transaction) DESC LIMIT 1;
> > 7) Ensure the xmin horizon is above the one from 4):
> >    SELECT datname, datfrozenxid
> >    FROM pg_database
> >    WHERE datname != 'template0'
> >    ORDER BY age(datfrozenxid) DESC LIMIT 1;
> > 8) Get the current lsn: SELECT pg_current_xlog_location();
> > 9) Verify on each standby that SELECT pg_last_xlog_receive_location() is past 8)
> > 10) Be happy
> >
> > I am not sure how we can easily compute that 6) and 7) are past 4) in
> > the presence of xid wraparounds.
>
> I may well be missing something here, but wouldn't it be sufficient to?:
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
> 4) Run CHECKPOINT command on primary, or just wait for one to run
> 5) Wait for standby to process to the checkpoint
> 6) Be happy

Well, in some cases it might. But what if there's a prepared xact around? Or a transaction started directly after 2) preventing FreezeLimit from going up? Or vacuum_defer_cleanup_age is set? Or there's another bug like 4c697d8f4845823a8af67788b219ffa4516ad14c?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 20.11.2013 17:06, Kevin Grittner wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-11-20 06:21:13 -0800, Kevin Grittner wrote:
>
>>> So as long as there are no open transactions or prepared
>>> transactions on the master which started before the release with
>>> the fix is applied, VACUUM FREEZE would be guaranteed to work?
>>> Since I don't see how a non-prepared transaction would be running
>>> from before a minor release upgrade, that just means we have to
>>> make sure there are no prepared transactions from before the
>>> upgrade?
>>
>> That's not a bad point. So the way to fix it would be:
>>
>> 1) Restart the standby to the new minor release, wait for catchup
>> 2) Restart the primary (fast or smart) to the new minor release
>> 3) Acquire enough new xids to make sure we cross a clog page (?)
>> 4) Jot down a new xid: SELECT txid_current()::bigint % (1::bigint<<33-1)
>> 5) vacuumdb -z -a
>> 6) Ensure that there are no prepared xacts older than 4) around:
>>    SELECT *
>>    FROM pg_prepared_xacts
>>    ORDER BY age(transaction) DESC LIMIT 1;
>> 7) Ensure the xmin horizon is above the one from 4):
>>    SELECT datname, datfrozenxid
>>    FROM pg_database
>>    WHERE datname != 'template0'
>>    ORDER BY age(datfrozenxid) DESC LIMIT 1;
>> 8) Get the current lsn: SELECT pg_current_xlog_location();
>> 9) Verify on each standby that SELECT pg_last_xlog_receive_location() is past 8)
>> 10) Be happy
>>
>> I am not sure how we can easily compute that 6) and 7) are past 4) in
>> the presence of xid wraparounds.
>
> I may well be missing something here, but wouldn't it be sufficient to?:
> 1) Restart the standby to the new minor release, wait for catchup
> 2) Restart the primary (fast or smart) to the new minor release
> 3) Run VACUUM FREEZE (optionally with ANALYZE) in each database on primary
> 4) Run CHECKPOINT command on primary, or just wait for one to run
> 5) Wait for standby to process to the checkpoint
> 6) Be happy

Isn't it possible that the standby has already incorrectly set the HEAP_XMIN_INVALID hint bit on a page? The full page images generated by VACUUM FREEZE will correct the damage, but if not, e.g. because full_page_writes=off, strange things will happen.

Personally, I wouldn't trust anything less than a new base backup.

- Heikki
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote: > Isn't it possible that the standby has already incorrectly set > HEAP_XMIN_INVALID hint bit on a page? The full page images generated by > VACUUM FREEZE will correct the damage, but if not, e.g. because > full_page_writes=off, strange things will happen. The xlog_heap_freeze records should repair that afaics. > Personally, I wouldn't trust anything less than a new base backup. But I still tend to agree. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andrews, Kevin: Presumably a replica created while all traffic was halted on the master would be clean, correct? This bug can only be triggered if there's heavy write load on the master, right? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On 11/20/2013 10:30 AM, Josh Berkus wrote: > Andrews, Kevin: Andres, that is. > > Presumably a replica created while all traffic was halted on the master > would be clean, correct? This bug can only be triggered if there's > heavy write load on the master, right? > Also, just to verify: If someone is doing PITR based on a snapshot taken with pg_basebackup, that will only trip this corruption bug if the user has hot_standby=on in their config *while restoring*? Or is it critical if they have hot_standby=on while backing up? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 10:48:41 -0800, Josh Berkus wrote:
> > Presumably a replica created while all traffic was halted on the master
> > would be clean, correct? This bug can only be triggered if there's
> > heavy write load on the master, right?

Kinda. It's unfortunately necessary to understand how HS works to some degree: Everytime a server is (re-)started with a recovery.conf present and hot_standby=on (be it streaming, archive based replication or PITR) the Hot Standby code is used. (Crash|Replication)-Recovery starts by reading the last checkpoint (from pg_control or, if present, backup.label) and then replays WAL from the 'redo' point included in the checkpoint. The bug then occurs when it first (or, in some cases, the second time) replays a 'xl_running_xacts' record. That's used to reconstruct information needed to allow queries.

Everytime the server in HS mode allows connections ("consistent recovery state reached at ..." and "database system is ready to accept read only connections" in the log), the bug can be triggered. If there weren't too many transactions at that point, the problem won't occur until the standby is restarted.

> If someone is doing PITR based on a snapshot taken with pg_basebackup,
> that will only trip this corruption bug if the user has hot_standby=on
> in their config *while restoring*? Or is it critical if they have
> hot_standby=on while backing up?

hot_standby=on only has an effect while starting up with a recovery.conf present. So, if you have an old base backup around and all WAL files, you can start from that.

Does that answer your questions?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres, > Every time the server in HS mode allows connections ("consistent recovery state > reached at ..." and "database system is ready to accept read only > connections" in the log), the bug can be triggered. If there weren't too > many transactions at that point, the problem won't occur until the > standby is restarted. Oh, so this doesn't just happen when the base backup is first taken; *any* time the standby is restarted, it can happen. (!!!) >> If someone is doing PITR based on a snapshot taken with pg_basebackup, >> that will only trip this corruption bug if the user has hot_standby=on >> in their config *while restoring*? Or is it critical if they have >> hot_standby=on while backing up? > > hot_standby=on only has an effect while starting up with a recovery.conf > present. So, if you have an old base backup around and all WAL files, > you can start from that. > > Does that answer your questions? Yeah, thanks. If you have any ideas for how we'd write code to scan for this kind of corruption, please post them. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 15:52:22 -0800, Josh Berkus wrote: > Andres, > > > Every time the server in HS mode allows connections ("consistent recovery state > > reached at ..." and "database system is ready to accept read only > > connections" in the log), the bug can be triggered. If there weren't too > > many transactions at that point, the problem won't occur until the > > standby is restarted. > > Oh, so this doesn't just happen when the base backup is first taken; > *any* time the standby is restarted, it can happen. (!!!) Yes. > If you have any ideas for how we'd write code to scan for this kind of > corruption, please post them. I don't really have one. Current corruption would be somewhat easy to detect (walk through the clog, check if all commit bits match), but that doesn't detect whether already truncated clog was corrupted. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
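Lacking a clog-walking tool, one crude check that would at least have caught the missing and duplicated rows reported in this thread is to run the same per-table count against the primary and the standby (ideally with writes paused) and diff the output. A sketch, with DBNAME and connection options as placeholders, and with no claim that it detects every form of this corruption:

    # Run the same loop against the primary and the standby, then diff the output.
    psql -At -d DBNAME -c "SELECT quote_ident(schemaname) || '.' || quote_ident(tablename)
                           FROM pg_tables
                           WHERE schemaname NOT IN ('pg_catalog', 'information_schema')" |
    while read tbl; do
        echo -n "$tbl "
        psql -At -d DBNAME -c "SELECT count(*) FROM $tbl"
    done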
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Christophe Pettus
Date:
On Nov 20, 2013, at 3:57 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-20 15:52:22 -0800, Josh Berkus wrote: >> Oh, so this doesn't just happen when the base backup is first taken; >> *any* time the standby is restarted, it can happen. (!!!) > > Yes. So, to be completely clear, any secondary running the affected versions which is started with hot_standby=on could potentially be corrupted even if it never connects to a primary? -- -- Christophe Pettus xof@thebuild.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 16:36:46 -0800, Christophe Pettus wrote: > > On Nov 20, 2013, at 3:57 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > > On 2013-11-20 15:52:22 -0800, Josh Berkus wrote: > >> Oh, so this doesn't just happen when the base backup is first taken; > >> *any* time the standby is restarted, it can happen. (!!!) > > > > Yes. > > So, to be completely clear, any secondary running the affected > versions which is started with hot_standby=on could potentially be > corrupted even if it never connects to a primary? Yes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-20 17:46:22 +0100, Andres Freund wrote: > On 2013-11-20 18:25:56 +0200, Heikki Linnakangas wrote: > > Isn't it possible that the standby has already incorrectly set > > HEAP_XMIN_INVALID hint bit on a page? The full page images generated by > > VACUUM FREEZE will correct the damage, but if not, e.g. because > > full_page_writes=off, strange things will happen. > > The xlog_heap_freeze records should repair that afaics. Nope, it wouldn't, it just uses heap_freeze_tuple() again which *only* sets HEAP_XMIN_COMMITTED but does *not* clear HEAP_XMIN_INVALID. And if we were to change that now, it wouldn't help already frozen tuples that wouldn't get frozen again. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-11-19 15:20:01 +0100, Andres Freund wrote: >> Imo something like the attached patch should be done. The description I came >> up with is: >> >> Fix Hot-Standby initialization of clog and subtrans. > G'day Andres. This wouldn't happen to be related to the issue I was having with missing pg_subtrans files, would it, as discussed here http://www.postgresql.org/message-id/CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com ? Based on my reading of the commit message and what I could glean from the patch itself, they seem to be related. If there's anything I can do to help test, just say the word and I can fit in some time. Cheers.
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-21 10:25:20 -0500, J Smith wrote: > On Tue, Nov 19, 2013 at 9:22 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-11-19 15:20:01 +0100, Andres Freund wrote: > >> Imo something like the attached patch should be done. The description I came > >> up with is: > >> > >> Fix Hot-Standby initialization of clog and subtrans. > > > > G'day Andres. > > This wouldn't happen to be related to the issue I was having with > missing pg_subtrans files, would it, as discussed here > http://www.postgresql.org/message-id/CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com > ? Based on my reading of the commit message and what I could glean > from the patch itself, they seem to be related. I don't think so - for one, pg_subtrans isn't really the problem with that bug, for another, it won't cause missing files. Also, you're not using replication, right? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Tatsuo Ishii
Date:
> Well, I happen to have some pieces of such a framework: the parts which > can automate spinning up arbitrarily complex groups of replicas and > doing failover between them. What we'd still need is: > > a) a slightly better workload than pgbench > b) a way to compare and test databases for data corruption of several kinds > > Can someone else kick in to help with this? Recently pgpool-II introduced its own regression test framework, written in shell script, which can create a streaming replication primary and an arbitrary number of standbys on a single server. The test scenarios are written as shell scripts, so they are reasonably flexible. I have read through the whole thread, but I am not sure how to reproduce the problem reliably. If I knew how, I could probably integrate it into the regression tests. > I think this last issue shows that it's critical as a community to have > such a testing framework in place, otherwise we really need to halt all > work on replication until we have such a thing. Agreed. I think this is very critical for us. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I don't think so - for one, pg_subtrans isn't really the problems with > that bug, for another, it won't cause missing files. Also, you're not > using replication, right? > Actually, this server is a master being replicated to a standby. We haven't experienced the problem on the standby, but we are indeed replicating the server experiencing the issue.
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-21 11:54:58 -0500, J Smith wrote: > On Thu, Nov 21, 2013 at 10:27 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > > > I don't think so - for one, pg_subtrans isn't really the problems with > > that bug, for another, it won't cause missing files. Also, you're not > > using replication, right? > > > > Actually, this server is a master being replicated to a standby. We > haven't experienced the problem on the standby, but we are indeed > replicating the server experiencing the issue. It's still not this issue in that case, but I might have an idea... Do you have hot_standby_feedback enabled? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 21, 2013 at 12:23 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > It's still not this issue in that case, but I might have an idea... Do > you have hot_standby_feedback enabled? > Nope, hot_standby_feedback is set to its default setting which is off. At any rate, I'm going to try and capture a backtrace from the error I'm experiencing as suggested in the other thread by Robert Haas and we'll see if that provides any insight. Cheers
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
"Joshua D. Drake"
Date:
Hello, This is turning into a rather large thread and I have a simple question: Is a work-around to this problem as simple as disabling streaming replication and enabling log shipping instead? Joshua D. Drake -- Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc For my dreams of your image that blossoms a rose in the deeps of my heart. - W.B. Yeats
On 11/21/2013 12:36 PM, Joshua D. Drake wrote: > > Hello, > > This is turning into a rather large thread and I have a simple question: > > Is a work-around to this problem as simple as disabling streaming > replication and enabling log shipping instead? Yes, and re-cloning the replica, in case it already has corruption. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-21 12:39:54 -0800, Josh Berkus wrote: > On 11/21/2013 12:36 PM, Joshua D. Drake wrote: > > > > Hello, > > > > This is turning into a rather large thread and I have a simple question: > > > > Is a work-around to this problem as simple as disabling streaming > > replication and enabling log shipping instead? No. Check http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de The problem is starting with hot_standby=on on a system with recovery.conf present. It is independent of whether you use streaming replication, archive based recovery, or just shut down the server and manually copy xlog segments there. As long as hot_standby=on, and recovery.conf is present you can hit the bug. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
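Until the fixed minor releases are installed, the only mitigation that follows from this is to stop running the standby with hot_standby=on (giving up read-only queries there) and to re-clone anything that may already be damaged. A sketch, assuming $PGDATA points at the standby's data directory:

    # On each standby: later settings win in postgresql.conf, so an appended
    # line overrides an earlier hot_standby = on.
    echo "hot_standby = off" >> "$PGDATA"/postgresql.conf
    pg_ctl -D "$PGDATA" restart -m fast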
On 11/21/2013 12:46 PM, Andres Freund wrote: > No. Check > http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de > > The problem is starting with hot_standby=on on a system with > recovery.conf present. It is independent of whether you use streaming > replication, archive based recovery, or just shut down the server and > manually copy xlog segments there. > As long as hot_standby=on, and recovery.conf is present you can hit the > bug. Oh, aha. There have to be some transactions which are awaiting checkpoint, though, correct? As in, if there's no activity on the master, you can't trigger the bug? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-21 12:51:17 -0800, Josh Berkus wrote: > On 11/21/2013 12:46 PM, Andres Freund wrote: > > No. Check > > http://archives.postgresql.org/message-id/20131120234141.GI18801%40awork2.anarazel.de > > > > The problem is starting with hot_standby=on on a system with > > recovery.conf present. It is independent of whether you use streaming > > replication, archive based recovery, or just shut down the server and > > manually copy xlog segments there. > > As long as hot_standby=on, and recovery.conf is present you can hit the > > bug. > > Oh, aha. There have to be some transactions which are awaiting > checkpoint, though, correct? As in, if there's no activity on the > master, you can't trigger the bug? Correct. Also, if you *start* at such a checkpoint you're not vulnerable until the standby is restarted. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
Hi, On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote: > On 19.11.2013 16:22, Andres Freund wrote: > >On 2013-11-19 15:20:01 +0100, Andres Freund wrote: > >>Imo something like the attached patch should be done. The description I came > >>up with is: > >> > >> Fix Hot-Standby initialization of clog and subtrans. > > Looks ok for a back-patchable fix. Do you plan to commit this? Or who is going to? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 21.11.2013 22:53, Andres Freund wrote: > On 2013-11-21 12:51:17 -0800, Josh Berkus wrote: >> On 11/21/2013 12:46 PM, Andres Freund wrote: >>> The problem is starting with hot_standby=on on a system with >>> recovery.conf present. It is independent of whether you use streaming >>> replication, archive based recovery, or just shut down the server and >>> manually copy xlog segments there. >>> As long as hot_standby=on, and recovery.conf is present you can hit the >>> bug. >> >> Oh, aha. There have to be some transactions which are awaiting >> checkpoint, though, correct? As in, if there's no activity on the >> master, you can't trigger the bug? > > Correct. Also, if you *start* at such a checkpoint you're not vulnerable > until the standby is restarted. Keep in mind that autovacuum counts as "activity" in this case. If you're unlucky, that is. It's next to impossible to determine after-the-fact if there has been activity in the master that might've caused problems. If you have ever set hot_standby=on in your standby server, running one of the affected versions, you're at risk. The standby might be corrupt, and should be rebuilt from a base backup. The higher the transaction rate in the master, the higher the risk. I wouldn't try to narrow it down any further than that, it gets too complicated. - Heikki
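Spelling out the "new base backup" advice as commands, a rough rebuild sequence for a streaming standby; the host name and replication user are assumptions, --xlog-method=stream needs 9.2 or later, and the -R option (which writes recovery.conf) exists only in 9.3, so older versions have to recreate recovery.conf by hand:

    # On the standby host: discard the possibly-corrupt cluster and re-clone it.
    pg_ctl -D "$PGDATA" stop -m fast
    rm -rf "$PGDATA"/*
    pg_basebackup -h primary.example.com -U replication -D "$PGDATA" \
        --xlog-method=stream -R -P
    pg_ctl -D "$PGDATA" start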
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 21.11.2013 22:55, Andres Freund wrote: > Hi, > > On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote: >> On 19.11.2013 16:22, Andres Freund wrote: >>> On 2013-11-19 15:20:01 +0100, Andres Freund wrote: >>>> Imo something like the attached patch should be done. The description I came >>>> up with is: >>>> >>>> Fix Hot-Standby initialization of clog and subtrans. >> >> Looks ok for a back-patchable fix. > > Do you plan to commit this? Or who is going to? I can commit it tomorrow, unless someone beats me to it. - Heikki
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-21 23:02:29 +0200, Heikki Linnakangas wrote: > On 21.11.2013 22:53, Andres Freund wrote: > >On 2013-11-21 12:51:17 -0800, Josh Berkus wrote: > >>On 11/21/2013 12:46 PM, Andres Freund wrote: > >>>The problem is starting with hot_standby=on on a system with > >>>recovery.conf present. It is independent of whether you use streaming > >>>replication, archive based recovery, or just shut down the server and > >>>manually copy xlog segments there. > >>>As long as hot_standby=on, and recovery.conf is present you can hit the > >>>bug. > >> > >>Oh, aha. There have to be some transactions which are awaiting > >>checkpoint, though, correct? As in, if there's no activity on the > >>master, you can't trigger the bug? > > > >Correct. Also, if you *start* at such a checkpoint you're not vulnerable > >until the standby is restarted. > > Keep in mind that autovacuum counts as "activity" in this case. If you're > unlucky, that is. It's next to impossible to determine after-the-fact if > there has been activity in the master that might've caused problems. Well, in that case you're going to "just" lose the pg_database/pg_class/stats updates from analyze/vacuum. That's annoying, but not too bad. > I wouldn't try to narrow it down any further than that, it gets too > complicated. Yes. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 19.11.2013 16:20, Andres Freund wrote: > On 2013-11-18 23:15:59 +0100, Andres Freund wrote: >> Afaics it's likely a combination/interaction of bugs and fixes between: >> * the initial HS code >> * 5a031a5556ff83b8a9646892715d7fef415b83c3 >> * f44eedc3f0f347a856eea8590730769125964597 > > Yes, the combination of those is guilty. > > Man, this is (to a good part my) bad. > >> But that'd mean nobody noticed it during 9.3's beta... > > It's fairly hard to reproduce artificially since a) there have to be > enough transactions starting and committing from the start of the > checkpoint the standby is starting from to the point it does > LogStandbySnapshot() to cross a 32768 boundary b) hint bits often save > the game by not accessing clog at all anymore and thus not noticing the > corruption. > I've reproduced the issue by having an INSERT ONLY table that's never > read from. It's helpful to disable autovacuum. For the archive, here's what I used to reproduce this. It creates a master and a standby, and also uses an INSERT only table. To make it trigger more easily, it helps to insert sleeps in CreateCheckpoint(), around the LogStandbySnapshot() call. - Heikki
Attachment
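Heikki's actual script is in the attachment above; for readers who cannot retrieve it, a rough sketch of the shape he describes, targeting 9.3 (paths, ports, and the transaction count are assumptions, and reliably hitting the bug still depends on timing, e.g. the sleeps in CreateCheckpoint() he mentions):

    # Master with an INSERT-only table, then a hot standby cloned from it.
    initdb -D master
    echo "wal_level = hot_standby"     >> master/postgresql.conf
    echo "max_wal_senders = 3"         >> master/postgresql.conf
    echo "autovacuum = off"            >> master/postgresql.conf
    echo "local replication all trust" >> master/pg_hba.conf
    pg_ctl -D master -o "-p 5432" -w start
    psql -p 5432 -d postgres -c "CREATE TABLE t (id serial PRIMARY KEY, payload text)"

    # Many small committing transactions that are never read back, so hint
    # bits don't hide clog damage; more than 32768 of them so a clog page
    # boundary is crossed while the standby is set up and restarted.
    for i in $(seq 1 40000); do
        psql -p 5432 -d postgres -qc "INSERT INTO t (payload) VALUES ('x')"
    done &

    pg_basebackup -p 5432 -D standby --xlog-method=stream -R
    echo "hot_standby = on" >> standby/postgresql.conf
    pg_ctl -D standby -o "-p 5433" -w start
    # Once the standby logs "consistent recovery state reached", restart it a
    # few times while the insert loop is still running, then compare
    # SELECT count(*) FROM t on ports 5432 and 5433.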
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Heikki Linnakangas
Date:
On 21.11.2013 22:55, Andres Freund wrote: > On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote: >> On 19.11.2013 16:22, Andres Freund wrote: >>> On 2013-11-19 15:20:01 +0100, Andres Freund wrote: >>>> Imo something like the attached patch should be done. The description I came >>>> up with is: >>>> >>>> Fix Hot-Standby initialization of clog and subtrans. >> >> Looks ok for a back-patchable fix. > > Do you plan to commit this? Or who is going to? Ok, committed. - Heikki
Re: Data corruption issues using streaming replication on 9.0.14/9.2.5/9.3.1
From
Andres Freund
Date:
On 2013-11-22 15:01:10 +0200, Heikki Linnakangas wrote: > On 21.11.2013 22:55, Andres Freund wrote: > >On 2013-11-20 12:48:50 +0200, Heikki Linnakangas wrote: > >>Looks ok for a back-patchable fix. > > > >Do you plan to commit this? Or who is going to? > > Ok, committed. Thanks! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services