Thread: Problem after replication switchover

Problem after replication switchover

From
Lars Arvidson
Date:
Hi, I hope I'm mailing this to the correct mailing list.
We get errors inserting into a table:
2016-04-04 07:27:51 CEST [43342-2] <someuser>@<db1> ERROR:  could not read block 28991 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:27:51 CEST [43342-3] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:28:27 CEST [43342-4] <someuser>@<db1> ERROR:  could not read block 28992 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:28:27 CEST [43342-5] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:28:50 CEST [43342-6] <someuser>@<db1> ERROR:  could not read block 28993 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:28:50 CEST [43342-7] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:29:06 CEST [44085-1] <someuser>@<db1> ERROR:  could not read block 28994 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:29:06 CEST [44085-2] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:29:07 CEST [44085-3] <someuser>@<db1> ERROR:  could not read block 28995 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:29:07 CEST [44085-4] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:29:33 CEST [44085-5] <someuser>@<db1> ERROR:  could not read block 28996 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:29:33 CEST [44085-6] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:29:47 CEST [44085-7] <someuser>@<db1> ERROR:  could not read block 28997 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:29:47 CEST [44085-8] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:30:04 CEST [43267-2] <someuser>@<db1> ERROR:  could not read block 28998 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:30:04 CEST [43267-3] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:30:12 CEST [43267-4] <someuser>@<db1> ERROR:  could not read block 28999 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:30:12 CEST [43267-5] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:31:08 CEST [44063-1] <someuser>@<db1> ERROR:  could not read block 29000 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:31:08 CEST [44063-2] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"
2016-04-04 07:31:11 CEST [44063-3] <someuser>@<db1> ERROR:  could not read block 29001 in file "base/16390/572026": read only 0 of 8192 bytes
2016-04-04 07:31:11 CEST [44063-4] <someuser>@<db1> STATEMENT:  INSERT INTO <table_12> (...) VALUES (...) RETURNING "id"

We use streaming (asynchronous) replication and do a controlled, automated switchover (the master is demoted to slave and the previous slave promoted to master) each week. Most times this works like a charm, but a few times after the switchover inserts into one table fail (see the log extract above). Aside from that error the application runs as normal and replication is working. The last two times this has happened I have triggered a switch back to the original master and everything runs fine.

The switchover is done in the following steps assuming cluster1 is current master and cluster2 is the slave we want to
promote:
1. Make sure cluster1 is not in recovery mode.
2. Make sure cluster2 is in recovery mode.
3. Pause all new database activity in the web application and wait a few seconds to give current work a chance to
finish.
4. Create a recovery.conf file on cluster1 "pointing" to cluster2 (current slave).
5. Run "pg_ctlcluster -m fast ${CLUSTER_VERSION} ${CLUSTER_NAME} restart" on cluster1 and wait for it to restart.
6. Make sure cluster1 is in recovery mode (the clusters are now replicating from each other).
7. Wait for "SELECT pg_xlog_location_diff(pg_last_xlog_replay_location(), replay_location) FROM pg_stat_replication WHERE application_name = 's2s-replication'" to return "0" on both servers. This should make sure cluster2 is up to date and synchronized with cluster1.
8. Run "pg_ctlcluster ${CLUSTER_VERSION} ${CLUSTER_NAME} promote" on cluster2.
9. Wait for "SELECT pg_is_in_recovery()" to return "f" on cluster2.
10. Tell the application to detect what cluster is master and resume database activity.
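For step 4, the recovery.conf written on cluster1 might look roughly like the sketch below. This is not our exact file: the host name and archive directory are placeholders (the restore path matches the Gluster archive directories that appear in the "cp: cannot stat" lines in the logs), and recovery_target_timeline = 'latest' is what lets the demoted cluster follow the newly promoted master onto its new timeline.

```
standby_mode = 'on'
primary_conninfo = 'host=<cluster2-host> application_name=s2s-replication'
restore_command = 'cp /net/glusterfs/pgwal/postgresql/wal/<archive-dir>/%f %p'
recovery_target_timeline = 'latest'
```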

Log from cluster1:
2016-04-04 06:32:09 CEST [48286-3] LOG:  received fast shutdown request
2016-04-04 06:32:09 CEST [48286-4] LOG:  aborting any active transactions
2016-04-04 06:32:09 CEST [15011-1] <someuser>@<db1> FATAL:  terminating connection due to administrator command
--- lots more terminating connections ---
2016-04-04 06:32:10 CEST [48301-51] LOG:  shutting down
2016-04-04 06:32:10 CEST [48301-52] LOG:  database system is shut down
2016-04-04 06:32:17 CEST [15203-1] [unknown]@[unknown] LOG:  incomplete startup packet
2016-04-04 06:32:17 CEST [15201-1] LOG:  database system was shut down at 2016-04-04 06:32:10 CEST
2016-04-04 06:32:17 CEST [15202-1] <someuser>@[unknown] FATAL:  the database system is starting up
2016-04-04 06:32:17 CEST [15206-1] <someuser>@[unknown] FATAL:  the database system is starting up
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001E.history': No such file or directory
2016-04-04 06:32:17 CEST [15201-2] LOG:  entering standby mode
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001D.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:17 CEST [15213-1] <someuser>@<db0> FATAL:  the database system is starting up
2016-04-04 06:32:17 CEST [15201-3] LOG:  consistent recovery state reached at B86/9C000090
2016-04-04 06:32:17 CEST [15201-4] LOG:  record with zero length at B86/9C000090
2016-04-04 06:32:17 CEST [15200-1] LOG:  database system is ready to accept read only connections
2016-04-04 06:32:17 CEST [15217-1] LOG:  started streaming WAL from primary at B86/9C000000 on timeline 29
2016-04-04 06:32:28 CEST [15217-2] LOG:  replication terminated by primary server
2016-04-04 06:32:28 CEST [15217-3] DETAIL:  End of WAL reached on timeline 29 at B86/9C000090.
2016-04-04 06:32:28 CEST [15217-4] LOG:  fetching timeline history file for timeline 30 from primary server
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001E.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001F.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001E.history': No such file or directory
2016-04-04 06:32:28 CEST [15201-5] LOG:  new target timeline is 30
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001E00000B860000009C': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo2-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:28 CEST [15217-5] LOG:  restarted WAL streaming at B86/9C000000 on timeline 30
2016-04-04 06:32:28 CEST [15201-6] LOG:  redo starts at B86/9C000090

Log from cluster2:
2016-04-04 06:32:12 CEST [1801-2] LOG:  replication terminated by primary server
2016-04-04 06:32:12 CEST [1801-3] DETAIL:  End of WAL reached on timeline 29 at B86/9C000090.
2016-04-04 06:32:12 CEST [1801-4] FATAL:  could not send end-of-streaming message to primary: no COPY in progress

cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001E.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:12 CEST [1472-6] LOG:  invalid record length at B86/9C000090
2016-04-04 06:32:16 CEST [27378-1] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.

cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001E.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:17 CEST [27383-1] FATAL:  could not connect to the primary server: FATAL:  the database system is starting up
        FATAL:  the database system is starting up

cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001E.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:23 CEST [27389-1] LOG:  started streaming WAL from primary at B86/9C000000 on timeline 29
2016-04-04 06:32:23 CEST [1472-7] LOG:  received promote request
2016-04-04 06:32:23 CEST [27389-2] FATAL:  terminating walreceiver process due to administrator command
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001E.history': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
2016-04-04 06:32:27 CEST [1472-8] LOG:  redo done at B86/9C000028
2016-04-04 06:32:27 CEST [1472-9] LOG:  last completed transaction was at log time 2016-04-04 06:32:03.399836+02
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D00000B860000009C': No such file or directory
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001E.history': No such file or directory
2016-04-04 06:32:27 CEST [1472-10] LOG:  selected new timeline ID: 30
cp: cannot stat '/net/glusterfs/pgwal/postgresql/wal/ufcolo1-dbshard2-ufoprod/0000001D.history': No such file or directory
2016-04-04 06:32:28 CEST [1472-11] LOG:  archive recovery complete
2016-04-04 06:32:28 CEST [1472-12] LOG:  MultiXact member wraparound protections are now enabled
2016-04-04 06:32:28 CEST [27513-1] LOG:  autovacuum launcher started
2016-04-04 06:32:28 CEST [1465-2] LOG:  database system is ready to accept connections

We have had three similar incidents:
2016-01-26 PostgreSQL version: 9.4.1 on Ubuntu 14.04
2016-03-07 PostgreSQL version: 9.4.5
2016-04-04 PostgreSQL version: 9.4.5/9.4.6 (the newly promoted cluster where the error occurred was running 9.4.6, the demoted cluster was running 9.4.5)
I'm not entirely sure the 2016-01-26 incident was the same problem, as I was on vacation and not many logs were saved. I only have one log row, which looks similar to the other ones and was triggered by an insert: "ERROR: could not read block 2023 in file "base/16390/22841": read only 0 of 8192 bytes". The offending file was an index. That problem was fixed with a VACUUM FULL on the table the index belonged to.
The incident 2016-03-07 was very similar to the latest incident 2016-04-04. The offending file was a pg_toast table.
From what I can see in the logs, the two clusters should be completely synchronized. Yet the base/16390/572026 file is not the same on both servers, since the newly promoted master fails to read (the last?) blocks from it.
Next time this happens I will try to check the size of the offending file.
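The offending file can also be mapped back to a relation from inside the database; a sketch, assuming the 16390 in the path is the database OID (so the query must be run while connected to that database) and that the filenode has not changed since the errors were logged:

```
-- Map the filenode from "base/16390/572026" back to a relation and
-- check its on-disk size; relkind 't' would indicate a TOAST table,
-- 'i' an index.
SELECT relname, relkind, pg_relation_size(oid) AS bytes
FROM pg_class
WHERE relfilenode = 572026;
```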

Is there something I missed in the switchover or could this be a bug?

Regards
Lars Arvidson



Re: Problem after replication switchover

From
Alan Hodgson
Date:
On Tuesday, April 05, 2016 12:55:04 PM Lars Arvidson wrote:
> Is there something I missed in the switchover or could this be a bug?
>

I'd guess it's probably more like option 3 - Glusterfs ate my database.



Re: Problem after replication switchover

From
Lars Arvidson
Date:
> I'd guess it's probably more like option 3 - Glusterfs ate my database.

Hi, thanks for your reply!
We do archive logs on a distributed Glusterfs volume in case the streaming replication gets too far behind and the transaction logs have been removed.
Would a restore of a corrupt archived log file give the symptoms we are seeing? Would PostgreSQL not detect that the log file was corrupt?
Is there a way I can analyze archived log files to see if this is the problem?

Regards
Lars Arvidson


Re: Problem after replication switchover

From
Alan Hodgson
Date:
On Wednesday, April 06, 2016 10:33:16 AM Lars Arvidson wrote:
> > I'd guess it's probably more like option 3 - Glusterfs ate my database.
>
> Hi, thanks for your reply!
> We do archive logs on a distributed Glusterfs volume in case the streaming
> replication gets too far behind and the transaction logs have been removed.
> Would a restore of a corrupt archived log file give the symptoms we are
> seeing? Would not Postgresql detect that the logfile was corrupt? Are there
> some way I can analyze archived logs files to see if this is the problem?
>

If it's just storing the logs, I doubt it's the cause of the problem. You can
ignore my message. I had too much fun fighting with Gluster recently.

I reread your original full post, and the one thing that stuck out for me was
"the clusters are now replicating from each other". I feel like that could be
a problem. But someone more intimate with the replication might want to input
on that.

Other than that, I wonder if you just have a hardware problem with your
storage.


Re: Problem after replication switchover

From
Lars Arvidson
Date:
> If it's just storing the logs, I doubt it's the cause of the problem. You can ignore my message. I had too much fun fighting with Gluster recently.
Hehe, hope you came through on top ;). Anyway, I added an md5sum calculation to the archiving script just to be able to verify that the files don't get corrupted. From what I can tell so far they have not been corrupted, but I'll keep an eye on it.
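The md5sum bookkeeping can be sketched like this (a hypothetical wrapper, not our actual script; ARCHIVE_DIR and the file names are stand-ins for the real Gluster path and WAL segments):

```shell
#!/bin/sh
# Hypothetical archive wrapper: copy the WAL segment, then record a
# checksum of what was actually written so the archive can be verified
# later with "md5sum -c".
ARCHIVE_DIR=${ARCHIVE_DIR:-/var/lib/postgresql/wal-archive}

archive_segment() {
    src=$1    # %p from archive_command: path to the segment to archive
    name=$2   # %f from archive_command: bare file name of the segment
    cp "$src" "$ARCHIVE_DIR/$name" || return 1
    # Checksum the copy (not the source), so we detect corruption that
    # happens on the archive volume itself.
    ( cd "$ARCHIVE_DIR" && md5sum "$name" >> checksums.md5 )
}

# To verify the whole archive later:
# (cd "$ARCHIVE_DIR" && md5sum -c checksums.md5)
```

The key point is checksumming the file after it lands on the Gluster volume, since that is the copy whose integrity is in question.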

> I reread your original full post, and the one thing that stuck out for me was "the clusters are now replicating from each other". I feel like that could be a problem. But someone more intimate with the replication might want to input on that.
Some input on that would be nice. I do think it is the way to go for ensuring that two clusters are in sync when switching master (although I cannot remember the source of that information). Josh Berkus writes a bit about cycles in replication (http://www.databasesoup.com/2013/01/cascading-replication-and-cycles.html), saying a "temporary cycle is a legitimate part of a transition between two different replication setups".

> Other than that, I wonder if you just have a hardware problem with your storage.
We use Fusion-io SSD cards for storage. If there were something wrong with them I would expect more problems than I am currently experiencing, and more random ones...

/Lars