Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM) - Mailing list pgsql-hackers

From Benedikt Grundmann
Subject Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)
Date
Msg-id CADbMkNPpn6HB9DXu=0zvq34-uC1cPu6AO=dc6A0OSsnG1dZXXQ@mail.gmail.com
In response to Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (David Powers <dpowers@janestreet.com>)
Responses Re: streaming replication, "frozen snapshot backup on it" and missing relfile (postgres 9.2.3 on xfs + LVM)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Today we have seen

2013-05-28 04:11:12.300 EDT,,,30600,,51a41946.7788,1,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation base/16416/293974676",,,,""
2013-05-28 04:11:13.316 EDT,,,30600,,51a41946.7788,2,,2013-05-27 22:41:10 EDT,,0,ERROR,XX000,"xlog flush request 1E95/AFB2DB10 is not satisfied --- flushed only to 1E7E/21CB79A0",,,,,"writing block 9 of relation base/16416/293974676",,,,""

while taking the backup of the primary.  We have been running like that for a few days, and today is the first day we have seen these problems again.  So it's not entirely deterministic; we don't yet know what we have to do to reproduce it.
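
For anyone comparing the numbers: the buffer being written claims an LSN of 1E95/AFB2DB10, well ahead of anything this cluster has in its WAL (flushed only to 1E7E/21CB79A0).  That is consistent with the data files having been copied from a later state than the WAL that accompanied them.  A quick way to see where a cluster thinks it is, sketched in the same psql style the backup script uses (user and database below are placeholders):

# run against the affected (restored) cluster
/usr/pgsql-9.2/bin/psql -U postgres -d postgres -c "SELECT pg_current_xlog_location();"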

So this makes Robert's theory more likely.  However, we have also been using this method (LVM + rsync with hardlinks from the primary) for years without these problems.  So the big question is: what changed?

One hypothesis is that it is related to the primary now running with wal_level = hot_standby instead of minimal or archive (which we used before we switched to 9.2).
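
For reference, the setting and the standby check, sketched with the same psql invocation the backup script uses (hostname and superuser are the ones from the backup log below):

# confirm the WAL level in effect on the primary
/usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c "SHOW wal_level;"
# and whether the server being copied is itself in recovery (i.e. a hot standby)
/usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c "SELECT pg_is_in_recovery();"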

Here are the entries in the log related to the startup of the corrupt testing cluster:

2013-05-27 22:41:10.029 EDT,,,30598,,51a41946.7786,1,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"database system was interrupted; last known up at 2013-05-26 21:01:09 EDT",,,,,,,,,""
2013-05-27 22:41:10.029 EDT,,,30599,"",51a41946.7787,1,"",2013-05-27 22:41:10 EDT,,0,LOG,00000,"connection received: host=172.27.65.204 port=55279",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,,,30598,,51a41946.7786,2,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2013-05-27 22:41:10.030 EDT,"as-elephant","postgres",30599,"172.27.65.204:55279",51a41946.7787,2,"",2013-05-27 22:41:10 EDT,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2013-05-27 22:41:10.031 EDT,,,30598,,51a41946.7786,3,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"redo starts at 1E7E/2152B178",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,4,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"record with zero length at 1E7E/215AC6B8",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,5,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"redo done at 1E7E/215AC688",,,,,,,,,""
2013-05-27 22:41:10.094 EDT,,,30598,,51a41946.7786,6,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"last completed transaction was at log time 2013-05-26 21:09:08.06351-04",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30595,,51a41945.7783,1,,2013-05-27 22:41:09 EDT,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2013-05-27 22:41:10.134 EDT,,,30603,,51a41946.778b,1,,2013-05-27 22:41:10 EDT,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""
2013-05-27 22:41:15.037 EDT,,,30608,"",51a4194b.7790,1,"",2013-05-27 22:41:15 EDT,,0,LOG,00000,"connection received: host=172.27.65.204 port=55283",,,,,,,,,""

This means we currently do NOT have a way to make backups that we trust.  We are very open to suggestions for alternative methods we should consider.  The database is of non-trivial size by now:

proddb=> select pg_size_pretty(pg_database_size('proddb'));
 pg_size_pretty
----------------
 1294 GB
(1 row)

The backup script itself is by now a rather long OCaml program, so I doubt there is much value in posting it to this list.  But here is the log of what it did, which should be fairly self-explanatory:

proddb backup: starting
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" (enqueued)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" (running as pid: [23422])
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 ls -d /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb'.*-*-*.*-*-*'" ([23422] exited normally)
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as pid: [23433])
proddb backup: /bin/mkdir -p /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([23433] exited normally)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup [23437]: 2013-05-27 20:00:02.232864-04:00 Debug backing up tot-dbc-001:proddb to ((username postgres) (hostname 127.0.0.1)
proddb backup [23437]:  (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress)
proddb backup [23437]:  (last_backup_dir (..//proddb.2013-05-26.20-00-00)))
proddb backup [23437]: 2013-05-27 20:00:02.232869-04:00 Debug running /usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c checkpoint
proddb backup [23437]: 2013-05-27 20:00:02.459035-04:00 Debug Running preliminary rsync outside of snapshot to minimize its lifetime.
proddb backup [23437]: 2013-05-27 20:00:02.459049-04:00 Debug running rsync from /database/ to postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 20:00:02.459053-04:00 Debug running nice -n 19 rsync --link-dest ..//proddb.2013-05-26.20-00-00 -a -q --delete /database/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:05.742036-04:00 Debug Taking snapshot of database
proddb backup [23437]: 2013-05-27 21:43:05.742056-04:00 Debug finding LVM device for /database
proddb backup [23437]: 2013-05-27 21:43:05.796354-04:00 Debug real path of /database is /mnt/dbc01-d1/proddb
proddb backup [23437]: 2013-05-27 21:43:05.797222-04:00 Debug /database is located on LVM device /dev/dbc01/d1 at relative path /proddb
proddb backup [23437]: 2013-05-27 21:43:05.797237-04:00 Debug running lvcreate --size 500G --snapshot --name proddb-snapshot /dev/dbc01/d1
proddb backup [23437]: 2013-05-27 21:43:06.487274-04:00 Debug running mount -o nouuid,ro /dev/dbc01/proddb-snapshot /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 21:43:06.682700-04:00 Debug running rsync from /tmp/snapfchjHv/proddb/ to postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 21:43:06.682710-04:00 Debug running nice -n 19 rsync --inplace --link-dest ..//proddb.2013-05-26.20-00-00 -a -q --delete /tmp/snapfchjHv/proddb/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress//
proddb backup [23437]: 2013-05-27 22:38:45.817805-04:00 Debug Deleting snapshot of database
proddb backup [23437]: 2013-05-27 22:38:45.817824-04:00 Debug running umount /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096200-04:00 Debug removing mount point /tmp/snapfchjHv
proddb backup [23437]: 2013-05-27 22:38:47.096732-04:00 Debug running lvremove -f /dev/dbc01/proddb-snapshot
proddb backup [23437]: 2013-05-27 22:38:47.532788-04:00 Debug running lvdisplay /dev/dbc01/proddb-snapshot
proddb backup: /bin/bash -c "/usr/bin/ssh -l root tot-dbc-001 /mnt/global/base/bin/db tools backup backup -v -src '"((dbname proddb) (hostname tot-dbc-001) (superuser postgres_prod) (basedir /database) (version 9.2))"' -dst '"((username postgres) (hostname 127.0.0.1) (backup_dir /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress) (last_backup_dir (..//proddb.2013-05-26.20-00-00)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (enqueued)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress (running as pid: [14318])
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 cd /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress ([14318] exited normally)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 (enqueued)
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 (running as pid: [14319])
proddb backup: /usr/bin/ssh postgres@tot-dbc-001 mv /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.in-progress /net/nyc-isilon1/ifs/data/backup-dbc//proddb/proddb.2013-05-27.20-00-00 ([14319] exited normally)
proddb backup: finished; status = ok
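
In outline, the steps the script performs are roughly these (a simplified shell sketch reconstructed from the log above; device names, paths and the snapshot mount point are copied from the log, while the real program's error handling, snapshot sizing and retention logic are omitted):

# 1. Force a checkpoint on the primary so the preliminary copy is as current as possible.
/usr/pgsql-9.2/bin/psql -U postgres_prod -h tot-dbc-001 -d postgres -c checkpoint

# 2. Preliminary rsync outside the snapshot, to keep the snapshot's lifetime short.
nice -n 19 rsync --link-dest ../proddb.2013-05-26.20-00-00 -a -q --delete \
    /database/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc/proddb/proddb.in-progress/

# 3. Snapshot the LVM volume holding /database and mount it read-only.
lvcreate --size 500G --snapshot --name proddb-snapshot /dev/dbc01/d1
mount -o nouuid,ro /dev/dbc01/proddb-snapshot /tmp/snapfchjHv

# 4. Final rsync from the frozen snapshot into the same target directory.
nice -n 19 rsync --inplace --link-dest ../proddb.2013-05-26.20-00-00 -a -q --delete \
    /tmp/snapfchjHv/proddb/ postgres@127.0.0.1:/net/nyc-isilon1/ifs/data/backup-dbc/proddb/proddb.in-progress/

# 5. Tear down the snapshot and publish the backup under its timestamped name.
umount /tmp/snapfchjHv
lvremove -f /dev/dbc01/proddb-snapshot
mv /net/nyc-isilon1/ifs/data/backup-dbc/proddb/proddb.in-progress \
   /net/nyc-isilon1/ifs/data/backup-dbc/proddb/proddb.2013-05-27.20-00-00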

Thanks in advance,

Bene




On Thu, May 23, 2013 at 5:30 PM, David Powers <dpowers@janestreet.com> wrote:
Thanks for the response.

I have some evidence against an issue in the backup procedure (though I'm not ruling it out).  We moved back to taking the backup off the primary, and all errors for all three clusters went away.  All of the hardware is the same, and the OS and Postgres versions are largely the same (9.2.3 vs. 9.2.4 in some cases, various patch levels of CentOS 6.3 for the OS).  The backup code is exactly the same, just pointed at a different set of boxes.

Currently I'm just running for a couple of days to ensure that we have viable static backups.  After that I'll redo one of the restores from a suspected backup and will post the logs.

-David


On Thu, May 23, 2013 at 11:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 21, 2013 at 11:59 AM, Benedikt Grundmann
<bgrundmann@janestreet.com> wrote:
> We are seeing these errors on a regular basis on the testing box now.  We
> have even changed the backup script to
> shutdown the hot standby, take lvm snapshot, restart the hot standby, rsync
> the lvm snapshot.  It still happens.
>
> We have never seen this before we introduced the hot standby.  So we will
> now revert to taking the backups from lvm snapshots on the production
> database.  If you have ideas of what else we should try / what information
> we can give you to debug this let us know and we will try to do so.
>
> Until then we will sadly operate on the assumption that the combination of
> hot standby and "frozen snapshot" backup of it is not production ready.

I'm pretty suspicious that your backup procedure is messed up in some
way.  The fact that you got invalid page headers is really difficult
to attribute to a PostgreSQL bug.  A number of the other messages that
you have posted also tend to indicate either corruption, or that WAL
replay has stopped early.  It would be interesting to see the logs
from when the clone was first started up, juxtaposed against the later
WAL flush error messages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

