Home > mailing lists

Re: [HACKERS] Broken hint bits (freeze) - Mailing list pgsql-hackers

From	Dmitriy Sarafannikov
Subject	Re: [HACKERS] Broken hint bits (freeze)
Date	June 2, 2017 13:50:49
Msg-id	4C1127FC-74AE-4544-9B22-6A32FD0C0607@yandex.ru Whole thread Raw
In response to	Re: [HACKERS] Broken hint bits (freeze) (Amit Kapila <amit.kapila16@gmail.com>)
Responses	Re: [HACKERS] Broken hint bits (freeze) Re: [HACKERS] Broken hint bits (freeze)
List	pgsql-hackers

Tree view

Thanks for all.

We found the source of the problem. It was mistake in upgrade to 9.6.

We upgrade replica with rsync as it is in the documentation:
rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir

We must provide 100% read-only availability of our shard (master + 2 replicas).
So we can’t stop master and both replicas, upgrade them one by one and start them.
We do it as follows:
Close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica,
stop 2nd replica, upgrade it, start 2nd replica, start master, open master.
But upgraded replicas died under load without statistics and we decided to perform
analyze on master before upgrading replicas. In this case statistics would be copied to replicas by rsync.
The upgrade algorithm became as follows:
Close master, stop master, close master from replicas (iptables), upgrade master,
start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica,
start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica,
start master, open master.

If autovacuum starts vacuuming relations while we are performing analyze, wal records
generated by it will not be replayed on replicas, because next step is stopping
master with checkpoint and new redo location LSN (newer that these wal records)
will appear in pg_control file, which then will be copied by rsync to replicas.

If it was simple vacuum, we most likely will not see the consequences. Because it marks
tuples as deleted, and some of the next new tuples will be placed here, and due to FPW
replicas will receive correct page, identical to master.
But if it was vacuum to prevent wraparound, we will see situation like ours. Tuples on
master will be frozen, but on replicas not. And it will not change if nobody will not
update any tuple on this page.

It’s dangerous, because, if we perform switchover to replica, «corrupted» page
will be delivered to all replicas after next update of any tuple from this page.

We reproduced this case in our test environment and this assumption was confirmed.

Starting and stopping master after running pg_upgrade but before rsync to collect statistics
was a bad idea.

We know how to find such «corrupted» tuples. And we want to fix this by manually
freezing tuples via calling specially written C functions. Functions are «copy-pasted»
and simplified code from vacuum functions with SQL interface (see attachment).
Can you look on them? Do you think it is safe to use them for fixing corrupted pages
or is there a better way not to loose data?

Regards,
Dmitriy Sarafannikov

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

freeze_tuple.c

pgsql-hackers by date:

From: Craig Ringer
Date: 02 June 2017, 13:50:15
Subject: Re: [HACKERS] Do we need the gcc feature "__builtin_expect" topromote the branches prediction?

From: Amit Khandekar
Date: 02 June 2017, 14:07:14
Subject: Re: [HACKERS] UPDATE of partition key

Re: [HACKERS] Broken hint bits (freeze) - Mailing list pgsql-hackers

Attachment

Previous

Next