Re: [HACKERS] Broken hint bits (freeze) - Mailing list pgsql-hackers

From: Amit Kapila
Subject: Re: [HACKERS] Broken hint bits (freeze)
Msg-id: CAA4eK1L4WKBs5LQ52hZ_whrR=A-cQwsXvGwbaz7k6O3NYUjhKQ@mail.gmail.com
In response to: Re: [HACKERS] Broken hint bits (freeze) (Dmitriy Sarafannikov <dsarafannikov@yandex.ru>)
List: pgsql-hackers
On Fri, Jun 2, 2017 at 4:20 PM, Dmitriy Sarafannikov <dsarafannikov@yandex.ru> wrote:
>
> Thanks to all.
>
> We found the source of the problem: it was a mistake in the upgrade to 9.6.
>
> We upgraded the replicas with rsync, as described in the documentation:
> rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir
>
> We must provide 100% read-only availability of our shard (master + 2 replicas),
> so we cannot stop the master and both replicas, upgrade them one by one, and
> start them again. We do it as follows:
> take the master out of load, stop the master, upgrade it, stop the 1st replica,
> upgrade it, start the 1st replica, stop the 2nd replica, upgrade it, start the
> 2nd replica, start the master, put the master back under load.
>
> But the upgraded replicas died under load without statistics, so we decided to
> run ANALYZE on the master before upgrading the replicas; that way the statistics
> would be copied to the replicas by rsync. The upgrade algorithm became:
> take the master out of load, stop the master, cut the master off from the
> replicas (iptables), upgrade the master, start the master, run ANALYZE, stop
> the master, stop the 1st replica, upgrade the 1st replica, start the 1st
> replica, stop the 2nd replica, upgrade the 2nd replica, start the 2nd replica,
> start the master, put the master back under load.
>
> If autovacuum starts vacuuming relations while we are running ANALYZE, the WAL
> records it generates will never be replayed on the replicas: the next step stops
> the master with a checkpoint, so a new redo location LSN (newer than those WAL
> records) appears in the pg_control file, which rsync then copies to the replicas.
>
> With a plain vacuum we would most likely never see the consequences, because it
> only marks tuples as deleted; some later tuples will be placed there, and thanks
> to full-page writes the replicas will receive a correct page, identical to the
> master's. But with an anti-wraparound vacuum we get a situation like ours:
> tuples are frozen on the master but not on the replicas, and nothing changes as
> long as nobody ever updates a tuple on that page.

Why didn't rsync make the copies on the master and the replicas the same?

> It is dangerous because, if we perform a switchover to a replica, the
> "corrupted" page will be delivered to all replicas after the next update of any
> tuple on that page.
>
> We reproduced this case in our test environment and confirmed this assumption.
>
> Starting and stopping the master after running pg_upgrade, but before the rsync,
> in order to collect statistics was a bad idea.
>
> We know how to find such "corrupted" tuples, and we want to fix them by manually
> freezing them via specially written C functions. The functions are copy-pasted
> and simplified code from the vacuum functions, with an SQL interface (see
> attachment). Can you look at them? Do you think it is safe to use them to fix
> the corrupted pages, or is there a better way not to lose data?
>

I haven't looked in detail, but it sounds like a slightly risky proposition to
manipulate tuples with C functions of the form you have in your code. I would
have preferred some way of avoiding the problem in the first place, either by
ensuring that the replicas are properly synced (receive the master's complete
data via WAL) or by disabling autovacuum for the vulnerable window.
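For the autovacuum route, something along these lines might work (a minimal,
untested sketch, assuming ALTER SYSTEM is available, i.e. 9.4 or later):

    -- Sketch: keep autovacuum quiet between the post-upgrade ANALYZE and
    -- the final shutdown, so it generates no WAL the replicas will miss.
    ALTER SYSTEM SET autovacuum = off;
    SELECT pg_reload_conf();  -- autovacuum is a SIGHUP parameter, no restart needed
    -- ... run ANALYZE, stop the master, rsync the replicas ...
    ALTER SYSTEM RESET autovacuum;
    SELECT pg_reload_conf();

And for locating the affected tuples, one could compare the frozen bits directly
between master and replica with the pageinspect extension (again only a sketch;
'mytable' and block 0 are placeholders):

    -- Run the same query on the master and on a replica and diff the output;
    -- a tuple with xmin_frozen true on the master but false on the replica
    -- is one of the damaged ones.
    -- 0x0300 = HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID = HEAP_XMIN_FROZEN.
    SELECT lp, t_xmin,
           (t_infomask & x'0300'::int) = x'0300'::int AS xmin_frozen
      FROM heap_page_items(get_raw_page('mytable', 0))
     WHERE t_infomask IS NOT NULL;  -- skip unused line pointers

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com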