Re: [HACKERS] Broken hint bits (freeze) - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: [HACKERS] Broken hint bits (freeze)
Msg-id CAA4eK1L4WKBs5LQ52hZ_whrR=A-cQwsXvGwbaz7k6O3NYUjhKQ@mail.gmail.com
In response to Re: [HACKERS] Broken hint bits (freeze)  (Dmitriy Sarafannikov <dsarafannikov@yandex.ru>)
Responses Re: [HACKERS] Broken hint bits (freeze)
List pgsql-hackers
On Fri, Jun 2, 2017 at 4:20 PM, Dmitriy Sarafannikov
<dsarafannikov@yandex.ru> wrote:
> Thanks to everyone.
>
> We found the source of the problem. It was a mistake in the upgrade to 9.6.
>
> We upgraded the replicas with rsync, as described in the documentation:
> rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir
>
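The skip is inherent to --size-only: a heap file whose pages were only rewritten in place (e.g. freeze bits set) keeps its exact size, so rsync treats it as unchanged and never re-copies it. A tiny Python sketch of that decision rule (the page contents below are made up for illustration):

```python
# Sketch: rsync --size-only decides "unchanged" from file size alone,
# so an 8 kB heap page rewritten in place -- same size, different bytes
# (e.g. newly set freeze bits) -- is never re-copied to the replica.
PAGE = 8192

master_page = b"frozen tuple data".ljust(PAGE, b"\x00")
replica_page = b"unfrozen tuple data".ljust(PAGE, b"\x00")

def size_only_would_copy(src: bytes, dst: bytes) -> bool:
    """Mimic the --size-only check: copy only when sizes differ."""
    return len(src) != len(dst)

assert master_page != replica_page                            # contents diverged
assert not size_only_would_copy(master_page, replica_page)    # but rsync skips it
```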
> We must provide 100% read-only availability of our shard (master + 2 replicas).
> So we can’t stop the master and both replicas at once, upgrade them all, and start them again.
> We do it as follows:
> Close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica,
> stop 2nd replica, upgrade it, start 2nd replica, start master, open master.
> But the upgraded replicas died under load without statistics, so we decided to run
> analyze on the master before upgrading the replicas. That way the statistics would be copied to the replicas by rsync.
> The upgrade algorithm became as follows:
> Close master, stop master, close master from replicas (iptables), upgrade master,
> start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica,
> start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica,
> start master, open master.
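Spelled out as a dry-run shell sketch (all paths, binary locations, and host names below are placeholders, not the poster's actual setup; run=echo only prints the steps):

```shell
#!/bin/sh
# Dry-run sketch of the revised upgrade order described above.
# Every path/host is a placeholder; run=echo prints instead of executing.
run=echo

$run "close master from load (e.g. pgbouncer PAUSE)"
$run pg_ctl -D /data/9.5 stop -m fast
$run "block replica connections to master (iptables)"
$run pg_upgrade -b /usr/pgsql-9.5/bin -B /usr/pgsql-9.6/bin \
                -d /data/9.5 -D /data/9.6 --link
$run pg_ctl -D /data/9.6 start
$run vacuumdb --all --analyze-only       # collect statistics
$run pg_ctl -D /data/9.6 stop -m fast    # the checkpoint here advances the redo LSN
for r in replica1 replica2; do
    $run "stop $r; upgrade $r via rsync from master; start $r"
done
$run pg_ctl -D /data/9.6 start
$run "open master"
```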
>
> If autovacuum starts vacuuming relations while we are performing the analyze, the WAL
> records it generates will not be replayed on the replicas, because the next step is
> stopping the master with a checkpoint: a new redo location LSN (newer than those WAL
> records) appears in the pg_control file, which is then copied to the replicas by rsync.
>
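The mechanics come down to LSN arithmetic: replay on a replica starts at the redo LSN taken from the copied pg_control, so anything written before that point is silently skipped. A small Python sketch with made-up LSN values:

```python
def lsn_to_int(lsn: str) -> int:
    """Parse a PostgreSQL LSN like '0/3000028' into an integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# Made-up example values: autovacuum's freeze records fall before the
# new redo pointer that rsync copied from the master's pg_control.
autovacuum_wal_lsn = lsn_to_int("0/3000028")
new_redo_lsn = lsn_to_int("0/4000060")

# Replay on the replica starts at the redo LSN, so anything
# earlier is never applied there.
assert autovacuum_wal_lsn < new_redo_lsn
```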
> If it was a simple vacuum, we would most likely not see the consequences: it marks
> tuples as deleted, some of the next new tuples will be placed there, and due to FPW
> the replicas will receive the correct page, identical to the master's.
> But if it was a vacuum to prevent wraparound, we get a situation like ours: tuples on
> the master are frozen, but on the replicas they are not. And this will not change if
> nobody ever updates any tuple on this page.
>

Why didn't rsync make the copies on master and replica the same?

> It’s dangerous because, if we perform a switchover to a replica, the «corrupted» page
> will be delivered to all replicas after the next update of any tuple on this page.
>
> We reproduced this case in our test environment and this assumption was confirmed.
>
> Starting and stopping the master after running pg_upgrade, but before the rsync, just to
> collect statistics was a bad idea.
>
> We know how to find such «corrupted» tuples, and we want to fix this by manually
> freezing the tuples via specially written C functions. The functions are «copy-pasted»
> and simplified vacuum code with an SQL interface (see attachment).
> Can you look at them? Do you think it is safe to use them to fix the corrupted pages,
> or is there a better way not to lose data?
>

I haven't looked in detail, but it sounds like a slightly risky
proposition to manipulate tuples with C functions of the form you have
in your code.  I would have preferred some way to avoid this problem
by ensuring that the replicas are properly synced (receive the complete
data of the master via WAL) or by disabling autovacuum during the upgrade window.
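For merely locating (not fixing) the suspect tuples, something read-only along these lines may be safer than modifying pages. This is only a sketch: it assumes the pageinspect extension, uses a placeholder table name, and does a naive xid comparison that ignores wraparound:

```sql
-- Sketch only: report tuples whose xmin predates relfrozenxid but whose
-- infomask bits do not show them frozen.  Requires pageinspect;
-- 'some_table' is a placeholder.  HEAP_XMIN_COMMITTED|HEAP_XMIN_INVALID
-- together (0x0300 = 768) mean "xmin frozen" on 9.4+.
SELECT p.page, t.lp, t.t_xmin, t.t_infomask
FROM generate_series(0, pg_relation_size('some_table') / 8192 - 1) AS p(page),
     LATERAL heap_page_items(get_raw_page('some_table', p.page::int)) AS t
WHERE t.t_xmin IS NOT NULL
  AND t.t_xmin::text::bigint <
      (SELECT relfrozenxid::text::bigint
       FROM pg_class WHERE relname = 'some_table')
  AND (t.t_infomask & 768) <> 768;  -- xmin not marked frozen
```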

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


