Hi Jeremy,
On 08/28/2018 10:46 PM, Jeremy Finzel wrote:
> We have hit this error again, and we plan to snapshot the database
> so as to be able to do whatever troubleshooting we can.
>
>
> I am happy to report that we were able to get replication working again
> by running snapshots of the systems in question on servers running the
> latest point release 9.6.10, and replication simply works and skips over
> these previously erroring relfilenodes. So whatever fixes were made in
> this point release to logical decoding seems to have fixed the issue.
>
Interesting.
So you were running 9.6.9 before, and it hit the issue and was not able
to recover. You then took a filesystem snapshot, started 9.6.10 on that
snapshot, and it recovered without hitting the issue?
I quickly went through the commits in the 9.6 branch between 9.6.9 and
9.6.10, looking for anything that might be related, and these three
commits seem possibly relevant (mostly because they touch invalidations,
vacuum, ...):

6a46aba1cd6dd7c5af5d52111a8157808cbc5e10
    Fix bugs in vacuum of shared rels, by keeping their relcache entries
    current.

da10d6a8a94eec016fa072d007bced9159a28d39
    Fix "base" snapshot handling in logical decoding

0a60a291c9a5b8ecdf44cbbfecc4504e3c21ef49
    Add table relcache invalidation to index builds.

But without more information it's hard to say which of those commits
(if any) did the trick.
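
For anyone who wants to repeat that scan of the 9.6 branch, here is a
rough sketch of how the commit list could be narrowed down. It assumes a
local clone of the PostgreSQL git repository and the release tags
REL9_6_9 and REL9_6_10. The paths are only my guess at the areas worth
looking at (logical decoding, relcache, vacuum), and the helper names
build_log_command and list_commits are made up for this example.

```python
import subprocess

# Release tags bounding the range of interest (assumed tag naming).
OLD_TAG = "REL9_6_9"
NEW_TAG = "REL9_6_10"

# Illustrative guess at the relevant source areas; adjust as needed.
PATHS = [
    "src/backend/replication/logical",
    "src/backend/utils/cache",
    "src/backend/commands/vacuum.c",
]


def build_log_command(old_tag, new_tag, paths):
    """Build the git argv listing commits in old_tag..new_tag that touch paths."""
    return ["git", "log", "--oneline", f"{old_tag}..{new_tag}", "--", *paths]


def list_commits(repo_dir, old_tag=OLD_TAG, new_tag=NEW_TAG, paths=PATHS):
    """Run git log inside repo_dir and return one line per matching commit."""
    cmd = build_log_command(old_tag, new_tag, paths)
    result = subprocess.run(
        cmd, cwd=repo_dir, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling list_commits("/path/to/postgresql") (a hypothetical checkout
location) returns the candidate commits, which still need to be read one
by one. A path filter only narrows the list, it does not tell us which
commit changed the behavior.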
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services