On Thu, Aug 30, 2018 at 05:30:30PM -0400, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Wed, Aug 29, 2018 at 11:35:51AM -0400, Tom Lane wrote:
> >> As far as we can tell, that bug is a dozen years old, so it's not clear
> >> why you find that you can reproduce it only in 10.5. But there might be
> >> some subtle timing change accounting for that.
>
> > It seems to me there's one root problem occurring in (at least) two slightly
> > different ways. The issue/symptom that I've been seeing occurs in 10.5 but not
> > 10.4, and specifically at commit 2ce64ca, but not before.
>
> Yeah, as you probably saw in the other thread, we later realized that
> 2ce64ca created an additional pathway for ScanPgRelation to recurse;
> a pathway that's evidently easier to hit than the pre-existing ones.
> I note that both of your stack traces display ScanPgRelation recursion,
> so I'm feeling pretty confident that what you're seeing is the same
> thing.
>
> But, as Andres says, it'd be great if you could confirm whether the
> draft patches fix it for you.

I tested with relcache-rebuild.diff, which hasn't broken in 15 minutes, so I'm
confident it doesn't hit the additional recursive pathway, but I have to wait
a while and see whether autovacuum survives, too.

I tried to apply fix-missed-inval-msg-accepts-1.patch on top of PG10.5, but the
patch didn't apply, so I can test HEAD after the first patch soaks awhile.

Just curious, is there really any difficulty in reproducing this? Since I
realized this was a continuing issue and started to suspect pg10.5, it has
taken just about nothing to reproduce anywhere I've tried. I just tested 5
servers, and only one took more than a handful of seconds to fail. I gave up
waiting for a 6th server, because I found it was waiting on a pre-existing lock.

[pryzbyj@database ~]$ while :; do for a in pg_class_oid_index pg_class_relname_nsp_index pg_class_tblspc_relfilenode_index;do psql ts -qc "REINDEX INDEX $a"; done; done&
[pryzbyj@database ~]$ a=0; time while psql ts -qc ''; do a=$((1+a)); done ; echo "$a"
psql: FATAL: could not read block 0 in file "base/16400/313581263": read only 0 of 8192 bytes
real 0m1.772s
user 0m0.076s
sys 0m0.116s
47
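
For anyone who wants to try this elsewhere, here is a rough sketch of the same
two loops as one script. It is only a sketch: the PSQL, DB, MAX_TRIES and
RUN_REPRO variables and the function names are hypothetical conveniences, not
part of what I ran above, and the MAX_TRIES cap is an addition so the loop
stops on a server that never fails.

```shell
#!/bin/sh
# Sketch of the reproduction above. PSQL, DB, MAX_TRIES, RUN_REPRO and the
# function names are assumptions added for illustration.
PSQL=${PSQL:-psql}
DB=${DB:-ts}
MAX_TRIES=${MAX_TRIES:-1000}

# Reindex the three pg_class indexes forever; meant to run in the background.
reindex_loop() {
    while :; do
        for idx in pg_class_oid_index pg_class_relname_nsp_index \
                   pg_class_tblspc_relfilenode_index; do
            "$PSQL" "$DB" -qc "REINDEX INDEX $idx"
        done
    done
}

# Open new connections until one fails or MAX_TRIES is reached,
# then print how many connections succeeded.
count_connections() {
    n=0
    while [ "$n" -lt "$MAX_TRIES" ] && "$PSQL" "$DB" -qc ''; do
        n=$((n + 1))
    done
    echo "$n"
}

# Start the reindex loop in the background, count connections in the
# foreground, and stop the background loop when the script exits.
main() {
    reindex_loop >/dev/null 2>&1 &
    bgpid=$!
    trap 'kill "$bgpid" 2>/dev/null' EXIT
    count_connections
}

# Nothing touches the server unless RUN_REPRO=1 is set explicitly.
if [ "${RUN_REPRO:-0}" = 1 ]; then
    main
fi
```

Saved under any name, it would be run as "RUN_REPRO=1 DB=ts sh <script>". The
number printed is the count of connections that succeeded before the first
failure (or MAX_TRIES if none failed).
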
Justin