Thread: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

We've seen several occurrences of $subject in the buildfarm in the past
month or so.  Scraping the logs, I find

 coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 gull     | 2018-06-21 04:02:11 | HEAD          | Check | 2018-06-20 21:27:06.843 PDT [31138:3] FATAL:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.137 UTC [20710034:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.585 UTC [24022:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [26072:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [25990:82] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 coypu    | 2018-07-17 01:27:39 | HEAD          | Check | 2018-07-17 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
 coypu    | 2018-07-20 11:18:02 | HEAD          | Check | 2018-07-20 13:32:14.104 CEST [5b51c833.4884:131] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes

While I didn't go back indefinitely far, there are no other occurrences of
"could not read block" failures in the last three months.  This suggests
strongly that we broke something in early June.  Don't know what though.
Ideas?

In case anyone's wondering, 2662 is pg_class_oid_index.

            regards, tom lane


On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
> We've seen several occurrences of $subject in the buildfarm in the past
> month or so.  Scraping the logs, I find
> 
>  coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  gull     | 2018-06-21 04:02:11 | HEAD          | Check | 2018-06-20 21:27:06.843 PDT [31138:3] FATAL:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.137 UTC [20710034:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.585 UTC [24022:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [26072:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [25990:82] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  coypu    | 2018-07-17 01:27:39 | HEAD          | Check | 2018-07-17 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
>  coypu    | 2018-07-20 11:18:02 | HEAD          | Check | 2018-07-20 13:32:14.104 CEST [5b51c833.4884:131] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> 
> While I didn't go back indefinitely far, there are no other occurrences of
> "could not read block" failures in the last three months.  This suggests
> strongly that we broke something in early June.  Don't know what though.
> Ideas?
> 
> In case anyone's wondering, 2662 is pg_class_oid_index.

While I do not immediately see how, and the affected versions don't
really point that way, the timing does make me think of
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0

HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
but I don't immediately know where.

Greetings,

Andres Freund


Hi,

On 2018-07-20 13:24:50 -0700, Andres Freund wrote:
> On 2018-07-20 16:15:14 -0400, Tom Lane wrote:
> > We've seen several occurrences of $subject in the buildfarm in the past
> > month or so.  Scraping the logs, I find
> > 
> >  coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.505 CEST [5b22deb8.30e1:124] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  coypu    | 2018-06-14 21:17:49 | HEAD          | Check | 2018-06-14 23:31:44.653 CEST [5b22deb8.6d4d:132] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  gull     | 2018-06-21 04:02:11 | HEAD          | Check | 2018-06-20 21:27:06.843 PDT [31138:3] FATAL:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.138 UTC [14353240:43] pg_regress/create_table_like ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  mandrill | 2018-06-22 16:14:16 | HEAD          | Check | 2018-06-22 16:46:24.137 UTC [20710034:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.585 UTC [24022:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  lapwing  | 2018-07-05 07:20:02 | HEAD          | Check | 2018-07-05 07:21:45.591 UTC [23941:39] pg_regress/roleattributes ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [26072:1] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  mantid   | 2018-07-17 01:20:04 | REL_11_STABLE | Check | 2018-07-16 21:21:32.557 EDT [25990:82] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  coypu    | 2018-07-17 01:27:39 | HEAD          | Check | 2018-07-17 03:47:26.264 CEST [5b4d4aa4.3ac4:71] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> >  coypu    | 2018-07-20 11:18:02 | HEAD          | Check | 2018-07-20 13:32:14.104 CEST [5b51c833.4884:131] ERROR:  could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
> > 
> > While I didn't go back indefinitely far, there are no other occurrences of
> > "could not read block" failures in the last three months.  This suggests
> > strongly that we broke something in early June.  Don't know what though.
> > Ideas?
> > 
> > In case anyone's wondering, 2662 is pg_class_oid_index.
> 
> While I do not immediately see how, and the affected versions don't
> really point that way, the timing does make me think of
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a54e1f1587793b5bf926630ec9ce142e4578d7a0
> 
> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
> but I don't immediately know where.

Hm, there was:
http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de

I don't immediately see it being responsible, but I wonder if there's a
chance it actually is: Note that it happens in a parallel group that
includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
don't immediately see how it could apply.

But either way, we need to fix the issue mentioned in the above email.

Greetings,

Andres Freund


On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
>> HEAD/REL_11_STABLE apparently solely being affected points elsewhere,
>> but I don't immediately know where.
>
> Hm, there was:
> http://archives.postgresql.org/message-id/20180628150209.n2qch5jtn3vt2xaa%40alap3.anarazel.de
>
>
> I don't immediately see it being responsible, but I wonder if there's a
> chance it actually is: Note that it happens in a parallel group that
> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
> don't immediately see how it could apply.

It's now pretty clear that it was not that particular bug, since I
pushed a fix, and yet the issue hasn't gone away on affected buildfarm
animals. There was a recurrence of the problem on lapwing, for example
[1].

Anyway, "VACUUM FULL pg_class" should be expected to corrupt
pg_class_oid_index when we happen to get a parallel build, since
pg_class is a mapped relation, and I've identified that as a problem
for parallel CREATE INDEX [2]. If that was the ultimate cause of the
issue, it would explain why only REL_11_STABLE and master are
involved.

My guess is that the metapage considers the root page to be at block 3
(block 3 is often the root page for small though not tiny B-Trees),
which for whatever reason is where we get a short read. I don't know
why there is a short read, but corrupting mapped catalog indexes at
random can be expected to cause all kinds of chaos, so that doesn't
mean much.

In any case, I'll probably push a fix for this other bug on Friday,
barring any objections. It's possible that that will make the problem
go away.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2018-08-04%2004%3A20%3A01
[2] https://www.postgresql.org/message-id/CAH2-Wzn=j0i8rxCAo6E=tBO9XuYXb8HbUsnW7J_StKON8dDOhQ@mail.gmail.com
-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 25, 2018 at 4:07 PM, Andres Freund <andres@anarazel.de> wrote:
>> I don't immediately see it being responsible, but I wonder if there's a
>> chance it actually is: Note that it happens in a parallel group that
>> includes vacuum.sql, which does a VACUUM FULL pg_class - but I still
>> don't immediately see how it could apply.

> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
> pg_class_oid_index when we happen to get a parallel build, since
> pg_class is a mapped relation, and I've identified that as a problem
> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
> issue, it would explain why only REL_11_STABLE and master are
> involved.

Oooh ... but pg_class wouldn't be big enough to get a parallel
index rebuild during that test, would it?

            regards, tom lane


On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
>> pg_class_oid_index when we happen to get a parallel build, since
>> pg_class is a mapped relation, and I've identified that as a problem
>> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
>> issue, it would explain why only REL_11_STABLE and master are
>> involved.
>
> Oooh ... but pg_class wouldn't be big enough to get a parallel
> index rebuild during that test, would it?

Typically not, but I don't think that we can rule it out right away.
"min_parallel_table_scan_size = 0" will make it happen, and that
happens in quite a few regression tests. Though that doesn't include
vacuum.sql.

The animals mandrill, mantid and lapwing are all
"force_parallel_mode=regress", according to their notes. That actually
isn't enough to force parallel CREATE INDEX, though I tend to think it
should be. I don't see anything interesting in their "extra_config",
but perhaps "min_parallel_table_scan_size = 0" got in some other way.
I don't know all that much about the buildfarm client code, and it's
late.

-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?

> Typically not, but I don't think that we can rule it out right away.

Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
a parallel build happened or not, so that we can check the buildfarm logs
next time we see that failure?

> I don't know all that much about the buildfarm client code, and it's
> late.

It doesn't really stick in any undocumented configuration changes,
AFAIK.  Possibly Andrew would have some more insight.

            regards, tom lane


On Wed, Aug 8, 2018 at 10:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Typically not, but I don't think that we can rule it out right away.
>
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see that failure?

I can do that tomorrow. Of course, it might be easier to just push the
pending fix for the mapped relation bug, and see if that makes the
issue go away. That would take longer, but it's also the only thing
that's likely to be definitive anyway.

If you want to go ahead with it yourself, I suggest modifying the
DEBUG1 elog()s within index_build() to be LOG elevel.
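
Concretely, that's just flipping the elevel of the two ereport() calls
there -- roughly the following, paraphrased from memory of index.c, so the
exact message wording may differ:

	/* In index_build(): bump elevel so the message lands in buildfarm logs */
	if (indexInfo->ii_ParallelWorkers == 0)
		ereport(LOG,				/* was DEBUG1 */
				(errmsg("building index \"%s\" on table \"%s\" serially",
						RelationGetRelationName(indexRelation),
						RelationGetRelationName(heapRelation))));
	else
		ereport(LOG,				/* was DEBUG1 */
				(errmsg_plural("building index \"%s\" on table \"%s\" with request for %d parallel worker",
							   "building index \"%s\" on table \"%s\" with request for %d parallel workers",
							   indexInfo->ii_ParallelWorkers,
							   RelationGetRelationName(indexRelation),
							   RelationGetRelationName(heapRelation),
							   indexInfo->ii_ParallelWorkers)));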

-- 
Peter Geoghegan



On 08/09/2018 01:03 AM, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>>> index rebuild during that test, would it?
>> Typically not, but I don't think that we can rule it out right away.
> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
> a parallel build happened or not, so that we can check the buildfarm logs
> next time we see that failure?
>
>> I don't know all that much about the buildfarm client code, and it's
>> late.
> It doesn't really stick in any undocumented configuration changes,
> AFAIK.  Possibly Andrew would have some more insight.



No, everything should be visible in the config. Hidden things are what I 
try to avoid.

The only things configure() adds are the prefix and pgport 
settings, and the cache-file. make() only adds a "-j jobs" if so 
configured. make_check() normally just runs "make NO_LOCALE=1 check".



cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Wed, Aug 8, 2018 at 10:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Hmmm ... maybe we should temporarily stick in an elog(LOG) showing whether
>> a parallel build happened or not, so that we can check the buildfarm logs
>> next time we see that failure?
>
> I can do that tomorrow. Of course, it might be easier to just push the
> pending fix for the mapped relation bug, and see if that makes the
> issue go away. That would take longer, but it's also the only thing
> that's likely to be definitive anyway.

I'm going to push my fix for the relmapper.c parallel CREATE INDEX bug
tomorrow, and see what happens. I have a hard time imagining how there
could be a parallel index build on pg_class, now that Andrew has
indicated there is no reason to think that
"min_parallel_table_scan_size = 0" could slip in.

-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oooh ... but pg_class wouldn't be big enough to get a parallel
>> index rebuild during that test, would it?

> Typically not, but I don't think that we can rule it out right away.

Didn't take long to show that the relmapper issue wasn't it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2018-08-10%2021%3A21%3A40

So we're back to square one.  Although Tomas' recent report might
give us something new to chase.

            regards, tom lane


On Fri, Aug 10, 2018 at 7:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Didn't take long to show that the relmapper issue wasn't it:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=coypu&dt=2018-08-10%2021%3A21%3A40
>
> So we're back to square one.  Although Tomas' recent report might
> give us something new to chase.

Thanks for pointing out that Tomas had a lead - I missed that.

I'm concerned that this item has the potential to delay the release,
since, as you said, we're back to the drawing board.

-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> I'm concerned that this item has the potential to delay the release,
> since, as you said, we're back to the drawing board.

Me too.  I will absolutely not vote to release 11.0 before we've
solved this ...

            regards, tom lane


On Fri, Aug 10, 2018 at 8:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Me too.  I will absolutely not vote to release 11.0 before we've
> solved this ...

I believe that that's the right call, assuming things don't change.

This is spooky in a way that creates a lot of doubts in my mind. I
don't think it's at all advisable to make an executive decision to
release, while still not having the foggiest idea what's really going
on.

-- 
Peter Geoghegan


On 08/11/2018 05:02 AM, Tom Lane wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> I'm concerned that this item has the potential to delay the release,
>> since, as you said, we're back to the drawing board.
> 
> Me too.  I will absolutely not vote to release 11.0 before we've
> solved this ...
> 

Not sure. I can easily reproduce that on REL_10_STABLE, so it's not a
new issue specific to 11.0.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 08/11/2018 03:16 PM, Tomas Vondra wrote:
> On 08/11/2018 05:02 AM, Tom Lane wrote:
>> Peter Geoghegan <pg@bowt.ie> writes:
>>> I'm concerned that this item has the potential to delay the release,
>>> since, as you said, we're back to the drawing board.
>>
>> Me too.  I will absolutely not vote to release 11.0 before we've
>> solved this ...
>>
> 
> Not sure. I can easily reproduce that on REL_10_STABLE, so it's not a
> new issue specific to 11.0.
> 

For the record, I can actually reproduce this on 9.6 (haven't tried
older releases, but I suspect it's there too). Instead of using the
failing subscription, I've used another pgbench script doing this:

  SET statement_timeout = 5;
  COPY t TO '/dev/null';

and doing something like:

   pgbench -n -c 20 -T 300 -f copy.sql test

I don't think the 20-client COPY concurrency is necessary, it just makes
restarts after the statement_timeout faster.

This produces about the same backtrace as reported on the other thread.
Attached are both a plain 'bt' and a 'bt full'.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Hi,

On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:
> For the record, I can actually reproduce this on 9.6 (haven't tried
> older releases, but I suspect it's there too). Instead of using the
> failing subscription, I've used another pgbench script doing this:

>   SET statement_timeout = 5;
>   COPY t TO '/dev/null';
> 
> and doing something like:
> 
>    pgbench -n -c 20 -T 300 -f copy.sql test

Just to confirm: That's with the vacuum full and insert running
concurrently? And then just restarting the above copy.sql (as pgbench
errors out after the timeouts) until you get the error?

I'm a bit confused what the copy + timeout is doing here? It shouldn't
trigger any invalidations itself, and the backtrace appears to be from
the insert.sql you posted earlier? Unclear why a copy to /dev/null
should trigger anything like this?

Greetings,

Andres Freund


On 08/11/2018 04:08 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-08-11 15:40:19 +0200, Tomas Vondra wrote:
>> For the record, I can actually reproduce this on 9.6 (haven't tried
>> older releases, but I suspect it's there too). Instead of using the
>> failing subscription, I've used another pgbench script doing this:
> 
>>   SET statement_timeout = 5;
>>   COPY t TO '/dev/null';
>>
>> and doing something like:
>>
>>    pgbench -n -c 20 -T 300 -f copy.sql test
> 
> Just to confirm: That's with the vacuum full and insert running
> concurrently? And then just restarting the above copy.sql (as pgbench
> errors out after the timeouts) until you get the error?
> 

Yes, pretty much.

> I'm a bit confused what the copy + timeout is doing here? It shouldn't
> trigger any invalidations itself, and the backtrace appears to be from
> the insert.sql you posted earlier? Unclear why a copy to /dev/null
> should trigger anything like this?
> 

My goal was to simulate the failing subscription sync, which does COPY
and fails because of duplicate data. The statement_timeout seemed like a
good approximation of that. It may be unnecessary, I don't know.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


On 8/9/18, 12:56 AM, "Peter Geoghegan" <pg@bowt.ie> wrote:

    On Wed, Aug 8, 2018 at 7:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
    >> Anyway, "VACUUM FULL pg_class" should be expected to corrupt
    >> pg_class_oid_index when we happen to get a parallel build, since
    >> pg_class is a mapped relation, and I've identified that as a problem
    >> for parallel CREATE INDEX [2]. If that was the ultimate cause of the
    >> issue, it would explain why only REL_11_STABLE and master are
    >> involved.
    >
    > Oooh ... but pg_class wouldn't be big enough to get a parallel
    > index rebuild during that test, would it?
    
    Typically not, but I don't think that we can rule it out right away.
    "min_parallel_table_scan_size = 0" will make it happen, and that
    happens in quite a few regression tests. Though that doesn't include
    vacuum.sql.
    
    The animals mandrill, mantid and lapwing are all
    "force_parallel_mode=regress", according to their notes. That actually
    isn't enough to force parallel CREATE INDEX, though I tend to think it
    should be. I don't see anything interesting in their "extra_config",
    but perhaps "min_parallel_table_scan_size = 0" got in some other way.
    I don't know all that much about the buildfarm client code, and it's
    late.
    
Sorry, I just noticed this.  Mantid is my animal, so I can set "min_parallel_table_scan_size = 0"
on it if that would be helpful.  (Please reply directly if so; I'm not able to keep up with pgsql-hackers
right now.)

-- todd



On Tue, Aug 14, 2018 at 2:07 PM, Todd A. Cook <ToddA.Cook@synopsys.com> wrote:
> Sorry, I just noticed this.  Mantid is my animal, so I can set "min_parallel_table_scan_size = 0"
> on it if that would be helpful.  (Please reply directly if so; I'm not able to keep up with pgsql-hackers
> right now.)

We've already been able to rule that out as a factor here. Thanks all
the same, though.

-- 
Peter Geoghegan


Hi,

Tomas provided me with a machine where the problem was reproducible
(Thanks again!). I first started to make sure a54e1f158 is unrelated -
and indeed, the problem appears independently.

I've a few leads that I'm currently testing out. One observation I think
might be crucial is that the problem, in Tomas' testcase with just
VACUUM FULL of pg_class and INSERTs into another rel, appears to only
happen if autovacuum is enabled. The crashes are always immediately
after autovacuum did an analyze of the relation.

What I see on the apply side, I think, is that invalidation records for
contents of pg_class and pg_class itself are intermixed.

I'll continue looking into this.

Greetings,

Andres Freund


Hi Everyone,


Tom, I think this could use your eyes.


On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> I've a few leads that I'm currently testing out. One observation I think
> might be crucial is that the problem, in Tomas' testcase with just
> VACUUM FULL of pg_class and INSERTs into another rel, appears to only
> happen if autovacuum is enabled. The crashes are always immediately
> after autovacuum did an analyze of the relation.

I've run this with autovacuum disabled for quite a while, without the
problem re-occurring. It does however re-occur with pgbench running an
ANALYZE on the user defined table.


> What I see on the apply side, I think, is that invalidation records for
> contents of pg_class and pg_class itself are intermixed.

I've not 100% figured this out yet, and I need to switch to another task
for a bit, but what I think is happening here is that we have a
relatively fundamental issue with how rewrites of catalog tables are
handled.  A slightly instrumented version of postgres shows the
following:

2018-08-29 04:30:17.515 CEST [19551] LOG:  statement: INSERT INTO t(b,c) SELECT i,i FROM generate_series(1,59) s(i);
...
2018-08-29 04:30:17.559 CEST [19554] LOG:  statement: vacuum full pg_class;
2018-08-29 04:30:17.571 CEST [19557] LOG:  swapping mapped 1259/1281520 w 1281526/1281526
2018-08-29 04:30:17.571 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.572 CEST [19557] LOG:  reindexing 2662 to 1281529 on 1259/1281526
2018-08-29 04:30:17.572 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.576 CEST [19557] LOG:  reindexing 2663 to 1281530 on 1259/1281526
2018-08-29 04:30:17.576 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.578 CEST [19557] LOG:  reindexing 3455 to 1281531 on 1259/1281526
2018-08-29 04:30:17.578 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.582 CEST [19557] LOG:  releasing lock on 1259
2018-08-29 04:30:17.582 CEST [19557] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.582 CEST [19557] LOG:  duration: 87.389 ms
2018-08-29 04:30:17.582 CEST [19557] LOG:  statement: vacuum full pg_class;
2018-08-29 04:30:17.583 CEST [19563] LOG:  duration: 630.481 ms

2018-08-29 04:30:17.584 CEST [19563] LOG:  statement: analyze t;
2018-08-29 04:30:17.597 CEST [19555] LOG:  swapping mapped 1259/1281526 w 1281532/1281532
2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.597 CEST [19555] LOG:  reindexing 2662 to 1281535 on 1259/1281532
2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.599 CEST [19555] LOG:  reindexing 2663 to 1281536 on 1259/1281532
2018-08-29 04:30:17.599 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.601 CEST [19555] LOG:  reindexing 3455 to 1281537 on 1259/1281532
2018-08-29 04:30:17.601 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.605 CEST [19555] LOG:  releasing lock on 1259
2018-08-29 04:30:17.605 CEST [19555] STATEMENT:  vacuum full pg_class;
2018-08-29 04:30:17.605 CEST [19551] WARNING:  could not read block 3 in file "base/16384/1281529": read only 0 of 8192 bytes
(it's a warning because the process SIGSTOPs itself afterwards, for
easier debugging)

So what we can see is that a first VACUUM FULL of pg_class builds a new
pg_class_oid_index/2662 under relfilenode 1281529 (pg_class itself moving
to relfilenode 1281526).  Then a *second* VACUUM FULL reindexes 2662
again, this time to 1281535 (pg_class moving to 1281532).

But *after* the VACUUM FULL has finished and released the lock (that's
printed just before the UnlockRelationIdForSession in vacuum_rel()), we
can see that the INSERT from pid 19551 still accesses the OLD index,
from *before* the second VACUUM FULL.  Which at that point doesn't exist
anymore.

I think the reason for that is that an invalidation for the relmapper
changes due to the second VACUUM has been queued, but currently the
invalidation processing hasn't gotten to that entry yet. Instead
currently the entry for the user defined table is being rebuilt, which
then tries to access pg_class, but under its "old" relfilenode. Which
then fails.  The fact that index scans trigger this problem isn't
relevant - the heap entry for pg_class also refers to the old entry, as
I've verified using gdb.


I'm unclear on how we can fix this, as long as we rebuild cache entries
during invalidation, rather than during access. For most cache entries
it's not a problem to rebuild them relying on other outdated content,
we'll just rebuild them again later.  But for things like the relation
mapper that's not true, unfortunately.


Greetings,

Andres Freund


On 2018-08-28 19:56:58 -0700, Andres Freund wrote:
> Hi Everyone,
> 
> 
> Tom, I think this could use your eyes.
> 
> 
> On 2018-08-28 00:52:13 -0700, Andres Freund wrote:
> > I've a few leads that I'm currently testing out. One observation I think
> > might be crucial is that the problem, in Tomas' testcase with just
> > VACUUM FULL of pg_class and INSERTs into another rel, appears to only
> > happen if autovacuum is enabled. The crashes are always immediately
> > after autovacuum did an analyze of the relation.
> 
> I've run this with autovacuum disabled for quite a while, without the
> problem re-occurring. It does however re-occur with pgbench running an
> ANALYZE on the user defined table.
> 
> 
> > What I see on the apply side, I think, is that invalidation records for
> > contents of pg_class and pg_class itself are intermixed.
> 
> I've not 100% figured this out yet, and I need to switch to another task
> for a bit, but what I think is happening here is that we have a
> relatively fundamental issue with how rewrites of catalog tables are
> handled.  A slightly instrumented version of postgres shows the
> following:
> 
> 2018-08-29 04:30:17.515 CEST [19551] LOG:  statement: INSERT INTO t(b,c) SELECT i,i FROM generate_series(1,59) s(i);
> ...
> 2018-08-29 04:30:17.559 CEST [19554] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.571 CEST [19557] LOG:  swapping mapped 1259/1281520 w 1281526/1281526
> 2018-08-29 04:30:17.571 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.572 CEST [19557] LOG:  reindexing 2662 to 1281529 on 1259/1281526
> 2018-08-29 04:30:17.572 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.576 CEST [19557] LOG:  reindexing 2663 to 1281530 on 1259/1281526
> 2018-08-29 04:30:17.576 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.578 CEST [19557] LOG:  reindexing 3455 to 1281531 on 1259/1281526
> 2018-08-29 04:30:17.578 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.582 CEST [19557] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  duration: 87.389 ms
> 2018-08-29 04:30:17.582 CEST [19557] LOG:  statement: vacuum full pg_class;
> 2018-08-29 04:30:17.583 CEST [19563] LOG:  duration: 630.481 ms
> 
> 2018-08-29 04:30:17.584 CEST [19563] LOG:  statement: analyze t;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  swapping mapped 1259/1281526 w 1281532/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.597 CEST [19555] LOG:  reindexing 2662 to 1281535 on 1259/1281532
> 2018-08-29 04:30:17.597 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.599 CEST [19555] LOG:  reindexing 2663 to 1281536 on 1259/1281532
> 2018-08-29 04:30:17.599 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.601 CEST [19555] LOG:  reindexing 3455 to 1281537 on 1259/1281532
> 2018-08-29 04:30:17.601 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19555] LOG:  releasing lock on 1259
> 2018-08-29 04:30:17.605 CEST [19555] STATEMENT:  vacuum full pg_class;
> 2018-08-29 04:30:17.605 CEST [19551] WARNING:  could not read block 3 in file "base/16384/1281529": read only 0 of 8192 bytes
> (it's a warning because the process SIGSTOPs itself afterwards, for
> easier debugging)
> 
> So what we can see is that a first VACUUM FULL of pg_class builds a new
> pg_class_oid_index/2662 under relfilenode 1281529 (pg_class itself moving
> to relfilenode 1281526).  Then a *second* VACUUM FULL reindexes 2662
> again, this time to 1281535 (pg_class moving to 1281532).
> 
> But *after* the VACUUM FULL has finished and released the lock (that's
> printed just before the UnlockRelationIdForSession in vacuum_rel()), we
> can see that the INSERT from pid 19551 still accesses the OLD index,
> from *before* the second VACUUM FULL.  Which at that point doesn't exist
> anymore.
> 
> I think the reason for that is that an invalidation for the relmapper
> changes due to the second VACUUM has been queued, but currently the
> invalidation processing hasn't gotten to that entry yet. Instead
> currently the entry for the user defined table is being rebuilt, which
> then tries to access pg_class, but under its "old" relfilenode. Which
> then fails.  The fact that index scans trigger this problem isn't
> relevant - the heap entry for pg_class also refers to the old entry, as
> I've verified using gdb.
> 
> 
> I'm unclear on how we can fix this, as long as we rebuild cache entries
> during invalidation, rather than during access. For most cache entries
> it's not a problem to rebuild them relying on other outdated content,
> we'll just rebuild them again later.  But for things like the relation
> mapper that's not true, unfortunately.

Fwiw, if I use gdb to issue InvalidateSystemCaches(),
RelationInitPhysicalAddr(sysscan->irel) makes the relcache entry have
the correct relfilenode, and restarting the relcache invalidation (by
judicious use of 'ret') "fixes" the issue.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> Tom, I think this could use your eyes.

I've had no luck reproducing it locally ... do you have a recipe
for that?

            regards, tom lane


On 2018-08-28 23:18:25 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Tom, I think this could use your eyes.
> 
> I've had no luck reproducing it locally ... do you have a recipe
> for that?

I can reproduce it reliably with the three scripts attached:

psql -c' drop table if exists t; create table t (a serial primary key, b int, c int);' && pgbench -n -c 4 -T300 -f /tmp/insert.sql
pgbench -n -c 4 -T300 -f /tmp/vacuum.sql
pgbench -n -c 4 -T300 -f /tmp/analyze.sql

Locally that triggers the problem within usually a few seconds.

Greetings,

Andres Freund

Attachment

On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> Locally that triggers the problem within usually a few seconds.

FWIW, it does so including versions as old as 9.2.

Now I need to look for power for my laptop and some for me ;)


Andres Freund <andres@anarazel.de> writes:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
>> Locally that triggers the problem within usually a few seconds.

> FWIW, it does so including versions as old as 9.2.

Interesting.  One thing I'd like to know is why this only started
showing up in the buildfarm a few weeks ago.  Maybe that's just a
timing effect, but it seems odd.

> Now I need to look for power for my laptop and some for me ;)

Yeah, I'm not going to look harder till tomorrow either.

            regards, tom lane


On 2018-08-28 23:32:51 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> >> Locally that triggers the problem within usually a few seconds.
> 
> > FWIW, it does so including versions as old as 9.2.

9.0 as well, so it's definitely not some recently backpatched change.


> Interesting.  One thing I'd like to know is why this only started
> showing up in the buildfarm a few weeks ago.  Maybe that's just a
> timing effect, but it seems odd.

I suspect it's largely timing and invalidation traffic.  It's been
really hard to reproduce with core regression tests after all.  If my
theory is right - and I'm getting more and more certain - that we're
trying to access a remapped relation during invalidation, before reading
the relevant relmapper invalidation, that also makes sense: You need to
be unlucky enough that there's a relmapper invalidation *after* an
invalidation for a currently "open" relation (otherwise we'd not do
inval processing). Most of the time we'll likely just be "too fast" to
process all the invalidations (thereby not getting to the point that
pg_class has been remapped later in the queue).

I think it might just be related to some modified tests changing when
exactly autovacuum is triggered, which then in turn triggers the
invalidation.

Unfortunately the only "real" fix I can think of is to change the
relcache invalidation logic to only ever mark entries as 'invalid', and
not do any eager rebuilding of the entries.  That might be a good idea
for performance reasons anyway, but is quite scary to imagine doing in a
minor release.

Greetings,

Andres Freund


On 2018-08-28 20:29:08 -0700, Andres Freund wrote:
> On 2018-08-28 20:27:14 -0700, Andres Freund wrote:
> > Locally that triggers the problem within usually a few seconds.
> 
> FWIW, it does so including versions as old as 9.2.
> 
> Now I need to look for power for my laptop and some for me ;)

A bit of food, a coke and a talk later, here's a first draft *prototype*
of how this could be solved.

It's not OK to rebuild relcache entries in the middle of
ReceiveSharedInvalidMessages() - a later entry in the invalidation queue
might be relmapper invalidation, and thus immediately processing a
relcache invalidation might attempt to scan a relation that does not
exist anymore.

Instead, receiving a relcache invalidation now just queues an entry onto
a new "pending rebuilds" list, that is then processed in a second stage,
after emptying the entire sinval queue.  At that stage we're guaranteed
that the relmapper is up2date.

This might actually have performance benefits in some realistic
scenarios - while RelationFlushRelation() marks the relcache entry
invalid for each of the received entries,
ProcessPendingRelcacheRebuilds() can skip rebuilding if the entry has
since been rebuilt.
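
In rough outline, the approach looks something like the sketch below.
ProcessPendingRelcacheRebuilds() is the name used above; the queuing and
rebuild helpers here are placeholders, and this is only an illustration,
not the attached prototype:

static List *pending_relcache_rebuilds = NIL;	/* OIDs awaiting rebuild */

/* Stage 1: on receipt of a relcache inval, only mark and remember. */
static void
QueuePendingRelcacheRebuild(Relation relation)	/* hypothetical helper */
{
	relation->rd_isvalid = false;
	pending_relcache_rebuilds =
		lappend_oid(pending_relcache_rebuilds, RelationGetRelid(relation));
}

/*
 * Stage 2: run only after the sinval queue has been fully drained, so the
 * relmapper is known to be up to date before any pg_class access happens.
 */
void
ProcessPendingRelcacheRebuilds(void)
{
	List	   *todo = pending_relcache_rebuilds;
	ListCell   *lc;

	/* unlink first, so an error doesn't leave a half-processed queue */
	pending_relcache_rebuilds = NIL;

	foreach(lc, todo)
	{
		Oid			relid = lfirst_oid(lc);

		/* skip entries that a later access has already rebuilt */
		RebuildRelcacheEntryIfStillInvalid(relid);	/* hypothetical helper */
	}
	list_free(todo);
}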


Obviously this is far from clean enough, but what do you think about the
basic approach?  It does, in my limited testing, indeed solve the "could
not read block" issue.


Greetings,

Andres Freund

Attachment

Andres Freund <andres@anarazel.de> writes:
> It's not OK to rebuild relcache entries in the middle of
> ReceiveSharedInvalidMessages() - a later entry in the invalidation queue
> might be relmapper invalidation, and thus immediately processing a
> relcache invalidation might attempt to scan a relation that does not
> exist anymore.

Check, but the bug is actually more subtle than that.  I've reproduced the
problem using your scripts, changed the ERROR to PANIC, and studied the
stack traces of a few core dumps, and here's what I find is the scenario:

* Code tries to open some relation or other.  After acquiring lock, it
goes off to do AcceptInvalidationMessages (cf. LockRelationOid).  That
reads a relcache inval message for "t", and because "t"'s relcache entry
is already open, it decides to rebuild it immediately.

* Therefore it calls ScanPgRelation, which needs to do a search using
pg_class_oid_index, so it opens and locks that.  Again, having acquired
the lock in LockRelationOid, it goes off to do AcceptInvalidationMessages
(recursively).

* Now we read another relcache inval (for "t_pkey", in the cases I've
traced through in detail).  Again, that's open and we decide we'd better
do RelationReloadIndexInfo for it.

* We now recursively enter ScanPgRelation, which (again) needs to do a
search using pg_class_oid_index, so it (again) opens and locks that.
BUT: LockRelationOid sees that *this process already has share lock on
pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.

Had we done AcceptInvalidationMessages at this point, we'd have found
the relcache inval message that's waiting for us about pg_class_oid_index,
and we'd have processed it before attempting to read pg_class_oid_index,
and all would be well.  It's the missed AIM call due to the recursive
entry to ScanPgRelation that is causing us to fail to read all available
inval messages before believing that our info about pg_class_oid_index
is up to date.
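
For reference, the pre-patch shape of LockRelationOid() that step 4 hinges
on is the following (it matches the lines being removed in the patch
further down):

	LOCKTAG		tag;
	LockAcquireResult res;

	SetLocktagRelationOid(&tag, relid);

	res = LockAcquire(&tag, lockmode, false, false);

	/*
	 * In the recursive ScanPgRelation case described above, we already hold
	 * share lock on pg_class_oid_index, so res comes back as
	 * LOCKACQUIRE_ALREADY_HELD and the pending inval messages never get
	 * read before the index is used.
	 */
	if (res != LOCKACQUIRE_ALREADY_HELD)
		AcceptInvalidationMessages();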

> Instead, receiving a relcache invalidation now just queues an entry onto
> a new "pending rebuilds" list, that is then processed in a second stage,
> after emptying the entire sinval queue.  At that stage we're guaranteed
> that the relmapper is up2date.

More to the point, we're guaranteed that we've seen all inval messages
that were sent before we acquired whatever lock we just acquired.
(In the cases I've looked at, the local relmapper *is* up to date,
it's the relcache entry for pg_class_oid_index that is not.)

My first gut feeling after understanding the recursion aspect was that
what you're suggesting would not be enough to be bulletproof.  We will
still have the hazard that while we're rebuilding relcache entries,
we may end up doing recursive ScanPgRelation calls and the inner ones
will decide they needn't call AcceptInvalidationMessages.  So we might
well still be rebuilding stuff without having seen every available
inval message.  But it should be OK, because any inval messages we could
possibly need to worry about will have been scanned and queued by the
outer call.

We could perhaps fix this with a less invasive change than what you
suggest here, by attacking the missed-call-due-to-recursion aspect
rather than monkeying with how relcache rebuild itself works.  But I'm
thinking this way might be a good idea anyhow, in order to reduce the
depth of recursion that occurs in scenarios like this.

I've not read your patch yet, so no comment on specifics.

            regards, tom lane


I wrote:
> * We now recursively enter ScanPgRelation, which (again) needs to do a
> search using pg_class_oid_index, so it (again) opens and locks that.
> BUT: LockRelationOid sees that *this process already has share lock on
> pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.

BTW, I now have a theory for why we suddenly started seeing this problem
in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
there had been none before (in RelationReloadNailed, for non-index rels).
That didn't create the problem, but it probably increased the odds of
seeing it happen.

Also ... isn't the last "relation->rd_isvalid = true" in
RelationReloadNailed wrong?  If it got cleared during ScanPgRelation,
I do not think we want to believe that we got an up-to-date row.

            regards, tom lane


Andres Freund <andres@anarazel.de> writes:
> A bit of food, a coke and a talk later, here's a first draft *prototype*
> of how this could be solved. ...
> Obviously this is far from clean enough, but what do you think about the
> basic approach?  It does, in my limited testing, indeed solve the "could
> not read block" issue.

A couple thoughts after reading and reflecting for awhile:

1. I don't much like the pending_rebuilds list, mainly because of this
consideration: what happens if we hit an OOM error trying to add an entry
to that list?  As you've drafted the patch, we don't even mark the
relevant relcache entry rd_invalid before that fails, so that's surely
bad.  Now, I'm not sure how bulletproof relcache inval is in general
with respect to OOM failures, but let's not add more hazards.

2. I think we may need to address the same order-of-operations hazards
as RelationCacheInvalidate() worries about.  Alternatively, maybe we
could simplify that function by making it use the same
delayed-revalidation logic as we're going to develop for this.

3. I don't at all like the ProcessPendingRelcacheRebuilds call you added
to ProcessInvalidationMessages.  That's effectively assuming that the
"func" *must* be LocalExecuteInvalidationMessage and not anything else;
likewise, the lack of such a call inside ProcessInvalidationMessagesMulti
presumes that that one is never called to actually execute invalidations.
(While those are true statements, it's a horrible violation of modularity
for these two functions to know it.)  Probably better to put this into the
callers, which will know what the actual semantics are.

4. The call added to the middle of ReceiveSharedInvalidMessages doesn't
seem all that safe either; the problem is its relationship to the
"catchup" processing.  We are caught up at the moment we exit the loop,
but are we sure we still would be after doing assorted work for relcache
rebuild?  Swapping the order of the two steps might help, but then we
have to consider what happens if we error out from SICleanupQueue.

(In general, the hard part of all this stuff is being sure that sane
things happen if you error out part way through ...)

            regards, tom lane


Hi,

On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
> I wrote:
> > * We now recursively enter ScanPgRelation, which (again) needs to do a
> > search using pg_class_oid_index, so it (again) opens and locks that.
> > BUT: LockRelationOid sees that *this process already has share lock on
> > pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.
> 
> BTW, I now have a theory for why we suddenly started seeing this problem
> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
> there had been none before (in RelationReloadNailed, for non-index rels).
> That didn't create the problem, but it probably increased the odds of
> seeing it happen.

Yea.  Doesn't explain why it's only really visible on the BF in
11/master though :/


> Also ... isn't the last "relation->rd_isvalid = true" in
> RelationReloadNailed wrong?  If it got cleared during ScanPgRelation,
> I do not think we want to believe that we got an up-to-date row.

I don't really think so - note how a normal relcache inval essentially
does the same. RelationClearRelation() first marks the entry as invalid,
then goes and builds a new entry that's *not* hooked into the hashtable
(therefore doesn't receive new invals), and then moves the contents
over. That overwrites rd_isvalid to true, as that's guaranteed to be set
by RelationBuildDesc(). During the move no new invalidations are
accepted.   So this really is just behaving equivalently.

The harder question is why that's safe. I think I convinced myself that
it is a couple times over the years, but I don't think we've properly
documented it. As the header says:
 *        The following code contains many undocumented hacks.  Please be
 *        careful....

We definitely relied on RelationClearRelation() always returning a valid
record for a while, c.f. RelationIdGetRelation()'s rd_isvalid assertion,
and the lack of a loop in that function.

(There's no coffee in this hotel at 4am. Shame.)

Ah, yes.  This assumption is currently safe because the locking on
relations being looked up guarantees that there are no critical
changes to relcache entries while the entry is being rebuilt.

I think we'd also run into trouble with clobber cache recursively etc
without it.

Greetings,

Andres Freund


Hi,

On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
> A couple thoughts after reading and reflecting for awhile:

Thanks. This definitely is too complicated for a single brain :(


> 1. I don't much like the pending_rebuilds list, mainly because of this
> consideration: what happens if we hit an OOM error trying to add an entry
> to that list?  As you've drafted the patch, we don't even mark the
> relevant relcache entry rd_invalid before that fails, so that's surely
> bad.  Now, I'm not sure how bulletproof relcache inval is in general
> with respect to OOM failures, but let's not add more hazards.

Obviously you're right and we first need to mark the entry as invalid -
and I think that should also mostly [1] address the OOM issue. If we
error out there's no need to eagerly rebuild the entry during
invalidation - subsequent RelationIdGetRelation() calls will trigger the
necessary rebuild.

But it sounds like your concern might be larger?  I don't quite see how
we could get around without such a list?  I mean I think it'd be better if
it were an array, for efficiency, but that seems like a fairly minor
point?

[1] I'm kinda doubtful that's entirely bulletproof in case an OOM, or
    similar, error is caught with a savepoint, and then execution
    continues outside. That in turn might access the relcache entry
    without a RelationIdGetRelation call. But that doesn't seem new.


> 2. I think we may need to address the same order-of-operations hazards
> as RelationCacheInvalidate() worries about.  Alternatively, maybe we
> could simplify that function by making it use the same
> delayed-revalidation logic as we're going to develop for this.

Hm, just to make sure I understand correctly: You're thinking about
having to rebuild various nailed entries in a specific order [2]?  I'm
not quite sure I really see a problem here - in contrast to the current
RelationCacheInvalidate() logic, what I'm proposing marks the entries as
invalid in the first phase, so any later access is guaranteed to rebuild
the entry as necessary.

[2] Hm, I'm right now a bit confused as to why "Other nailed relations
go to the front of rebuildList" is a good idea. That means they might be
built while ClassOidIndexId isn't yet rebuilt?  Practically I don't
think it matters, because the lookup for RelationRelationId will rebuild
it, but I don't quite understand what that logic is precisely getting
at.


> 3. I don't at all like the ProcessPendingRelcacheRebuilds call you added
> to ProcessInvalidationMessages.  That's effectively assuming that the
> "func" *must* be LocalExecuteInvalidationMessage and not anything else;
> likewise, the lack of such a call inside ProcessInvalidationMessagesMulti
> presumes that that one is never called to actually execute invalidations.
> (While those are true statements, it's a horrible violation of modularity
> for these two functions to know it.)  Probably better to put this into the
> callers, which will know what the actual semantics are.

Yea, I mostly hacked this together quickly to have a proposal on the
table, I'm not happy with this as it stands.


> 4. The call added to the middle of ReceiveSharedInvalidMessages doesn't
> seem all that safe either; the problem is its relationship to the
> "catchup" processing.  We are caught up at the moment we exit the loop,
> but are we sure we still would be after doing assorted work for relcache
> rebuild?  Swapping the order of the two steps might help, but then we
> have to consider what happens if we error out from SICleanupQueue.

I'm not sure I understand the problem here. If there's new invalidation
processing inside the rebuilds, that'd mean there was another
ReceiveSharedInvalidMessages, which then would have triggered the "new"
ReceiveSharedInvalidMessages, to then process the new pending rebuilds.
That's why I made ProcessPendingRelcacheRebuilds() unlink the queue it
processes first.  I
think locking should prevent issues with entries that are currently
accessed during rebuild from being re-invalidated.

Did I misunderstand?

As the comment in ProcessPendingRelcacheRebuilds() notes, I'm concerned
what happens if we error out after unlinking the current pending
items. But I think, if we handle the memory leak with a PG_CATCH, that
should be ok, because after the error new accesses should trigger the
rebuilds.


> (In general, the hard part of all this stuff is being sure that sane
> things happen if you error out part way through ...)

Yea :/

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
>> BTW, I now have a theory for why we suddenly started seeing this problem
>> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
>> there had been none before (in RelationReloadNailed, for non-index rels).
>> That didn't create the problem, but it probably increased the odds of
>> seeing it happen.

> Yea.  Doesn't explain why it's only really visible on the BF in
> 11/master though :/

Probably we changed around the tests that run in parallel with vacuum.sql
enough during v11 to make it happen with noticeable probability.  I think
this is sufficient evidence that it's happening now in the back branches:

https://www.postgresql.org/message-id/20180829140149.GO23024@telsasoft.com

            regards, tom lane


Andres Freund <andres@anarazel.de> writes:
> On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
>> 2. I think we may need to address the same order-of-operations hazards
>> as RelationCacheInvalidate() worries about.  Alternatively, maybe we
>> could simplify that function by making it use the same
>> delayed-revalidation logic as we're going to develop for this.

> Hm, just to make sure I understand correctly: You're thinking about
> having to rebuild various nailed entries in a specific order [2]?

> [2] Hm, I'm right now a bit confused as to why "Other nailed relations
> go to the front of rebuildList" is a good idea.

I'm not recalling the reason right now either, but I'm pretty sure that
logic was added out of necessity ... it hasn't always been like that.

            regards, tom lane


I wrote:
> We could perhaps fix this with a less invasive change than what you
> suggest here, by attacking the missed-call-due-to-recursion aspect
> rather than monkeying with how relcache rebuild itself works.

Seeing that rearranging the relcache rebuild logic is looking less than
trivial, I thought it'd be a good idea to investigate this alternative
approach.

Essentially, the problem here is that lmgr.c supposes that
AcceptInvalidationMessages is an atomic operation, which it most
certainly isn't.  In reality, it's unsafe to suppose that we can skip
reading incoming inval messages until we have *completed*
AcceptInvalidationMessages, not just invoked it.

However, we can fix that very easily, by retaining one additional bit
of state in each LOCALLOCK entry, which records whether we know we have
completed AcceptInvalidationMessages at least once subsequent to obtaining
that lock.  In the attached draft patch, I refer to that state as having
"cleared" the lock, but I'm not super pleased with that terminology ---
anybody have a better idea?

A small problem with the lock.c API as it stands is that we'd have to do
an additional hashtable lookup to re-find the relevant LOCALLOCK entry.
In the attached, I fixed that by extending LockAcquireExtended's API
to allow the caller to obtain a pointer to the LOCALLOCK entry, making
it trivially cheap to set the flag after we finish
AcceptInvalidationMessages.  LockAcquireExtended has only one external
caller right now, which makes me think it's OK to change its API rather
than introduce YA variant entry point, but that could be argued.

There are two ways we could adjust the return value from
LockAcquire(Extended) to cope with this requirement.  What I did here
was to add an additional LockAcquireResult code, so that "already held"
can be distinguished from "already held and cleared".  But we don't
really need to make that distinction, at least not in the core code.
So another way would be to redefine LOCKACQUIRE_ALREADY_HELD as meaning
"held and cleared", and then just return LOCKACQUIRE_OK if you haven't
called MarkLockClear for the lock.  I'm not entirely sure which way is
less likely to break any third-party code that might be inspecting the
result of LockAcquire.  You could probably argue it either way depending
on what you think the third-party code is doing with the result.
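
The lock.h/lock.c side of this, which the excerpt below doesn't reach, is
essentially just a new flag on LOCALLOCK plus a trivial setter -- sketched
here from the MarkLockClear() and LOCKACQUIRE_ALREADY_CLEAR calls visible
in the diff; the field name is a guess:

/* new field in struct LOCALLOCK (lock.h) */
	bool		lockCleared;	/* we read all sinval msgs after acquiring */

/* new entry point in lock.c, called once AcceptInvalidationMessages is done */
void
MarkLockClear(LOCALLOCK *locallock)
{
	Assert(locallock->nLocks > 0);
	locallock->lockCleared = true;
}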

Anyway, the attached appears to fix the problem: Andres' test script
fails in fractions of a second with HEAD on my workstation, but it's
run error-free for quite awhile now with this patch.  And this is sure
a lot simpler than any relcache rebuild refactoring we're likely to come
up with.

Discuss.

            regards, tom lane

PS: this also fixes what seems to me to be a buglet in the fast-path
locks stuff: there are failure paths out of LockAcquire that don't
remove the failed LOCALLOCK entry.  Not any more.

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 2e07702..e924c96 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -661,7 +661,8 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)

     SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);

-    LockAcquireExtended(&locktag, AccessExclusiveLock, true, false, false);
+    LockAcquireExtended(&locktag, AccessExclusiveLock, true, false,
+                        false, NULL);
 }

 static void
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 7b2dcb6..dc0a439 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -105,11 +105,12 @@ void
 LockRelationOid(Oid relid, LOCKMODE lockmode)
 {
     LOCKTAG        tag;
+    LOCALLOCK  *locallock;
     LockAcquireResult res;

     SetLocktagRelationOid(&tag, relid);

-    res = LockAcquire(&tag, lockmode, false, false);
+    res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

     /*
      * Now that we have the lock, check for invalidation messages, so that we
@@ -120,9 +121,18 @@ LockRelationOid(Oid relid, LOCKMODE lockmode)
      * relcache entry in an undesirable way.  (In the case where our own xact
      * modifies the rel, the relcache update happens via
      * CommandCounterIncrement, not here.)
+     *
+     * However, in corner cases where code acts on tables (usually catalogs)
+     * recursively, we might get here while still processing invalidation
+     * messages in some outer execution of this function or a sibling.  The
+     * "cleared" status of the lock tells us whether we really are done
+     * absorbing relevant inval messages.
      */
-    if (res != LOCKACQUIRE_ALREADY_HELD)
+    if (res != LOCKACQUIRE_ALREADY_CLEAR)
+    {
         AcceptInvalidationMessages();
+        MarkLockClear(locallock);
+    }
 }

 /*
@@ -138,11 +148,12 @@ bool
 ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
 {
     LOCKTAG        tag;
+    LOCALLOCK  *locallock;
     LockAcquireResult res;

     SetLocktagRelationOid(&tag, relid);

-    res = LockAcquire(&tag, lockmode, false, true);
+    res = LockAcquireExtended(&tag, lockmode, false, true, true, &locallock);

     if (res == LOCKACQUIRE_NOT_AVAIL)
         return false;
@@ -151,8 +162,11 @@ ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode)
      * Now that we have the lock, check for invalidation messages; see notes
      * in LockRelationOid.
      */
-    if (res != LOCKACQUIRE_ALREADY_HELD)
+    if (res != LOCKACQUIRE_ALREADY_CLEAR)
+    {
         AcceptInvalidationMessages();
+        MarkLockClear(locallock);
+    }

     return true;
 }
@@ -199,20 +213,24 @@ void
 LockRelation(Relation relation, LOCKMODE lockmode)
 {
     LOCKTAG        tag;
+    LOCALLOCK  *locallock;
     LockAcquireResult res;

     SET_LOCKTAG_RELATION(tag,
                          relation->rd_lockInfo.lockRelId.dbId,
                          relation->rd_lockInfo.lockRelId.relId);

-    res = LockAcquire(&tag, lockmode, false, false);
+    res = LockAcquireExtended(&tag, lockmode, false, false, true, &locallock);

     /*
      * Now that we have the lock, check for invalidation messages; see notes
      * in LockRelationOid.
      */
-    if (res != LOCKACQUIRE_ALREADY_HELD)
+    if (res != LOCKACQUIRE_ALREADY_CLEAR)
+    {
         AcceptInvalidationMessages();
+        MarkLockClear(locallock);
+    }
 }

 /*
@@ -226,13 +244,14 @@ bool
 ConditionalLockRelation(Relation relation, LOCKMODE lockmode)
 {
     LOCKTAG        tag;
+    LOCALLOCK  *locallock;
     LockAcquireResult res;

     SET_LOCKTAG_RELATION(tag,
                          relation->rd_lockInfo.lockRelId.dbId,
                          relation->rd_lockInfo.lockRelId.relId);

-    res = LockAcquire(&tag, lockmode, false, true);
+    res = LockAcquireExtended(&tag, lockmode, false, true, true, &locallock);

     if (res == LOCKACQUIRE_NOT_AVAIL)
         return false;
@@ -241,8 +260,11 @@ ConditionalLockRelation(Relation relation, LOCKMODE lockmode)
      * Now that we have the lock, check for invalidation messages; see notes
      * in LockRelationOid.
      */
-    if (res != LOCKACQUIRE_ALREADY_HELD)
+    if (res != LOCKACQUIRE_ALREADY_CLEAR)
+    {
         AcceptInvalidationMessages();
+        MarkLockClear(locallock);
+    }

     return true;
 }
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index dc3d8d9..58d1f32 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -669,6 +669,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
  *        LOCKACQUIRE_NOT_AVAIL        lock not available, and dontWait=true
  *        LOCKACQUIRE_OK                lock successfully acquired
  *        LOCKACQUIRE_ALREADY_HELD    incremented count for lock already held
+ *        LOCKACQUIRE_ALREADY_CLEAR    incremented count for lock already clear
  *
  * In the normal case where dontWait=false and the caller doesn't need to
  * distinguish a freshly acquired lock from one already taken earlier in
@@ -685,7 +686,8 @@ LockAcquire(const LOCKTAG *locktag,
             bool sessionLock,
             bool dontWait)
 {
-    return LockAcquireExtended(locktag, lockmode, sessionLock, dontWait, true);
+    return LockAcquireExtended(locktag, lockmode, sessionLock, dontWait,
+                               true, NULL);
 }

 /*
@@ -696,13 +698,17 @@ LockAcquire(const LOCKTAG *locktag,
  * caller to note that the lock table is full and then begin taking
  * extreme action to reduce the number of other lock holders before
  * retrying the action.
+ *
+ * If locallockp isn't NULL, *locallockp receives a pointer to the LOCALLOCK
+ * table entry if a lock is successfully acquired, or NULL if not.
  */
 LockAcquireResult
 LockAcquireExtended(const LOCKTAG *locktag,
                     LOCKMODE lockmode,
                     bool sessionLock,
                     bool dontWait,
-                    bool reportMemoryError)
+                    bool reportMemoryError,
+                    LOCALLOCK **locallockp)
 {
     LOCKMETHODID lockmethodid = locktag->locktag_lockmethodid;
     LockMethod    lockMethodTable;
@@ -766,9 +772,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
         locallock->proclock = NULL;
         locallock->hashcode = LockTagHashCode(&(localtag.lock));
         locallock->nLocks = 0;
+        locallock->holdsStrongLockCount = false;
+        locallock->lockCleared = false;
         locallock->numLockOwners = 0;
         locallock->maxLockOwners = 8;
-        locallock->holdsStrongLockCount = false;
         locallock->lockOwners = NULL;    /* in case next line fails */
         locallock->lockOwners = (LOCALLOCKOWNER *)
             MemoryContextAlloc(TopMemoryContext,
@@ -789,13 +796,22 @@ LockAcquireExtended(const LOCKTAG *locktag,
     }
     hashcode = locallock->hashcode;

+    if (locallockp)
+        *locallockp = locallock;
+
     /*
      * If we already hold the lock, we can just increase the count locally.
+     *
+     * If lockCleared is already set, caller need not worry about absorbing
+     * sinval messages related to the lock's object.
      */
     if (locallock->nLocks > 0)
     {
         GrantLockLocal(locallock, owner);
-        return LOCKACQUIRE_ALREADY_HELD;
+        if (locallock->lockCleared)
+            return LOCKACQUIRE_ALREADY_CLEAR;
+        else
+            return LOCKACQUIRE_ALREADY_HELD;
     }

     /*
@@ -877,6 +893,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
                                            hashcode))
         {
             AbortStrongLockAcquire();
+            if (locallock->nLocks == 0)
+                RemoveLocalLock(locallock);
+            if (locallockp)
+                *locallockp = NULL;
             if (reportMemoryError)
                 ereport(ERROR,
                         (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -911,6 +931,10 @@ LockAcquireExtended(const LOCKTAG *locktag,
     {
         AbortStrongLockAcquire();
         LWLockRelease(partitionLock);
+        if (locallock->nLocks == 0)
+            RemoveLocalLock(locallock);
+        if (locallockp)
+            *locallockp = NULL;
         if (reportMemoryError)
             ereport(ERROR,
                     (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -976,6 +1000,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
             LWLockRelease(partitionLock);
             if (locallock->nLocks == 0)
                 RemoveLocalLock(locallock);
+            if (locallockp)
+                *locallockp = NULL;
             return LOCKACQUIRE_NOT_AVAIL;
         }

@@ -1646,6 +1672,20 @@ GrantAwaitedLock(void)
 }

 /*
+ * MarkLockClear -- mark an acquired lock as "clear"
+ *
+ * This means that we know we have absorbed all sinval messages that other
+ * sessions generated before we acquired this lock, and so we can confidently
+ * assume we know about any catalog changes protected by this lock.
+ */
+void
+MarkLockClear(LOCALLOCK *locallock)
+{
+    Assert(locallock->nLocks > 0);
+    locallock->lockCleared = true;
+}
+
+/*
  * WaitOnLock -- wait to acquire a lock
  *
  * Caller must have set MyProc->heldLocks to reflect locks already held
@@ -1909,6 +1949,15 @@ LockRelease(const LOCKTAG *locktag, LOCKMODE lockmode, bool sessionLock)
     if (locallock->nLocks > 0)
         return true;

+    /*
+     * At this point we can no longer suppose we are clear of invalidation
+     * messages related to this lock.  Although we'll delete the LOCALLOCK
+     * object before any intentional return from this routine, it seems worth
+     * the trouble to explicitly reset lockCleared right now, just in case
+     * some error prevents us from deleting the LOCALLOCK.
+     */
+    locallock->lockCleared = false;
+
     /* Attempt fast release of any lock eligible for the fast path. */
     if (EligibleForRelationFastPath(locktag, lockmode) &&
         FastPathLocalUseCount > 0)
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 777da71..ff4df7f 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -408,9 +408,10 @@ typedef struct LOCALLOCK
     PROCLOCK   *proclock;        /* associated PROCLOCK object, if any */
     uint32        hashcode;        /* copy of LOCKTAG's hash value */
     int64        nLocks;            /* total number of times lock is held */
+    bool        holdsStrongLockCount;    /* bumped FastPathStrongRelationLocks */
+    bool        lockCleared;    /* we read all sinval msgs for lock */
     int            numLockOwners;    /* # of relevant ResourceOwners */
     int            maxLockOwners;    /* allocated size of array */
-    bool        holdsStrongLockCount;    /* bumped FastPathStrongRelationLocks */
     LOCALLOCKOWNER *lockOwners; /* dynamically resizable array */
 } LOCALLOCK;

@@ -472,7 +473,8 @@ typedef enum
 {
     LOCKACQUIRE_NOT_AVAIL,        /* lock not available, and dontWait=true */
     LOCKACQUIRE_OK,                /* lock successfully acquired */
-    LOCKACQUIRE_ALREADY_HELD    /* incremented count for lock already held */
+    LOCKACQUIRE_ALREADY_HELD,    /* incremented count for lock already held */
+    LOCKACQUIRE_ALREADY_CLEAR    /* incremented count for lock already clear */
 } LockAcquireResult;

 /* Deadlock states identified by DeadLockCheck() */
@@ -528,8 +530,10 @@ extern LockAcquireResult LockAcquireExtended(const LOCKTAG *locktag,
                     LOCKMODE lockmode,
                     bool sessionLock,
                     bool dontWait,
-                    bool report_memory_error);
+                    bool reportMemoryError,
+                    LOCALLOCK **locallockp);
 extern void AbortStrongLockAcquire(void);
+extern void MarkLockClear(LOCALLOCK *locallock);
 extern bool LockRelease(const LOCKTAG *locktag,
             LOCKMODE lockmode, bool sessionLock);
 extern void LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks);

On 2018-08-29 17:58:19 -0400, Tom Lane wrote:
> I wrote:
> > We could perhaps fix this with a less invasive change than what you
> > suggest here, by attacking the missed-call-due-to-recursion aspect
> > rather than monkeying with how relcache rebuild itself works.
> 
> Seeing that rearranging the relcache rebuild logic is looking less than
> trivial, I thought it'd be a good idea to investigate this alternative
> approach.
> 
> Essentially, the problem here is that lmgr.c supposes that
> AcceptInvalidationMessages is an atomic operation, which it most
> certainly isn't.  In reality, it's unsafe to suppose that we can skip
> reading incoming inval messages until we have *completed*
> AcceptInvalidationMessages, not just invoked it.
> 
> However, we can fix that very easily, by retaining one additional bit
> of state in each LOCALLOCK entry, which records whether we know we have
> completed AcceptInvalidationMessages at least once subsequent to obtaining
> that lock.  In the attached draft patch, I refer to that state as having
> "cleared" the lock, but I'm not super pleased with that terminology ---
> anybody have a better idea?
> 
> A small problem with the lock.c API as it stands is that we'd have to do
> an additional hashtable lookup to re-find the relevant LOCALLOCK entry.
> In the attached, I fixed that by extending LockAcquireExtended's API
> to allow the caller to obtain a pointer to the LOCALLOCK entry, making
> it trivially cheap to set the flag after we finish
> AcceptInvalidationMessages.  LockAcquireExtended has only one external
> caller right now, which makes me think it's OK to change its API rather
> than introduce YA variant entry point, but that could be argued.
> 
> There are two ways we could adjust the return value from
> LockAcquire(Extended) to cope with this requirement.  What I did here
> was to add an additional LockAcquireResult code, so that "already held"
> can be distinguished from "already held and cleared".  But we don't
> really need to make that distinction, at least not in the core code.
> So another way would be to redefine LOCKACQUIRE_ALREADY_HELD as meaning
> "held and cleared", and then just return LOCKACQUIRE_OK if you haven't
> called MarkLockClear for the lock.  I'm not entirely sure which way is
> less likely to break any third-party code that might be inspecting the
> result of LockAcquire.  You could probably argue it either way depending
> on what you think the third-party code is doing with the result.
> 
> Anyway, the attached appears to fix the problem: Andres' test script
> fails in fractions of a second with HEAD on my workstation, but it's
> run error-free for quite awhile now with this patch.  And this is sure
> a lot simpler than any relcache rebuild refactoring we're likely to come
> up with.
> 
> Discuss.

This is a neat idea.

I have one intuitive correctness concern that I can't quite nail down;
I plan to spend more time on those aspects (partly around previously
held locks and invalidations generated with non-conflicting lock modes).

Leaving that aside, I think there's one architectural aspect of my
approach that I prefer over yours: Deduplicating eager cache rebuilds
like my approach does seems quite advantageous.  There are a lot of
cases where we end up sending out many invalidations that hit the same
cache entries - because we do not defer rebuilds, that leads to the same
entries being rebuilt over and over.  My approach avoids the overhead
associated with that, as long as the invalidations are queued close
enough together (rough sketch below).

We could obviously just do both, but that seems unnecessary.
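
In very rough terms, the idea looks something like this (an illustrative
sketch only, with made-up names):

    /* sketch: collect affected relations, then rebuild each entry once */
    static List *pendingRelcacheRebuilds = NIL;    /* OIDs needing a rebuild */

    static void
    NoteRelcacheRebuildNeeded(Oid relid)
    {
        /* queueing the same OID repeatedly adds no extra work later */
        pendingRelcacheRebuilds =
            list_append_unique_oid(pendingRelcacheRebuilds, relid);
    }

    static void
    PerformPendingRelcacheRebuilds(void)
    {
        /* run once the local sinval queue has been drained */
        while (pendingRelcacheRebuilds != NIL)
        {
            Oid        relid = linitial_oid(pendingRelcacheRebuilds);

            pendingRelcacheRebuilds = list_delete_first(pendingRelcacheRebuilds);
            /* flag the entry a single time; it gets rebuilt once afterwards */
            RelationCacheInvalidateEntry(relid);
        }
    }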

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> Leaving that aside, I think there's one architectural aspect of my
> approach that I prefer over yours: Deduplicating eager cache rebuilds
> like my approach does seems quite advantageous.

That is attractive, for sure, but the other side of the coin is that
getting there seems to require a lot of ticklish redesign.  We would
certainly not consider back-patching such a change normally, and I'm
unconvinced that we should do so in this case.

My thought is to do (and back-patch) my change, and then work on yours
as a performance improvement for HEAD only.  I don't believe that yours
would make mine redundant, either --- they are good complementary changes
to make real sure we have no remaining bugs of this ilk.  (In particular,
no matter how much de-duplication we do, we'll still have scenarios with
recursive cache flushes; so I'm not quite convinced that your solution
provides a 100% fix by itself.)

            regards, tom lane


Hi,

On 2018-08-31 19:53:43 -0400, Tom Lane wrote:
> My thought is to do (and back-patch) my change, and then work on yours
> as a performance improvement for HEAD only.

That does make sense.

> I don't believe that yours would make mine redundant, either --- they
> are good complementary changes to make real sure we have no remaining
> bugs of this ilk.  (In particular, no matter how much de-duplication
> we do, we'll still have scenarios with recursive cache flushes; so I'm
> not quite convinced that your solution provides a 100% fix by itself.)

Yeah, this is a fair concern.

One concern I have with your approach is that it isn't particularly
bullet-proof for cases where the rebuild is triggered by something that
doesn't hold a conflicting lock.  If there's an independent cause to
reload - something that very commonly happens - we might not necessarily
re-trigger the recursive reloads.  That *should* be safe, since such
changes are hopefully "harmless", but it's a bit disconcerting
nonetheless.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> One concern I have with your approach is that it isn't particularly
> bullet-proof for cases where the rebuild is triggered by something that
> doesn't hold a conflicting lock.

Wouldn't that be a bug in the something-else?  The entire relation cache
system is based on the assumptions that (a) if you hold lock, you can read
a consistent and valid set of information about the rel from the catalogs,
and (b) anyone changing that info must hold a conflicting lock and send an
SINVAL message *before* releasing said lock.  I'm not prepared to consider
a redesign of those assumptions, especially not for back-patching.
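
Spelled out, the contract on the writer's side looks roughly like this
(an illustrative sketch; the actual catalog update is elided):

    #include "postgres.h"

    #include "storage/lmgr.h"
    #include "storage/lockdefs.h"
    #include "utils/inval.h"
    #include "utils/rel.h"

    /*
     * Sketch of assumption (b): whoever changes a relation's catalog
     * entries takes a conflicting lock first, and registers a relcache
     * inval that is broadcast before that lock is released at end of
     * transaction.
     */
    static void
    alter_relation_catalog_entries(Relation rel)
    {
        /* conflicts with any reader relying on assumption (a) */
        LockRelationOid(RelationGetRelid(rel), AccessExclusiveLock);

        /* ... update the rel's pg_class / pg_attribute rows here ... */

        /* queued now, sent out before the lock is dropped at commit */
        CacheInvalidateRelcache(rel);
    }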

            regards, tom lane


I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> One concern I have with your approach is that it isn't particularly
>> bullet-proof for cases where the rebuild is triggered by something that
>> doesn't hold a conflicting lock.

> Wouldn't that be a bug in the something-else?

So where are we on this?  Should I proceed with my patch, or are we
going to do further investigation?  Does anyone want to do an actual
patch review?

I think it's important to have some fix in place in time for the next
11beta release, so time grows short ...

            regards, tom lane


I wrote:
> So where are we on this?  Should I proceed with my patch, or are we
> going to do further investigation?  Does anyone want to do an actual
> patch review?

[ crickets... ]

So I took that as license to proceed, but while doing a final round of
testing I found out that a CLOBBER_CACHE_RECURSIVELY build fails,
because now that's an infinite recursion.  On reflection it's a bit
surprising that it wasn't so all along.  What I'm inclined to do about
it is to adjust AcceptInvalidationMessages so that there's a finite
recursion depth limit in the CLOBBER_CACHE_RECURSIVELY case, as there
already is in the CLOBBER_CACHE_ALWAYS case.  Maybe 3 or so levels
would be enough.
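
Concretely, I'm thinking of something like this inside
AcceptInvalidationMessages (sketch), mirroring the way the
CLOBBER_CACHE_ALWAYS branch already guards against recursion:

    #ifdef CLOBBER_CACHE_RECURSIVELY
        {
            static int    recursion_depth = 0;

            /* stop recursing after a few levels instead of going forever */
            if (recursion_depth < 3)
            {
                recursion_depth++;
                InvalidateSystemCaches();
                recursion_depth--;
            }
        }
    #endif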

            regards, tom lane


On 2018-09-06 17:38:38 -0400, Tom Lane wrote:
> I wrote:
> > So where are we on this?  Should I proceed with my patch, or are we
> > going to do further investigation?  Does anyone want to do an actual
> > patch review?
> 
> [ crickets... ]

Sorry, I've been a bit busy with Postgres Open, and with a few people
being in town because of that.

Could you attach the current version of the patch, or were there no
meaningful changes?

> So I took that as license to proceed, but while doing a final round of
> testing I found out that a CLOBBER_CACHE_RECURSIVELY build fails,
> because now that's an infinite recursion.  On reflection it's a bit
> surprising that it wasn't so all along.  What I'm inclined to do about
> it is to adjust AcceptInvalidationMessages so that there's a finite
> recursion depth limit in the CLOBBER_CACHE_RECURSIVELY case, as there
> already is in the CLOBBER_CACHE_ALWAYS case.  Maybe 3 or so levels
> would be enough.

Hm, but wouldn't that pretty much mean that the risk exposed in this
thread would still be present?  The reason your approach fixes things is
that, if we're processing invalidation event N, which depends on
invalidation changes at N+y having been processed, the recursion leads
to N+y being processed before the invalidation processing of N finishes.
Now you can argue that your restriction would only apply to
CLOBBER_CACHE_RECURSIVELY, and thus not to normal builds, but that still
makes me uncomfortable.

Greetings,

Andres Freund


Andres Freund <andres@anarazel.de> writes:
> Could you attach the current version of the patch, or were there no
> meaningful changes?

No changes.

>> So I took that as license to proceed, but while doing a final round of
>> testing I found out that a CLOBBER_CACHE_RECURSIVELY build fails,
>> because now that's an infinite recursion.  On reflection it's a bit
>> surprising that it wasn't so all along.  What I'm inclined to do about
>> it is to adjust AcceptInvalidationMessages so that there's a finite
>> recursion depth limit in the CLOBBER_CACHE_RECURSIVELY case, as there
>> already is in the CLOBBER_CACHE_ALWAYS case.  Maybe 3 or so levels
>> would be enough.

> Hm, but wouldn't that pretty much mean that the risk exposed in this
> thread would still be present?  The reason your approach fixes things is
> that, if we're processing invalidation event N, which depends on
> invalidation changes at N+y having been processed, the recursion leads
> to N+y being processed before the invalidation processing of N finishes.

Right, which makes the maximum recursion depth equal to basically whatever
you think the "backlog" of relevant catalog inval messages could be.  It's
finite, because we do after all have lock on one more catalog at each
level, and eventually we'd hold lock on every system catalog or index
that could be referenced inside AcceptInvalidationMessages.  In practice
I doubt the worst-case nesting level is more than two or three, but
it's not clear how to establish just what it is.

CLOBBER_CACHE_RECURSIVELY breaks this because it recurses whether or not
any relevant inval activity occurred.  I think that's good for verifying
that recursion of AcceptInvalidationMessages works at all, but it's
not relevant to whether infinite recursion could happen without that.

            regards, tom lane


I went ahead and pushed this, since the window for getting buildfarm
testing done before Monday's wrap is closing fast.  We can always
improve on it later, but I think beta3 ought to carry some fix
for the problem.

            regards, tom lane