Thread: BUG #15350: Getting invalid cache ID: 11 Errors

BUG #15350: Getting invalid cache ID: 11 Errors

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15350
Logged by:          Kieran McCusker
Email address:      kieran.mccusker@gmail.com
PostgreSQL version: 10.5
Operating system:   Fedora 28
Description:

Hi

We're using 10.5 with parallel queries enabled and the config options
#max_worker_processes = 8
#max_parallel_workers_per_gather = 2
#max_parallel_workers = 8

I'm seeing invalid cache ID: 11 errors in the log. It's only happening
occasionally (15 times today on a not very busy system). I've been following
the thread about making windowing functions et al parallel restricted - Is
this another manifestation of the problem? 

On the first query setting set session max_parallel_workers to 0 more than
doubles the execution time from 208ms to 580ms so that doesn't seem
desirable, and it's not obvious to me why the second would have failed.

I'm probably being dense and if so apologies for wasting your time. I also
don't see how to create a test case sorry.

Many thanks

Kieran

A couple of example queries are :-
-----------------------------------------------------------------------------------------------
with limits as (
                      select min(contact_date), max(contact_date)
                      from q366midl.q366m1
                      where contact_date is not null 
                    )
                    select 
                    to_char(min,'YYYY-MM'),
                    'text' as format,
                    to_char(min,substr('YYYY-MM',1,7)) as code,
--
                    from limits
                    union all
                    select
                    to_char(max,'YYYY-MM'),
                    'text' as format,
                    to_char(max,substr('YYYY-MM',1,7)) as code,
                    to_char(max,'Mon YYYY') as text 
                    from limits
                    order by 1

2018-08-24 13:50:28.325 BST [68794] ERROR:  invalid cache ID: 11
2018-08-24 13:50:28.325 BST [68844] LOG:  worker process: parallel worker
for PID 67216 (PID 68793) exited with exit code 1
2018-08-24 13:50:28.327 BST [68844] LOG:  worker process: parallel worker
for PID 67216 (PID 68794) exited with exit code 1


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2018-08-23 11:04:23.020 BST [80148] ERROR:  invalid cache ID: 11
2018-08-23 11:04:23.020 BST [80148] CONTEXT:  parallel worker
2018-08-23 11:04:23.020 BST [80148] STATEMENT:  SELECT "survey"."calls".*
FROM "survey"."calls" WHERE ((survey_state is null or survey_state not in
(500,501)) and phone_numbers @> array['07854 xxxxxx']) ORDER BY created_at
desc
2018-08-23 11:04:23.021 BST [81154] ERROR:  invalid cache ID: 11
2018-08-23 11:04:23.023 BST [68844] LOG:  worker process: parallel worker
for PID 80148 (PID 81153) exited with exit code 1
2018-08-23 11:04:23.023 BST [68844] LOG:  worker process: parallel worker
for PID 80148 (PID 81154) exited with exit code 1


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Tom Lane
Date:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> We're using 10.5 with parallel queries enabled and the config options
> #max_worker_processes = 8
> #max_parallel_workers_per_gather = 2
> #max_parallel_workers = 8

> I'm seeing invalid cache ID: 11 errors in the log. It's only happening
> occasionally (15 times today on a not very busy system).

Interesting.  Syscache 11 would be AUTHOID, which seems to be consulted
mostly for privilege checks, though there's at least one reference
during process startup.

> I've been following
> the thread about making windowing functions et al parallel restricted - Is
> this another manifestation of the problem? 

No, I don't think it's that.  We've seen a couple of reports of similar
failures referencing cache 42, but this is a new symptom.

> I also don't see how to create a test case sorry.

Don't suppose you'd be prepared to recompile PG to turn that ERROR into
a PANIC, and then capture a stack trace from the core dump next time
it happens?

            regards, tom lane


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Sat, Aug 25, 2018 at 3:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> =?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> > We're using 10.5 with parallel queries enabled and the config options
> > #max_worker_processes = 8
> > #max_parallel_workers_per_gather = 2
> > #max_parallel_workers = 8
>
> > I'm seeing invalid cache ID: 11 errors in the log. It's only happening
> > occasionally (15 times today on a not very busy system).
>
> Interesting.  Syscache 11 would be AUTHOID, which seems to be consulted
> mostly for privilege checks, though there's at least one reference
> during process startup.

Hi Kieran,

Are you using extensions, by any chance?  If an extension were to
access the AUTHOID syscache during _PG_init(), it would fail like this
in parallel workers, because they run RestoreLibraryState() before
they run BackgroundWorkerInitializeConnectionByOid() (which runs
InitPostgres() which runs InitCatalogCache()).  Oracle_fdw has this
problem (see nearby thread) and there may be others out there.  The
extension wouldn't have to be used by the query that exhibited the
symptom... it could have been loaded earlier in the life of the leader
backend but caused no problem until eventually a parallel query was
launched.

We could probably improve that situation by making syscache lookups
(and probably other things too) fail when called from _PG_init() in
regular backends so that extension authors are made aware of this
hazard, or perhaps go the other way and change the order we do things
in parallel workers.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> We could probably improve that situation by making syscache lookups
> (and probably other things too) fail when called from _PG_init() in
> regular backends so that extension authors are made aware of this
> hazard, or perhaps go the other way and change the order we do things
> in parallel workers.

Hmm.  There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either.  So I'm not sure that your
first option is really tenable.  Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

            regards, tom lane


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Kieran McCusker
Date:
Yes I'm using extensions - A bunch of them but I'm guessing it's likely ogr_fdw that will be causing the issue as I've only seen this on days when someone has imported data using ogr_fdw.


On Tue, 28 Aug 2018, 03:40 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> We could probably improve that situation by making syscache lookups
> (and probably other things too) fail when called from _PG_init() in
> regular backends so that extension authors are made aware of this
> hazard, or perhaps go the other way and change the order we do things
> in parallel workers.

Hmm.  There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.

On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.

And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either.  So I'm not sure that your
first option is really tenable.  Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.

                        regards, tom lane

Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Wed, Aug 29, 2018 at 2:31 AM Kieran McCusker
<kieran.mccusker@gmail.com> wrote:
> Yes I'm using extensions - A bunch of them but I'm guessing it's likely ogr_fdw that will be causing the issue as
I'veonly seen this on days when someone has imported data using ogr_fdw.
 

Bingo:

https://github.com/pramsey/pgsql-ogr-fdw/blob/v1.0.6/ogr_fdw.c#L195

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > We could probably improve that situation by making syscache lookups
> > (and probably other things too) fail when called from _PG_init() in
> > regular backends so that extension authors are made aware of this
> > hazard, or perhaps go the other way and change the order we do things
> > in parallel workers.
>
> Hmm.  There's an argument to be made for the latter: we don't really
> want stuff failing in parallel workers if it works fine normally.
>
> On the other hand, it seems clear to me that we *don't* want extensions to
> be doing stuff like syscache lookups in _PG_init(), because that would
> prevent them from working as shared_preload_libraries entries.
>
> And on the third hand, intentionally breaking code that used to work
> isn't likely to win us many friends either.  So I'm not sure that your
> first option is really tenable.  Perhaps we could get away with doing
> it in HEAD and not back-patching ... but that does little for existing
> problems.

Ok, so I guess we should just  make it work then.  Here's throw-away
test patch to make "unaccent" blow up in this way, and a draft patch
to reorder things so that it works.

CREATE EXTENSION unaccent;
CREATE TABLE t AS SELECT 'hello world' AS v FROM generate_series(1, 1000000);
ANALYZE t;
EXPLAIN ANALYZE SELECT COUNT(unaccent(v)) FROM t;

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Amit Kapila
Date:
On Thu, Aug 30, 2018 at 4:16 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>
> On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > > We could probably improve that situation by making syscache lookups
> > > (and probably other things too) fail when called from _PG_init() in
> > > regular backends so that extension authors are made aware of this
> > > hazard, or perhaps go the other way and change the order we do things
> > > in parallel workers.
> >
> > Hmm.  There's an argument to be made for the latter: we don't really
> > want stuff failing in parallel workers if it works fine normally.
> >
> > On the other hand, it seems clear to me that we *don't* want extensions to
> > be doing stuff like syscache lookups in _PG_init(), because that would
> > prevent them from working as shared_preload_libraries entries.
> >
> > And on the third hand, intentionally breaking code that used to work
> > isn't likely to win us many friends either.  So I'm not sure that your
> > first option is really tenable.  Perhaps we could get away with doing
> > it in HEAD and not back-patching ... but that does little for existing
> > problems.
>
> Ok, so I guess we should just  make it work then.  Here's throw-away
> test patch to make "unaccent" blow up in this way, and a draft patch
> to reorder things so that it works.
>

+1 for reordering the things in parallel-workers.  It will make the
behavior of parallel workers similar to backends.  In fact, I have
proposed that myself sometime back [1] for the similar problem.

[1] - https://www.postgresql.org/message-id/CAA4eK1KH0wd3%2BtUDxMuMbqhLh0kKMu_TOfAcSVWuyfYVt9hfzQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Andres Freund
Date:
On 2018-08-27 22:40:17 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > We could probably improve that situation by making syscache lookups
> > (and probably other things too) fail when called from _PG_init() in
> > regular backends so that extension authors are made aware of this
> > hazard, or perhaps go the other way and change the order we do things
> > in parallel workers.
> 
> Hmm.  There's an argument to be made for the latter: we don't really
> want stuff failing in parallel workers if it works fine normally.

Yea, I guess there's an argument to be made for that.


> On the other hand, it seems clear to me that we *don't* want extensions to
> be doing stuff like syscache lookups in _PG_init(), because that would
> prevent them from working as shared_preload_libraries entries.

Yea, that doesn't seem great.


> And on the third hand, intentionally breaking code that used to work
> isn't likely to win us many friends either.  So I'm not sure that your
> first option is really tenable.  Perhaps we could get away with doing
> it in HEAD and not back-patching ... but that does little for existing
> problems.

I wonder if we could make it warn in all branches and error out hard in
master?  That way hopefully most extensions would be fixed long before
the next release comes out, and people don't load new libraries at that
crazy a rate...

Greetings,

Andres Freund


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Mon, Sep 3, 2018 at 5:05 AM Andres Freund <andres@anarazel.de> wrote:
> On 2018-08-27 22:40:17 -0400, Tom Lane wrote:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > > We could probably improve that situation by making syscache lookups
> > > (and probably other things too) fail when called from _PG_init() in
> > > regular backends so that extension authors are made aware of this
> > > hazard, or perhaps go the other way and change the order we do things
> > > in parallel workers.
> >
> > Hmm.  There's an argument to be made for the latter: we don't really
> > want stuff failing in parallel workers if it works fine normally.
>
> Yea, I guess there's an argument to be made for that.
>
>
> > On the other hand, it seems clear to me that we *don't* want extensions to
> > be doing stuff like syscache lookups in _PG_init(), because that would
> > prevent them from working as shared_preload_libraries entries.
>
> Yea, that doesn't seem great.
>
>
> > And on the third hand, intentionally breaking code that used to work
> > isn't likely to win us many friends either.  So I'm not sure that your
> > first option is really tenable.  Perhaps we could get away with doing
> > it in HEAD and not back-patching ... but that does little for existing
> > problems.
>
> I wonder if we could make it warn in all branches and error out hard in
> master?  That way hopefully most extensions would be fixed long before
> the next release comes out, and people don't load new libraries at that
> crazy a rate...

Any other views?  It sounds like we do want a patch like the one I
posted earlier.  The question is whether we also want a new warning,
and in later releases an error.  Do we really want to crack down on
extension libraries that wouldn't work in shared_preload_libraries?
What about hypothetical libraries that are smart enough to avoid
making such calls when they detect that they're running from
shared_preload_libraries -- you'd generate false warnings when loaded
the regular way.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> ... Do we really want to crack down on
> extension libraries that wouldn't work in shared_preload_libraries?

That is, perhaps, a fair question; but I see little benefit to users
in such a library design.  Most people who are using a library heavily
would likely wish to have it preloaded.

> What about hypothetical libraries that are smart enough to avoid
> making such calls when they detect that they're running from
> shared_preload_libraries -- you'd generate false warnings when loaded
> the regular way.

I find that argument too thin to take seriously.  If the library
can manage that, it might as well just postpone the calls all the
time, rather than have two different operating modes that would
require separate testing.  What would be the advantage of the
extra complication?

            regards, tom lane


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > ... Do we really want to crack down on
> > extension libraries that wouldn't work in shared_preload_libraries?
>
> That is, perhaps, a fair question; but I see little benefit to users
> in such a library design.  Most people who are using a library heavily
> would likely wish to have it preloaded.
>
> > What about hypothetical libraries that are smart enough to avoid
> > making such calls when they detect that they're running from
> > shared_preload_libraries -- you'd generate false warnings when loaded
> > the regular way.
>
> I find that argument too thin to take seriously.  If the library
> can manage that, it might as well just postpone the calls all the
> time, rather than have two different operating modes that would
> require separate testing.  What would be the advantage of the
> extra complication?

Ok, 0003 (attached) is a straw-man patch to log a warning.  I guess it
won't be too popular with anyone who has worked on making syscache as
fast as possible.

postgres=# select unaccent('hello');
WARNING:  syscache access from _PG_init() is deprecated and will not
be allowed in a future release of PostgreSQL
 unaccent
----------
 hello
(1 row)

You see this warning only once per library per backend (including
parallel workers, as can be tested with the same trivial query with
force_parallel_mode = on, assuming 0002 is also applied so it doesn't
crash).

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Andrew Gierth
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 Tom> That is, perhaps, a fair question; but I see little benefit to
 Tom> users in such a library design. Most people who are using a
 Tom> library heavily would likely wish to have it preloaded.

possible counterexample: preprepare

-- 
Andrew (irc:RhodiumToad)


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Thu, Sep 13, 2018 at 4:22 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@enterprisedb.com> writes:
> > > ... Do we really want to crack down on
> > > extension libraries that wouldn't work in shared_preload_libraries?
> >
> > That is, perhaps, a fair question; but I see little benefit to users
> > in such a library design.  Most people who are using a library heavily
> > would likely wish to have it preloaded.
> >
> > > What about hypothetical libraries that are smart enough to avoid
> > > making such calls when they detect that they're running from
> > > shared_preload_libraries -- you'd generate false warnings when loaded
> > > the regular way.
> >
> > I find that argument too thin to take seriously.  If the library
> > can manage that, it might as well just postpone the calls all the
> > time, rather than have two different operating modes that would
> > require separate testing.  What would be the advantage of the
> > extra complication?
>
> Ok, 0003 (attached) is a straw-man patch to log a warning.  I guess it
> won't be too popular with anyone who has worked on making syscache as
> fast as possible.

I would like to close this case, so unless anyone objects, tomorrow
I'm going to push the reordering patch (0002 in my earlier email) so
that we stop getting bug reports about extensions that crash in
parallel workers (even though we consider those extensions to be
operating out of contract).  If someone wants to do a patch that will
display warnings when extensions break that rule (something like the
0003 patch I sketched), that can be done separately.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15350: Getting invalid cache ID: 11 Errors

From
Thomas Munro
Date:
On Wed, Sep 19, 2018 at 6:10 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I would like to close this case, so unless anyone objects, tomorrow
> I'm going to push the reordering patch (0002 in my earlier email) so
> that we stop getting bug reports about extensions that crash in
> parallel workers (even though we consider those extensions to be
> operating out of contract).  If someone wants to do a patch that will
> display warnings when extensions break that rule (something like the
> 0003 patch I sketched), that can be done separately.

Done.

-- 
Thomas Munro
http://www.enterprisedb.com