Thread: BUG #15350: Getting invalid cache ID: 11 Errors
The following bug has been logged on the website: Bug reference: 15350 Logged by: Kieran McCusker Email address: kieran.mccusker@gmail.com PostgreSQL version: 10.5 Operating system: Fedora 28 Description: Hi We're using 10.5 with parallel queries enabled and the config options #max_worker_processes = 8 #max_parallel_workers_per_gather = 2 #max_parallel_workers = 8 I'm seeing invalid cache ID: 11 errors in the log. It's only happening occasionally (15 times today on a not very busy system). I've been following the thread about making windowing functions et al parallel restricted - Is this another manifestation of the problem? On the first query setting set session max_parallel_workers to 0 more than doubles the execution time from 208ms to 580ms so that doesn't seem desirable, and it's not obvious to me why the second would have failed. I'm probably being dense and if so apologies for wasting your time. I also don't see how to create a test case sorry. Many thanks Kieran A couple of example queries are :- ----------------------------------------------------------------------------------------------- with limits as ( select min(contact_date), max(contact_date) from q366midl.q366m1 where contact_date is not null ) select to_char(min,'YYYY-MM'), 'text' as format, to_char(min,substr('YYYY-MM',1,7)) as code, -- from limits union all select to_char(max,'YYYY-MM'), 'text' as format, to_char(max,substr('YYYY-MM',1,7)) as code, to_char(max,'Mon YYYY') as text from limits order by 1 2018-08-24 13:50:28.325 BST [68794] ERROR: invalid cache ID: 11 2018-08-24 13:50:28.325 BST [68844] LOG: worker process: parallel worker for PID 67216 (PID 68793) exited with exit code 1 2018-08-24 13:50:28.327 BST [68844] LOG: worker process: parallel worker for PID 67216 (PID 68794) exited with exit code 1 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 2018-08-23 11:04:23.020 BST [80148] ERROR: invalid cache ID: 11 2018-08-23 11:04:23.020 BST [80148] CONTEXT: parallel worker 2018-08-23 11:04:23.020 BST [80148] STATEMENT: SELECT "survey"."calls".* FROM "survey"."calls" WHERE ((survey_state is null or survey_state not in (500,501)) and phone_numbers @> array['07854 xxxxxx']) ORDER BY created_at desc 2018-08-23 11:04:23.021 BST [81154] ERROR: invalid cache ID: 11 2018-08-23 11:04:23.023 BST [68844] LOG: worker process: parallel worker for PID 80148 (PID 81153) exited with exit code 1 2018-08-23 11:04:23.023 BST [68844] LOG: worker process: parallel worker for PID 80148 (PID 81154) exited with exit code 1
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes: > We're using 10.5 with parallel queries enabled and the config options > #max_worker_processes = 8 > #max_parallel_workers_per_gather = 2 > #max_parallel_workers = 8 > I'm seeing invalid cache ID: 11 errors in the log. It's only happening > occasionally (15 times today on a not very busy system). Interesting. Syscache 11 would be AUTHOID, which seems to be consulted mostly for privilege checks, though there's at least one reference during process startup. > I've been following > the thread about making windowing functions et al parallel restricted - Is > this another manifestation of the problem? No, I don't think it's that. We've seen a couple of reports of similar failures referencing cache 42, but this is a new symptom. > I also don't see how to create a test case sorry. Don't suppose you'd be prepared to recompile PG to turn that ERROR into a PANIC, and then capture a stack trace from the core dump next time it happens? regards, tom lane
On Sat, Aug 25, 2018 at 3:16 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > =?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes: > > We're using 10.5 with parallel queries enabled and the config options > > #max_worker_processes = 8 > > #max_parallel_workers_per_gather = 2 > > #max_parallel_workers = 8 > > > I'm seeing invalid cache ID: 11 errors in the log. It's only happening > > occasionally (15 times today on a not very busy system). > > Interesting. Syscache 11 would be AUTHOID, which seems to be consulted > mostly for privilege checks, though there's at least one reference > during process startup. Hi Kieran, Are you using extensions, by any chance? If an extension were to access the AUTHOID syscache during _PG_init(), it would fail like this in parallel workers, because they run RestoreLibraryState() before they run BackgroundWorkerInitializeConnectionByOid() (which runs InitPostgres() which runs InitCatalogCache()). Oracle_fdw has this problem (see nearby thread) and there may be others out there. The extension wouldn't have to be used by the query that exhibited the symptom... it could have been loaded earlier in the life of the leader backend but caused no problem until eventually a parallel query was launched. We could probably improve that situation by making syscache lookups (and probably other things too) fail when called from _PG_init() in regular backends so that extension authors are made aware of this hazard, or perhaps go the other way and change the order we do things in parallel workers. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > We could probably improve that situation by making syscache lookups > (and probably other things too) fail when called from _PG_init() in > regular backends so that extension authors are made aware of this > hazard, or perhaps go the other way and change the order we do things > in parallel workers. Hmm. There's an argument to be made for the latter: we don't really want stuff failing in parallel workers if it works fine normally. On the other hand, it seems clear to me that we *don't* want extensions to be doing stuff like syscache lookups in _PG_init(), because that would prevent them from working as shared_preload_libraries entries. And on the third hand, intentionally breaking code that used to work isn't likely to win us many friends either. So I'm not sure that your first option is really tenable. Perhaps we could get away with doing it in HEAD and not back-patching ... but that does little for existing problems. regards, tom lane
Yes I'm using extensions - A bunch of them but I'm guessing it's likely ogr_fdw that will be causing the issue as I've only seen this on days when someone has imported data using ogr_fdw.
On Tue, 28 Aug 2018, 03:40 Tom Lane, <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> We could probably improve that situation by making syscache lookups
> (and probably other things too) fail when called from _PG_init() in
> regular backends so that extension authors are made aware of this
> hazard, or perhaps go the other way and change the order we do things
> in parallel workers.
Hmm. There's an argument to be made for the latter: we don't really
want stuff failing in parallel workers if it works fine normally.
On the other hand, it seems clear to me that we *don't* want extensions to
be doing stuff like syscache lookups in _PG_init(), because that would
prevent them from working as shared_preload_libraries entries.
And on the third hand, intentionally breaking code that used to work
isn't likely to win us many friends either. So I'm not sure that your
first option is really tenable. Perhaps we could get away with doing
it in HEAD and not back-patching ... but that does little for existing
problems.
regards, tom lane
On Wed, Aug 29, 2018 at 2:31 AM Kieran McCusker <kieran.mccusker@gmail.com> wrote: > Yes I'm using extensions - A bunch of them but I'm guessing it's likely ogr_fdw that will be causing the issue as I'veonly seen this on days when someone has imported data using ogr_fdw. Bingo: https://github.com/pramsey/pgsql-ogr-fdw/blob/v1.0.6/ogr_fdw.c#L195 -- Thomas Munro http://www.enterprisedb.com
On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > We could probably improve that situation by making syscache lookups > > (and probably other things too) fail when called from _PG_init() in > > regular backends so that extension authors are made aware of this > > hazard, or perhaps go the other way and change the order we do things > > in parallel workers. > > Hmm. There's an argument to be made for the latter: we don't really > want stuff failing in parallel workers if it works fine normally. > > On the other hand, it seems clear to me that we *don't* want extensions to > be doing stuff like syscache lookups in _PG_init(), because that would > prevent them from working as shared_preload_libraries entries. > > And on the third hand, intentionally breaking code that used to work > isn't likely to win us many friends either. So I'm not sure that your > first option is really tenable. Perhaps we could get away with doing > it in HEAD and not back-patching ... but that does little for existing > problems. Ok, so I guess we should just make it work then. Here's throw-away test patch to make "unaccent" blow up in this way, and a draft patch to reorder things so that it works. CREATE EXTENSION unaccent; CREATE TABLE t AS SELECT 'hello world' AS v FROM generate_series(1, 1000000); ANALYZE t; EXPLAIN ANALYZE SELECT COUNT(unaccent(v)) FROM t; -- Thomas Munro http://www.enterprisedb.com
Attachment
On Thu, Aug 30, 2018 at 4:16 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Tue, Aug 28, 2018 at 2:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > > We could probably improve that situation by making syscache lookups > > > (and probably other things too) fail when called from _PG_init() in > > > regular backends so that extension authors are made aware of this > > > hazard, or perhaps go the other way and change the order we do things > > > in parallel workers. > > > > Hmm. There's an argument to be made for the latter: we don't really > > want stuff failing in parallel workers if it works fine normally. > > > > On the other hand, it seems clear to me that we *don't* want extensions to > > be doing stuff like syscache lookups in _PG_init(), because that would > > prevent them from working as shared_preload_libraries entries. > > > > And on the third hand, intentionally breaking code that used to work > > isn't likely to win us many friends either. So I'm not sure that your > > first option is really tenable. Perhaps we could get away with doing > > it in HEAD and not back-patching ... but that does little for existing > > problems. > > Ok, so I guess we should just make it work then. Here's throw-away > test patch to make "unaccent" blow up in this way, and a draft patch > to reorder things so that it works. > +1 for reordering the things in parallel-workers. It will make the behavior of parallel workers similar to backends. In fact, I have proposed that myself sometime back [1] for the similar problem. [1] - https://www.postgresql.org/message-id/CAA4eK1KH0wd3%2BtUDxMuMbqhLh0kKMu_TOfAcSVWuyfYVt9hfzQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2018-08-27 22:40:17 -0400, Tom Lane wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > We could probably improve that situation by making syscache lookups > > (and probably other things too) fail when called from _PG_init() in > > regular backends so that extension authors are made aware of this > > hazard, or perhaps go the other way and change the order we do things > > in parallel workers. > > Hmm. There's an argument to be made for the latter: we don't really > want stuff failing in parallel workers if it works fine normally. Yea, I guess there's an argument to be made for that. > On the other hand, it seems clear to me that we *don't* want extensions to > be doing stuff like syscache lookups in _PG_init(), because that would > prevent them from working as shared_preload_libraries entries. Yea, that doesn't seem great. > And on the third hand, intentionally breaking code that used to work > isn't likely to win us many friends either. So I'm not sure that your > first option is really tenable. Perhaps we could get away with doing > it in HEAD and not back-patching ... but that does little for existing > problems. I wonder if we could make it warn in all branches and error out hard in master? That way hopefully most extensions would be fixed long before the next release comes out, and people don't load new libraries at that crazy a rate... Greetings, Andres Freund
On Mon, Sep 3, 2018 at 5:05 AM Andres Freund <andres@anarazel.de> wrote: > On 2018-08-27 22:40:17 -0400, Tom Lane wrote: > > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > > We could probably improve that situation by making syscache lookups > > > (and probably other things too) fail when called from _PG_init() in > > > regular backends so that extension authors are made aware of this > > > hazard, or perhaps go the other way and change the order we do things > > > in parallel workers. > > > > Hmm. There's an argument to be made for the latter: we don't really > > want stuff failing in parallel workers if it works fine normally. > > Yea, I guess there's an argument to be made for that. > > > > On the other hand, it seems clear to me that we *don't* want extensions to > > be doing stuff like syscache lookups in _PG_init(), because that would > > prevent them from working as shared_preload_libraries entries. > > Yea, that doesn't seem great. > > > > And on the third hand, intentionally breaking code that used to work > > isn't likely to win us many friends either. So I'm not sure that your > > first option is really tenable. Perhaps we could get away with doing > > it in HEAD and not back-patching ... but that does little for existing > > problems. > > I wonder if we could make it warn in all branches and error out hard in > master? That way hopefully most extensions would be fixed long before > the next release comes out, and people don't load new libraries at that > crazy a rate... Any other views? It sounds like we do want a patch like the one I posted earlier. The question is whether we also want a new warning, and in later releases an error. Do we really want to crack down on extension libraries that wouldn't work in shared_preload_libraries? What about hypothetical libraries that are smart enough to avoid making such calls when they detect that they're running from shared_preload_libraries -- you'd generate false warnings when loaded the regular way. -- Thomas Munro http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> writes: > ... Do we really want to crack down on > extension libraries that wouldn't work in shared_preload_libraries? That is, perhaps, a fair question; but I see little benefit to users in such a library design. Most people who are using a library heavily would likely wish to have it preloaded. > What about hypothetical libraries that are smart enough to avoid > making such calls when they detect that they're running from > shared_preload_libraries -- you'd generate false warnings when loaded > the regular way. I find that argument too thin to take seriously. If the library can manage that, it might as well just postpone the calls all the time, rather than have two different operating modes that would require separate testing. What would be the advantage of the extra complication? regards, tom lane
On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > ... Do we really want to crack down on > > extension libraries that wouldn't work in shared_preload_libraries? > > That is, perhaps, a fair question; but I see little benefit to users > in such a library design. Most people who are using a library heavily > would likely wish to have it preloaded. > > > What about hypothetical libraries that are smart enough to avoid > > making such calls when they detect that they're running from > > shared_preload_libraries -- you'd generate false warnings when loaded > > the regular way. > > I find that argument too thin to take seriously. If the library > can manage that, it might as well just postpone the calls all the > time, rather than have two different operating modes that would > require separate testing. What would be the advantage of the > extra complication? Ok, 0003 (attached) is a straw-man patch to log a warning. I guess it won't be too popular with anyone who has worked on making syscache as fast as possible. postgres=# select unaccent('hello'); WARNING: syscache access from _PG_init() is deprecated and will not be allowed in a future release of PostgreSQL unaccent ---------- hello (1 row) You see this warning only once per library per backend (including parallel workers, as can be tested with the same trivial query with force_parallel_mode = on, assuming 0002 is also applied so it doesn't crash). -- Thomas Munro http://www.enterprisedb.com
Attachment
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes: Tom> That is, perhaps, a fair question; but I see little benefit to Tom> users in such a library design. Most people who are using a Tom> library heavily would likely wish to have it preloaded. possible counterexample: preprepare -- Andrew (irc:RhodiumToad)
On Thu, Sep 13, 2018 at 4:22 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Sep 13, 2018 at 1:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@enterprisedb.com> writes: > > > ... Do we really want to crack down on > > > extension libraries that wouldn't work in shared_preload_libraries? > > > > That is, perhaps, a fair question; but I see little benefit to users > > in such a library design. Most people who are using a library heavily > > would likely wish to have it preloaded. > > > > > What about hypothetical libraries that are smart enough to avoid > > > making such calls when they detect that they're running from > > > shared_preload_libraries -- you'd generate false warnings when loaded > > > the regular way. > > > > I find that argument too thin to take seriously. If the library > > can manage that, it might as well just postpone the calls all the > > time, rather than have two different operating modes that would > > require separate testing. What would be the advantage of the > > extra complication? > > Ok, 0003 (attached) is a straw-man patch to log a warning. I guess it > won't be too popular with anyone who has worked on making syscache as > fast as possible. I would like to close this case, so unless anyone objects, tomorrow I'm going to push the reordering patch (0002 in my earlier email) so that we stop getting bug reports about extensions that crash in parallel workers (even though we consider those extensions to be operating out of contract). If someone wants to do a patch that will display warnings when extensions break that rule (something like the 0003 patch I sketched), that can be done separately. -- Thomas Munro http://www.enterprisedb.com
On Wed, Sep 19, 2018 at 6:10 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I would like to close this case, so unless anyone objects, tomorrow > I'm going to push the reordering patch (0002 in my earlier email) so > that we stop getting bug reports about extensions that crash in > parallel workers (even though we consider those extensions to be > operating out of contract). If someone wants to do a patch that will > display warnings when extensions break that rule (something like the > 0003 patch I sketched), that can be done separately. Done. -- Thomas Munro http://www.enterprisedb.com