Thread: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
jimmy
Date:
I use one normal table(postgesql table) (1.5 million records) and three foreign tables(oracle table) (1.5 million records, 5 million records, 5 million records) to query data, it works.
Then I create a new normal table, and export foreign table data(1.5 million records) into this table,
After that I use two normal tables and two foreign tables query data, it does not work. 
And display these errors,
-----------------------------
ERROR:  invalid cache ID: 42
CONTEXT:  parallel worker
-----------------------------
Same sql why it does not work. I do not know how to resolve it.

I use postgresql 10.4, and oracle_fdw 2.0.0.
I use postgresql-10.4-1-linux-x64.run to install the database.
===============================================================
bash-4.1$ psql --version
psql (PostgreSQL) 10.4

postgres=# select version();
                                                 version                                                 
---------------------------------------------------------------------------------------------------------
 PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-18), 64-bit
===============================================================

the sql is like below:
When it works, 
tableA is a normal postgresql table
tableB, tableB1, tableC are foreign tables

When it does not work,
tableA, tableC are normal postgresql tables
tableB, tableB1 are foreign tables
=====================================================================
with sql1 as(
    SELECT 
a.x1,
a.x2,
a.x3,
a.x4,
abs(b.y1) y1,
b1.z1,
c.s1,
(b1.z1-a.x3) as difference_value
    FROM tableA as a 
    left join tableB as b 
    on a.f1 = b.f1
left join tableB1 as b1 
    on a.f1 = b1.f1
and b.f1 = b1.f1
    left join tableC as c 
    on a.belong = c.belong 
    and a.no = c.no
    where a.date >= to_date('2017-12-12', 'yyyy-mm-dd') 
    and a.code='XX' 
)

SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.difference_value ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.difference_value )
  FROM sql1 a
  GROUP BY a.x1
  union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.x2 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.x2 )
  FROM sql1 a
 
 GROUP BY a.x1
  union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.x3 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.x3 )
  FROM sql1 a
 
 GROUP BY a.x1
 union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.x4 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.x4)
  FROM sql1 a
 GROUP BY a.x1
  union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.y1 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.y1 )
  FROM sql1 a
  GROUP BY a.x1
     union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.x5 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.x5 )
  FROM sql1 a
  GROUP BY a.x1
    union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.s1 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.s1 )
  FROM sql1 a
  GROUP BY a.x1
      union all
 SELECT a.x1,
       PERCENTILE_DISC(0.025) WITHIN GROUP(ORDER BY a.z1 ),
       PERCENTILE_DISC(0.975) WITHIN GROUP(ORDER BY a.z1 )
  FROM sql1 a
  GROUP BY a.x1;
===================================================================


 

Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
Tom Lane
Date:
jimmy <mpokky@126.com> writes:
> ERROR:  invalid cache ID: 42
> CONTEXT:  parallel worker

We had a similar report back in April:

https://www.postgresql.org/message-id/flat/152337570402.31228.237601111670899842%40wrigleys.postgresql.org

but that didn't have enough information to debug the problem, and
neither does this.  Can you provide a *self contained* test case,
or failing that, do some debugging of your own as suggested in the
previous thread?

https://wiki.postgresql.org/wiki/Guide_to_reporting_problems

            regards, tom lane


Re:Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
jimmy
Date:
This is the debug log below. Is it useful. Thank you.


Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000000000828ed7 in hash_search_with_hash_value ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x0000000000828ed7 in hash_search_with_hash_value ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00000000006f16e3 in ?? ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00000000006f16e3 in ?? ()
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Continuing.

Program received signal SIGUSR1, User defined signal 1.
0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
#0  0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000006fd685 in WaitEventSetWait ()
#2  0x00000000006fdee9 in WaitLatchOrSocket ()
#3  0x00000000006a56c8 in WaitForBackgroundWorkerShutdown ()
#4  0x00000000004de17e in ?? ()
#5  0x00000000004de2a8 in DestroyParallelContext ()
#6  0x00000000004de343 in AtEOXact_Parallel ()
#7  0x00000000004e7597 in ?? ()
#8  0x00000000004e7be5 in AbortCurrentTransaction ()
#9  0x000000000072023d in PostgresMain ()
#10 0x00000000006b498a in PostmasterMain ()
#11 0x0000000000635720 in main ()
#0  0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000006fd685 in WaitEventSetWait ()
#2  0x00000000006fdee9 in WaitLatchOrSocket ()
#3  0x00000000006a56c8 in WaitForBackgroundWorkerShutdown ()
#4  0x00000000004de17e in ?? ()
#5  0x00000000004de2a8 in DestroyParallelContext ()
#6  0x00000000004de343 in AtEOXact_Parallel ()
#7  0x00000000004e7597 in ?? ()
#8  0x00000000004e7be5 in AbortCurrentTransaction ()
#9  0x000000000072023d in PostgresMain ()
#10 0x00000000006b498a in PostmasterMain ()
#11 0x0000000000635720 in main ()
Continuing.

Program received signal SIGINT, Interrupt.
0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
#0  0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000006fd685 in WaitEventSetWait ()
#2  0x00000000006253e6 in secure_read ()
#3  0x000000000062fddb in ?? ()
#4  0x00000000006301eb in pq_getbyte ()
#5  0x000000000071f9d0 in PostgresMain ()
#6  0x00000000006b498a in PostmasterMain ()
#7  0x0000000000635720 in main ()
#0  0x00000035848e8853 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000006fd685 in WaitEventSetWait ()
#2  0x00000000006253e6 in secure_read ()
#3  0x000000000062fddb in ?? ()
#4  0x00000000006301eb in pq_getbyte ()
#5  0x000000000071f9d0 in PostgresMain ()
#6  0x00000000006b498a in PostmasterMain ()
#7  0x0000000000635720 in main ()







At 2018-08-17 21:28:17, "Tom Lane" <tgl@sss.pgh.pa.us> wrote: >jimmy <mpokky@126.com> writes: >> ERROR: invalid cache ID: 42 >> CONTEXT: parallel worker > >We had a similar report back in April: > >https://www.postgresql.org/message-id/flat/152337570402.31228.237601111670899842%40wrigleys.postgresql.org > >but that didn't have enough information to debug the problem, and >neither does this. Can you provide a *self contained* test case, >or failing that, do some debugging of your own as suggested in the >previous thread? > >https://wiki.postgresql.org/wiki/Guide_to_reporting_problems > > regards, tom lane


 

Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
Thomas Munro
Date:
On Wed, Aug 22, 2018 at 2:54 PM, jimmy <mpokky@126.com> wrote:
> This is the debug log below. Is it useful. Thank you.

That's not showing the path that reaches the error.  If it's happening
in a parallel worker, that'll probably be tricky to catch with a
breakpoint.  Are you able to recompile PostgreSQL?  If you could do
that after changing all cases of elog(ERROR, "invalid cache ID: %d",
cacheId) to PANIC instead of ERROR, and then start it with ulimit -c
unlimited, you might get a core file that you can load into a debugger
to see how we reached it.

It's a strange error.  I don't think it can be coming from these
places in inval.c:

        if (cacheid < 0 || cacheid >= SysCacheSize)
                elog(ERROR, "invalid cache ID: %d", cacheid);

... because we can see that it's 42 (PROCNAMEARGSNSP, a valid cache
ID), and SysCacheSize is a compile-time constant greater than 42.  So
it must be coming from one of the places in syscache.c that look like
this:

        if (cacheId < 0 || cacheId >= SysCacheSize ||
                !PointerIsValid(SysCache[cacheId]))
                elog(ERROR, "invalid cache ID: %d", cacheId);

Since InitCatalogCache() puts a non-NULL pointer into every index from
0 to SysCacheSize - 1 without gaps (or it errors out if it fails while
trying), it seems like either InitCatalogCache() didn't run, or
SysCache[42] has later been overwritten with NULL?  I wondered if
there is some way for a parallel worker to reach shared invalidation
message processing code before the InitCatalogCache() has run, but
that doesn't seem to be an issue: SysCacheInvalidate() quietly
tolerates that.

I wonder how we could reach one of SearchSysCache(PROCNAMEARGSNSP,
...), SysCacheGetAttr(PROCNAMEARGSNSP, ...),
GetSysCacheHashValue(PROCNAMEARGSNSP, ...),
SearchSysCacheList(PROCNAMEARGSNSP, ...) before InitCatalogCache() has
finished?  The answer probably involves oracle_fdw.

Ahh, how about this line here:

https://github.com/laurenz/oracle_fdw/blob/master/oracle_fdw.c#L6237

catlist = SearchSysCacheList2(
PROCNAMEARGSNSP,
CStringGetDatum("geometry_recv"),
PointerGetDatum(buildoidvector(argtypes, argcount)));

I don't immediately see how that can be reached before
InitCatalogCache() has run, though.

-- 
Thomas Munro
http://www.enterprisedb.com


Re:Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallelworker

From
jimmy
Date:
I create two normal postgresql tables, and import datas from two foreign tables.
Then I use four normal postgresql tables to query data, and it still throw these errors, after that  I am afraid that it maybe is not connection  with oracle fdw.
In these tables, one table has over 200 fields, and the other two tables, each table has over 800 fields.
When I use left join, Would that be the problem that caused this.






At 2018-08-22 12:47:30, "Thomas Munro" <thomas.munro@enterprisedb.com> wrote: >On Wed, Aug 22, 2018 at 2:54 PM, jimmy <mpokky@126.com> wrote: >> This is the debug log below. Is it useful. Thank you. > >That's not showing the path that reaches the error. If it's happening >in a parallel worker, that'll probably be tricky to catch with a >breakpoint. Are you able to recompile PostgreSQL? If you could do >that after changing all cases of elog(ERROR, "invalid cache ID: %d", >cacheId) to PANIC instead of ERROR, and then start it with ulimit -c >unlimited, you might get a core file that you can load into a debugger >to see how we reached it. > >It's a strange error. I don't think it can be coming from these >places in inval.c: > > if (cacheid < 0 || cacheid >= SysCacheSize) > elog(ERROR, "invalid cache ID: %d", cacheid); > >... because we can see that it's 42 (PROCNAMEARGSNSP, a valid cache >ID), and SysCacheSize is a compile-time constant greater than 42. So >it must be coming from one of the places in syscache.c that look like >this: > > if (cacheId < 0 || cacheId >= SysCacheSize || > !PointerIsValid(SysCache[cacheId])) > elog(ERROR, "invalid cache ID: %d", cacheId); > >Since InitCatalogCache() puts a non-NULL pointer into every index from >0 to SysCacheSize - 1 without gaps (or it errors out if it fails while >trying), it seems like either InitCatalogCache() didn't run, or >SysCache[42] has later been overwritten with NULL? I wondered if >there is some way for a parallel worker to reach shared invalidation >message processing code before the InitCatalogCache() has run, but >that doesn't seem to be an issue: SysCacheInvalidate() quietly >tolerates that. > >I wonder how we could reach one of SearchSysCache(PROCNAMEARGSNSP, >...), SysCacheGetAttr(PROCNAMEARGSNSP, ...), >GetSysCacheHashValue(PROCNAMEARGSNSP, ...), >SearchSysCacheList(PROCNAMEARGSNSP, ...) before InitCatalogCache() has >finished? The answer probably involves oracle_fdw. > >Ahh, how about this line here: > >https://github.com/laurenz/oracle_fdw/blob/master/oracle_fdw.c#L6237 > >catlist = SearchSysCacheList2( >PROCNAMEARGSNSP, >CStringGetDatum("geometry_recv"), >PointerGetDatum(buildoidvector(argtypes, argcount))); > >I don't immediately see how that can be reached before >InitCatalogCache() has run, though. > >-- >Thomas Munro >http://www.enterprisedb.com


 

Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
Amit Kapila
Date:
On Fri, Aug 24, 2018 at 11:35 AM jimmy <mpokky@126.com> wrote:
>
> I create two normal postgresql tables, and import datas from two foreign tables.
> Then I use four normal postgresql tables to query data, and it still throw these errors, after that  I am afraid that
itmaybe is not connection  with oracle fdw.
 
>

So, can we assume that there is no fdw involved in the query that is
generating an error? Can you share the plan (Explain Analyze or
Explain output) of the query?

> In these tables, one table has over 200 fields, and the other two tables, each table has over 800 fields.
> When I use left join, Would that be the problem that caused this.
>

I am not sure but there is no apparent reason why on using some form
of join, it throws such an error.  I think we can make better progress
if you get us the call stack of the parallel worker/'s.  I can see
above that you have got the call stack, but it is from the master
backend and the error is generated in the parallel worker.  Thomas has
already suggested a way to get the call stack, if you can do that, it
would be great.  I think we can help you to get the call stack if you
are facing any problem, have you tried getting it?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


I use postgresql-10.4-1-linux-x64.run to install.
Would you tell me how to recompile the database.
Thank you.


At 2018-08-25 11:19:10, "Amit Kapila" <amit.kapila16@gmail.com> wrote: >On Fri, Aug 24, 2018 at 11:35 AM jimmy <mpokky@126.com> wrote: >> >> I create two normal postgresql tables, and import datas from two foreign tables. >> Then I use four normal postgresql tables to query data, and it still throw these errors, after that I am afraid that it maybe is not connection with oracle fdw. >> > >So, can we assume that there is no fdw involved in the query that is >generating an error? Can you share the plan (Explain Analyze or >Explain output) of the query? > >> In these tables, one table has over 200 fields, and the other two tables, each table has over 800 fields. >> When I use left join, Would that be the problem that caused this. >> > >I am not sure but there is no apparent reason why on using some form >of join, it throws such an error. I think we can make better progress >if you get us the call stack of the parallel worker/'s. I can see >above that you have got the call stack, but it is from the master >backend and the error is generated in the parallel worker. Thomas has >already suggested a way to get the call stack, if you can do that, it >would be great. I think we can help you to get the call stack if you >are facing any problem, have you tried getting it? > > >-- >With Regards, >Amit Kapila. >EnterpriseDB: http://www.enterprisedb.com


 

Re: Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker

From
Amit Kapila
Date:
On Mon, Aug 27, 2018 at 8:07 AM jimmy <mpokky@126.com> wrote:
>
> I use postgresql-10.4-1-linux-x64.run to install.
> Would you tell me how to recompile the database.
>

You can refer PostgreSQL docs:
https://www.postgresql.org/docs/devel/static/installation.html

You might need to refer wiki to know how to get a stack trace:
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Let explain it.When I use postgresql-10.4-1-linux-x64.run to install.
There does not use source code to compile, and make , make install. It is like a windows program. I just click next step button.
The database was installed from the ui interface step by step.






At 2018-08-27 13:54:40, "Amit Kapila" <amit.kapila16@gmail.com> wrote: >On Mon, Aug 27, 2018 at 8:07 AM jimmy <mpokky@126.com> wrote: >> >> I use postgresql-10.4-1-linux-x64.run to install. >> Would you tell me how to recompile the database. >> > >You can refer PostgreSQL docs: >https://www.postgresql.org/docs/devel/static/installation.html > >You might need to refer wiki to know how to get a stack trace: >https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend > >-- >With Regards, >Amit Kapila. >EnterpriseDB: http://www.enterprisedb.com


 

Re: Re: Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT:parallel worker

From
Thomas Munro
Date:
On Mon, Aug 27, 2018 at 6:07 PM jimmy <mpokky@126.com> wrote:
> Let explain it.When I use postgresql-10.4-1-linux-x64.run to install.
> There does not use source code to compile, and make , make install. It is like a windows program. I just click next
stepbutton.
 
> The database was installed from the ui interface step by step.

Hi Jimmy,

Just want to double check something: are you sure you're using
oracle_fdw 2.0.0?  It seems that the earlier versions suffered from a
problem with exactly the symptom you describe (except the error said
41 instead of 42, but that's expected because the enumerator values
moved):

https://www.postgresql.org/message-id/11960.1511116873%40sss.pgh.pa.us

But that was fixed here:

https://github.com/laurenz/oracle_fdw/commit/4accfebb33c316d71da73d341dac796df813638c

The problem was that it tried to access the system caches from
_PG_init().  It wouldn't matter if you were querying an oracle_fdw
table or not, it'd only matter if that library had been loaded in the
leader process, because that would cause the parallel workers to load
it, run _PG_init(), and blow up.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Re: Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT:parallel worker

From
Thomas Munro
Date:
On Mon, Aug 27, 2018 at 11:19 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Just want to double check something: are you sure you're using
> oracle_fdw 2.0.0?  It seems that the earlier versions suffered from a
> problem with exactly the symptom you describe (except the error said
> 41 instead of 42, but that's expected because the enumerator values
> moved):
>
> https://www.postgresql.org/message-id/11960.1511116873%40sss.pgh.pa.us
>
> But that was fixed here:
>
> https://github.com/laurenz/oracle_fdw/commit/4accfebb33c316d71da73d341dac796df813638c

Ah, I missed the fact that the 2.0.0 release didn't have that fix, and
there hasn't been a new release since.  So this is an issue to take up
on the oracle_fdw issue tracker.

-- 
Thomas Munro
http://www.enterprisedb.com



hi,I  exported the data into the normal postgresql table from foreign table by oracle_fdw.
Now I use four normal tables to query data, not use foreign table .
but it still throws the error like below:
So I think maybe that is not much related with oracle_fdw, because I use the normal tables of Postgresql to query data.
--------------------------------------
ERROR: invalid cache ID: 42 CONTEXT: parallel worker 
SQL 状态:XX000
-------------------------------------- 
I test the same sql in postgres-bigsql-10.5 for windows version, that still throws the error like above.
and the quantity of these tables' field are very large , every tables has more than 800 fields.
Would it make these errors .
Or maybe would the sql I execute has some mistakes.
I am confused.What is wrong.
The sql has been provided before.
I use some unique table indexs like this:
create unique index tableB_id1 on tableB(x);




At 2018-08-27 20:54:27, "Thomas Munro" <thomas.munro@enterprisedb.com> wrote: >On Mon, Aug 27, 2018 at 11:19 PM Thomas Munro ><thomas.munro@enterprisedb.com> wrote: >> Just want to double check something: are you sure you're using >> oracle_fdw 2.0.0? It seems that the earlier versions suffered from a >> problem with exactly the symptom you describe (except the error said >> 41 instead of 42, but that's expected because the enumerator values >> moved): >> >> https://www.postgresql.org/message-id/11960.1511116873%40sss.pgh.pa.us >> >> But that was fixed here: >> >> https://github.com/laurenz/oracle_fdw/commit/4accfebb33c316d71da73d341dac796df813638c > >Ah, I missed the fact that the 2.0.0 release didn't have that fix, and >there hasn't been a new release since. So this is an issue to take up >on the oracle_fdw issue tracker. > >-- >Thomas Munro >http://www.enterprisedb.com