Thread: PL/R Median Busts Commit (Postgres 9.1.6 + plr 8.3.0.13 on Ubuntu 12.10 64 bit)

PL/R Median Busts Commit (Postgres 9.1.6 + plr 8.3.0.13 on Ubuntu 12.10 64 bit)

From
Mark Kirkwood
Date:
I admit - it sounds unlikely. However a simple scenario (attached) gives
rise to:

WARNING:  AbortTransaction while in COMMIT state
PANIC:  cannot abort transaction 880983, it was already committed

Essentially we are doing:

BEGIN;
DROP TABLE IF EXISTS tab0;
CREATE TEMP TABLE tab0 ( id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO tab0  SELECT generate_series(1,1000),'xxxxxxxxxx';
SELECT median(id) FROM tab0;
DROP TABLE IF EXISTS tab1;
CREATE TEMP TABLE tab1 ( id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO tab1  SELECT generate_series(1,1000),'xxxxxxxxxx';
DROP TABLE IF EXISTS tab2;
...
DROP TABLE IF EXISTS tab99;
CREATE TEMP TABLE tab99 ( id INTEGER PRIMARY KEY, val TEXT);
INSERT INTO tab99  SELECT generate_series(1,1000),'xxxxxxxxxx';
SELECT median(id) FROM tab99;
COMMIT;


We crash at commit.

Changing median to something else (e.g avg or even a non PLR median
function) makes the error vanish, so its either PLR or the specific PLR
median function causing the grief.

Regards

Mark



Attachment
On 01/24/2013 05:21 AM, Mark Kirkwood wrote:
> I admit - it sounds unlikely. However a simple scenario (attached) gives
> rise to:

This is the wrong place for the bug report on PL/R I think, but I'll
take a look.

Joe


> WARNING:  AbortTransaction while in COMMIT state
> PANIC:  cannot abort transaction 880983, it was already committed
>
> Essentially we are doing:
>
> BEGIN;
> DROP TABLE IF EXISTS tab0;
> CREATE TEMP TABLE tab0 ( id INTEGER PRIMARY KEY, val TEXT);
> INSERT INTO tab0  SELECT generate_series(1,1000),'xxxxxxxxxx';
> SELECT median(id) FROM tab0;
> DROP TABLE IF EXISTS tab1;
> CREATE TEMP TABLE tab1 ( id INTEGER PRIMARY KEY, val TEXT);
> INSERT INTO tab1  SELECT generate_series(1,1000),'xxxxxxxxxx';
> DROP TABLE IF EXISTS tab2;
> ...
> DROP TABLE IF EXISTS tab99;
> CREATE TEMP TABLE tab99 ( id INTEGER PRIMARY KEY, val TEXT);
> INSERT INTO tab99  SELECT generate_series(1,1000),'xxxxxxxxxx';
> SELECT median(id) FROM tab99;
> COMMIT;
>
>
> We crash at commit.
>
> Changing median to something else (e.g avg or even a non PLR median
> function) makes the error vanish, so its either PLR or the specific PLR
> median function causing the grief.


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Ah right - sorry, I did a quick look for a mail list on the plr web site
and didn't spot anything.

Thanks

Mark

On 25/01/13 04:14, Joe Conway wrote:
> On 01/24/2013 05:21 AM, Mark Kirkwood wrote:
>> I admit - it sounds unlikely. However a simple scenario (attached) gives
>> rise to:
>
> This is the wrong place for the bug report on PL/R I think, but I'll
> take a look.
>
On 01/24/2013 08:01 PM, Mark Kirkwood wrote:
> Ah right - sorry, I did a quick look for a mail list on the plr web site
> and didn't spot anything.

No problem. It is plr-general on pgfoundry:
http://pgfoundry.org/mail/?group_id=1000247

Joe


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
On 25/01/13 04:14, Joe Conway wrote:
> On 01/24/2013 05:21 AM, Mark Kirkwood wrote:
>> I admit - it sounds unlikely. However a simple scenario (attached) gives
>> rise to:
>
> This is the wrong place for the bug report on PL/R I think, but I'll
> take a look.
>
> Joe
>
>

FYI - 8.4 shows the same behaviour as 9.1, but 8.3 gets a SUGUSR1:


Program received signal SIGUSR1, User defined signal 1.
0x00000000005a401f in SIInsertDataEntry (segP=0x7f3b8ea5e2a0,
     data=data@entry=0x4082d10) at sinvaladt.c:255
255        segP->buffer[segP->maxMsgNum % MAXNUMMESSAGES] = *data;
(gdb)
On 25/01/13 10:12, Mark Kirkwood wrote:
> On 25/01/13 04:14, Joe Conway wrote:
>> On 01/24/2013 05:21 AM, Mark Kirkwood wrote:
>>> I admit - it sounds unlikely. However a simple scenario (attached) gives
>>> rise to:
>>
>> This is the wrong place for the bug report on PL/R I think, but I'll
>> take a look.
>>
>> Joe
>>
>>
>
> FYI - 8.4 shows the same behaviour as 9.1, but 8.3 gets a SUGUSR1:
>
>
> Program received signal SIGUSR1, User defined signal 1.
> 0x00000000005a401f in SIInsertDataEntry (segP=0x7f3b8ea5e2a0,
>      data=data@entry=0x4082d10) at sinvaladt.c:255
> 255        segP->buffer[segP->maxMsgNum % MAXNUMMESSAGES] = *data;
> (gdb)
>
>

Sorry - the others are getting a SIGUSR1 too (just was not so obvious).
Here's bt from 9.1:

Program received signal SIGUSR1, User defined signal 1.
0x00007f2bfe4c1707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f2bfe4c1707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000000000611683 in SICleanupQueue (callerHasWriteLock=1 '\001',
     minFree=<optimized out>) at sinvaladt.c:672
#2  0x0000000000611768 in SIInsertDataEntries (data=0x7f2bf68b4cb0, n=0)
     at sinvaladt.c:437
#3  0x00000000006cdad1 in ProcessInvalidationMessagesMulti (hdr=0x295b240,
     func=0x610de0 <SendSharedInvalidMessages>) at inval.c:402
#4  0x00000000006ce780 in AtEOXact_Inval (isCommit=<optimized out>)
     at inval.c:940
#5  0x00000000004826cd in CommitTransaction () at xact.c:1893
#6  0x000000000048467d in CommitTransactionCommand () at xact.c:2562
#7  0x0000000000623a17 in finish_xact_command () at postgres.c:2452
#8  finish_xact_command () at postgres.c:2441
#9  0x0000000000626b5d in exec_simple_query (query_string=0x29f7d60
"COMMIT;")
     at postgres.c:1051
#10 PostgresMain (argc=<optimized out>, argv=<optimized out>,
     username=<optimized out>) at postgres.c:3968
#11 0x00000000005eb429 in BackendRun (port=0x2979f00) at postmaster.c:3611
#12 BackendStartup (port=0x2979f00) at postmaster.c:3296
#13 ServerLoop () at postmaster.c:1460
#14 0x00000000005ebd3c in PostmasterMain (argc=<optimized out>,
argv=0x2955600)
     at postmaster.c:1121
#15 0x0000000000453d20 in main (argc=1, argv=0x2955600) at main.c:199
(gdb)
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> Sorry - the others are getting a SIGUSR1 too (just was not so obvious).

SIGUSR1 is not a bug, it's expected cross-session signaling behavior.

            regards, tom lane
On 25/01/13 10:18, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> Sorry - the others are getting a SIGUSR1 too (just was not so obvious).
>
> SIGUSR1 is not a bug, it's expected cross-session signaling behavior.
>
>             regards, tom lane
>
>

Doh! Yes of course, sorry for the noise. I was busy thinking that the
issue could be tied up with sinval and plan caching (if there is any) in
plr and got excited about seeing something in gdb...and didn't think
carefully about why what I was seeing was not a bug at all :-( Thanks
for clarifying!

Cheers

Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> Doh! Yes of course, sorry for the noise. I was busy thinking that the
> issue could be tied up with sinval and plan caching (if there is any) in
> plr and got excited about seeing something in gdb...and didn't think
> carefully about why what I was seeing was not a bug at all :-( Thanks
> for clarifying!

FWIW, presumably the failure comes from something throwing elog(ERROR)
after RecordTransactionCommit().  So what I'd try is setting a
breakpoint at errfinish and investigating what's causing the first error
call.

            regards, tom lane
On 01/24/2013 10:36 PM, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> Doh! Yes of course, sorry for the noise. I was busy thinking that the
>> issue could be tied up with sinval and plan caching (if there is any) in
>> plr and got excited about seeing something in gdb...and didn't think
>> carefully about why what I was seeing was not a bug at all :-( Thanks
>> for clarifying!
>
> FWIW, presumably the failure comes from something throwing elog(ERROR)
> after RecordTransactionCommit().  So what I'd try is setting a
> breakpoint at errfinish and investigating what's causing the first error
> call.

Yeah -- will do. Thanks!

Joe


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
On 25/01/13 10:36, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> Doh! Yes of course, sorry for the noise. I was busy thinking that the
>> issue could be tied up with sinval and plan caching (if there is any) in
>> plr and got excited about seeing something in gdb...and didn't think
>> carefully about why what I was seeing was not a bug at all :-( Thanks
>> for clarifying!
>
> FWIW, presumably the failure comes from something throwing elog(ERROR)
> after RecordTransactionCommit().  So what I'd try is setting a
> breakpoint at errfinish and investigating what's causing the first error
> call.

If I have done this right, then this is the trace for the 1st message...
from my wandering through the calls here it looks like a normal commit,
and something goes a bit weird as SI messages are being processed...

0x00007f4e2a538425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f4e2a538425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f4e2a53bb8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00000000006e0a70 in errfinish (dummy=dummy@entry=0) at elog.c:523
#3  0x00000000006e3031 in elog_finish (elevel=elevel@entry=22,
     fmt=fmt@entry=0x714d58 "cannot abort transaction %u, it was already
committed") at elog.c:1202
#4  0x0000000000481d87 in RecordTransactionAbort (
     isSubXact=isSubXact@entry=0 '\000') at xact.c:1366
#5  0x0000000000481e89 in AbortTransaction () at xact.c:2272
#6  0x00000000004851c5 in AbortOutOfAnyTransaction () at xact.c:3807
#7  0x00000000006eddc9 in ShutdownPostgres (code=<optimized out>,
     arg=<optimized out>) at postinit.c:976
#8  0x0000000000610b7d in shmem_exit (code=code@entry=-1) at ipc.c:221
#9  0x0000000000610bf5 in proc_exit_prepare (code=-1) at ipc.c:181
#10 0x00007f4e2a53d901 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007f4e2a53d985 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007f4e272b951a in ?? () from /usr/lib/libR.so
#13 <signal handler called>
#14 0x00007f4e2a538707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00000000006152e5 in SICleanupQueue (
     callerHasWriteLock=callerHasWriteLock@entry=1 '\001',
     minFree=minFree@entry=4) at sinvaladt.c:672
#16 0x00000000006153bd in SIInsertDataEntries (data=0x3943b70, n=0)
     at sinvaladt.c:437
#17 0x00000000006d1371 in ProcessInvalidationMessagesMulti (
     hdr=hdr@entry=0x204a280, func=0x6149f0 <SendSharedInvalidMessages>)
     at inval.c:402
#18 0x00000000006d2020 in AtEOXact_Inval (isCommit=isCommit@entry=1 '\001')
     at inval.c:940
#19 0x0000000000482bdd in CommitTransaction () at xact.c:1893
#20 0x0000000000484975 in CommitTransactionCommand () at xact.c:2562
#21 0x00000000006276d3 in finish_xact_command () at postgres.c:2452
#22 finish_xact_command () at postgres.c:2441
#23 0x000000000062a6f5 in exec_simple_query (query_string=0x20e8d70
"COMMIT;")
     at postgres.c:1037
#24 PostgresMain (argc=<optimized out>, argv=argv@entry=0x20476b8,
     username=<optimized out>) at postgres.c:3968
#25 0x00000000005ee92f in BackendRun (port=0x2068f40) at postmaster.c:3617
#26 BackendStartup (port=0x2068f40) at postmaster.c:3302
#27 ServerLoop () at postmaster.c:1466
#28 0x00000000005ef4dc in PostmasterMain (argc=argc@entry=1,
     argv=argv@entry=0x2044640) at postmaster.c:1127
#29 0x0000000000453edb in main (argc=1, argv=0x2044640) at main.c:199
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> If I have done this right, then this is the trace for the 1st message...
> from my wandering through the calls here it looks like a normal commit,
> and something goes a bit weird as SI messages are being processed...

Seems like the critical bit is here:

> #11 0x00007f4e2a53d985 in exit () from /lib/x86_64-linux-gnu/libc.so.6
> #12 0x00007f4e272b951a in ?? () from /usr/lib/libR.so
> #13 <signal handler called>
> #14 0x00007f4e2a538707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
> #15 0x00000000006152e5 in SICleanupQueue (
>      callerHasWriteLock=callerHasWriteLock@entry=1 '\001',
>      minFree=minFree@entry=4) at sinvaladt.c:672

Frame 15 is definitely SICleanupQueue trying to send a catchup SIGUSR1
interrupt to the furthest-behind backend.  The fact that we go directly
into a signal handler from the kill() suggests that the furthest-behind
backend is actually *this* backend, which perhaps is a bit surprising,
but it's supposed to work.  What it looks like, though, is that libR has
commandeered the SIGUSR1 signal handler, and just to be extra special
unfriendly to the surrounding program, it does an exit() when it traps a
SIGUSR1.

Unless libR can be coerced into not screwing up our signal handlers,
I'd say that PL/R is broken beyond repair.  That would be unfortunate.

            regards, tom lane
On 2013-01-24 19:06:21 -0500, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> > If I have done this right, then this is the trace for the 1st message...
> > from my wandering through the calls here it looks like a normal commit,
> > and something goes a bit weird as SI messages are being processed...
>
> Seems like the critical bit is here:
>
> > #11 0x00007f4e2a53d985 in exit () from /lib/x86_64-linux-gnu/libc.so.6
> > #12 0x00007f4e272b951a in ?? () from /usr/lib/libR.so
> > #13 <signal handler called>
> > #14 0x00007f4e2a538707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
> > #15 0x00000000006152e5 in SICleanupQueue (
> >      callerHasWriteLock=callerHasWriteLock@entry=1 '\001',
> >      minFree=minFree@entry=4) at sinvaladt.c:672
>
> Frame 15 is definitely SICleanupQueue trying to send a catchup SIGUSR1
> interrupt to the furthest-behind backend.  The fact that we go directly
> into a signal handler from the kill() suggests that the furthest-behind
> backend is actually *this* backend, which perhaps is a bit surprising,
> but it's supposed to work.  What it looks like, though, is that libR has
> commandeered the SIGUSR1 signal handler, and just to be extra special
> unfriendly to the surrounding program, it does an exit() when it traps a
> SIGUSR1.
>
> Unless libR can be coerced into not screwing up our signal handlers,
> I'd say that PL/R is broken beyond repair.  That would be unfortunate.

I wonder whether we could place some Assert's somewhere that verify our
signal handlers are still setup, this isn't the first bug caused by pl's
or libraries overriding them...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 25/01/13 13:06, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> If I have done this right, then this is the trace for the 1st message...
>> from my wandering through the calls here it looks like a normal commit,
>> and something goes a bit weird as SI messages are being processed...
>
> Seems like the critical bit is here:
>
>> #11 0x00007f4e2a53d985 in exit () from /lib/x86_64-linux-gnu/libc.so.6
>> #12 0x00007f4e272b951a in ?? () from /usr/lib/libR.so
>> #13 <signal handler called>
>> #14 0x00007f4e2a538707 in kill () from /lib/x86_64-linux-gnu/libc.so.6
>> #15 0x00000000006152e5 in SICleanupQueue (
>>       callerHasWriteLock=callerHasWriteLock@entry=1 '\001',
>>       minFree=minFree@entry=4) at sinvaladt.c:672
>
> Frame 15 is definitely SICleanupQueue trying to send a catchup SIGUSR1
> interrupt to the furthest-behind backend.  The fact that we go directly
> into a signal handler from the kill() suggests that the furthest-behind
> backend is actually *this* backend, which perhaps is a bit surprising,
> but it's supposed to work.  What it looks like, though, is that libR has
> commandeered the SIGUSR1 signal handler, and just to be extra special
> unfriendly to the surrounding program, it does an exit() when it traps a
> SIGUSR1.
>
> Unless libR can be coerced into not screwing up our signal handlers,
> I'd say that PL/R is broken beyond repair.  That would be unfortunate.
>
>             regards, tom lane

It looks like Joe has run into something similar with libR stealing
SIGINT, he reinstalls it. A simple patch along the same lines for
SIGUSR1 (attached) seems to fix the issue.

I wonder if we need to install *all* the remaining signal handlers too?

Cheers

Mark


Attachment
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> On 25/01/13 13:06, Tom Lane wrote:
>> Unless libR can be coerced into not screwing up our signal handlers,
>> I'd say that PL/R is broken beyond repair.  That would be unfortunate.

> It looks like Joe has run into something similar with libR stealing
> SIGINT, he reinstalls it. A simple patch along the same lines for
> SIGUSR1 (attached) seems to fix the issue.

This certainly is not good enough, for either signal.  What happens if
the signal arrives while libR still has control?  These things being
asynchronous with respect to the receiving backend, we certainly can't
assume that that won't happen.

Why does libR think it should be messing with these handlers in the
first place?

            regards, tom lane
On 25/01/13 13:49, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> On 25/01/13 13:06, Tom Lane wrote:
>>> Unless libR can be coerced into not screwing up our signal handlers,
>>> I'd say that PL/R is broken beyond repair.  That would be unfortunate.
>
>> It looks like Joe has run into something similar with libR stealing
>> SIGINT, he reinstalls it. A simple patch along the same lines for
>> SIGUSR1 (attached) seems to fix the issue.
>
> This certainly is not good enough, for either signal.  What happens if
> the signal arrives while libR still has control?  These things being
> asynchronous with respect to the receiving backend, we certainly can't
> assume that that won't happen.
>
> Why does libR think it should be messing with these handlers in the
> first place?
>
>

Agreed - I will see if I can work out why.

Cheers

Mark
On 25/01/13 13:56, Mark Kirkwood wrote:
> On 25/01/13 13:49, Tom Lane wrote:
>> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>>> On 25/01/13 13:06, Tom Lane wrote:
>>>> Unless libR can be coerced into not screwing up our signal handlers,
>>>> I'd say that PL/R is broken beyond repair.  That would be unfortunate.
>>
>>> It looks like Joe has run into something similar with libR stealing
>>> SIGINT, he reinstalls it. A simple patch along the same lines for
>>> SIGUSR1 (attached) seems to fix the issue.
>>
>> This certainly is not good enough, for either signal.  What happens if
>> the signal arrives while libR still has control?  These things being
>> asynchronous with respect to the receiving backend, we certainly can't
>> assume that that won't happen.
>>
>> Why does libR think it should be messing with these handlers in the
>> first place?
>>
>>
>
> Agreed - I will see if I can work out why.
>


Looking at the R source (which reminds me of Postgres, nice, clear
code...), I saw quite a bit of code involving signal handlers - and
nothing along the lines of "don't set them up if you are in embedded
mode". So clearly more investigation is needed, and we really need to
take this up on the R mailing list I think. Joe - is this something you
would like to do? I am happy do it, but my knowledge of R is measured in
hours...

Cheers

Mark
On 29/01/13 10:29, Mark Kirkwood wrote:
> On 25/01/13 13:56, Mark Kirkwood wrote:
>> On 25/01/13 13:49, Tom Lane wrote:
>>> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>>>> On 25/01/13 13:06, Tom Lane wrote:
>>>>> Unless libR can be coerced into not screwing up our signal handlers,
>>>>> I'd say that PL/R is broken beyond repair.  That would be
>>>>> unfortunate.
>>>
>>>> It looks like Joe has run into something similar with libR stealing
>>>> SIGINT, he reinstalls it. A simple patch along the same lines for
>>>> SIGUSR1 (attached) seems to fix the issue.
>>>
>>> This certainly is not good enough, for either signal.  What happens if
>>> the signal arrives while libR still has control?  These things being
>>> asynchronous with respect to the receiving backend, we certainly can't
>>> assume that that won't happen.
>>>
>>> Why does libR think it should be messing with these handlers in the
>>> first place?
>>>
>>>
>>
>> Agreed - I will see if I can work out why.
>>
>
>
> Looking at the R source (which reminds me of Postgres, nice, clear
> code...), I saw quite a bit of code involving signal handlers - and
> nothing along the lines of "don't set them up if you are in embedded
> mode". So clearly more investigation is needed, and we really need to
> take this up on the R mailing list I think. Joe - is this something
> you would like to do? I am happy do it, but my knowledge of R is
> measured in hours...
>
>

Like a dog with a bone - sometimes it's hard to let go... given the
quite readable nature of the R source I managed to find what I *think*
is the "don't do R signals switch" (which I'd obviously missed before).
Making use of this seems to fix the original bug - and possibly the
SIGINT stealing too.

Patch attached to set the variable (R_SignalHandlers = 0), and remove
the SIGINT workaround.

Cheers

Mark



Attachment
On 01/28/2013 03:57 PM, Mark Kirkwood wrote:
> Like a dog with a bone - sometimes it's hard to let go... given the
> quite readable nature of the R source I managed to find what I *think*
> is the "don't do R signals switch" (which I'd obviously missed before).
> Making use of this seems to fix the original bug - and possibly the
> SIGINT stealing too.
>
> Patch attached to set the variable (R_SignalHandlers = 0), and remove
> the SIGINT workaround.

Cool -- thanks. Maybe that got added in a release since last I looked.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support