Thread: BUG in 10.1 - dsa_area could not attach to a segment that has been freed
BUG in 10.1 - dsa_area could not attach to a segment that has been freed
From
Alexander Voytsekhovskyy
Date:
Hi,
in very certain case i am getting error
"ERROR: dsa_area could not attach to a segment that has been freed"
I am runningPostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit
I am getting this error when
1) i set set enable_nestloop to off
2) run query
SELECT delivery_data.id_product AS id_product, client_tt.id_client_tt
FROM delivery_data JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt =
delivery_data.id_client_tt) JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt)
WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30' AND client_tt.id_client_tt IN (167427, 167429, 167430, 167433,
167434, 167435, 167436, 167437, 167438, 167485, 167520, 167521, 167527,
167528, 167529, 167530) AND delivery_data.id_ta IN (3911)
GROUP BY client_tt.id_client_tt, delivery_data.id_product
HAVING SUM(delivery_data.amount) > 0
ORDER BY client_tt.id_client_tt
Here is explain for this query:
https://explain.depesz.com/s/slpM
If i change any part of this query, like remove filter
client_tt.id_client_tt or try it on other data set - it works
For me it's looks like a bug (
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Michael Paquier
Date:
On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy <young.inbox@gmail.com> wrote: > in very certain case i am getting error > > "ERROR: dsa_area could not attach to a segment that has been freed" You are likely seeing the same problem as discussed here: https://www.postgresql.org/message-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com -- Michael
On Wed, Nov 29, 2017 at 12:18 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy > <young.inbox@gmail.com> wrote: >> in very certain case i am getting error >> >> "ERROR: dsa_area could not attach to a segment that has been freed" > > You are likely seeing the same problem as discussed here: > https://www.postgresql.org/message-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com Hmm. Well, it's the same error message but I don't think it's the same: that other thread was about C extension code using APIs incorrectly. This is parallel query, which shouldn't reach that error unless all the backends detach before one backend tries to attach, which I wouldn't normally expect to happen unless something else went wrong first. Then Alexander would presumably have also seen another error message too but he didn't say that. So I'm not yet sure how this condition is being reached. -- Thomas Munro http://www.enterprisedb.com
On Tue, Nov 28, 2017 at 7:04 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Hmm. Well, it's the same error message but I don't think it's the > same: that other thread was about C extension code using APIs > incorrectly. This is parallel query, which shouldn't reach that error > unless all the backends detach before one backend tries to attach, > which I wouldn't normally expect to happen unless something else went > wrong first. Why not? Can't it just be that the workers are slow getting started? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Why not? Can't it just be that the workers are slow getting started? In the normal non-error control flow, don't we expect ExecShutdownGather() to run ExecParallelFinish() before ExecParallelCleanup(), meaning that the leader waits for workers to finish completely before it detaches itself? Doesn't that need to be case to avoid random "unable to map dynamic shared memory segment" and "dsa_area could not attach to a segment that has been freed" errors, and for the parallel instrumentation shown in EXPLAIN to be reliable? Could it be that the leader thought that a worker didn't start up, but in fact it did? -- Thomas Munro http://www.enterprisedb.com
On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy <young.inbox@gmail.com> wrote: > Hi, > > in very certain case i am getting error > > "ERROR: dsa_area could not attach to a segment that has been freed" > Can you try to get call stack so that we can get a better picture of what is happening here? One way could be you can change this ERROR to PANIC which will generate core dump and you can get stack trace. > > Here is explain for this query: > https://explain.depesz.com/s/slpM > > If i change any part of this query, like remove filter > client_tt.id_client_tt or try it on other data set - it works > Can you share the plan after you change the query? > For me it's looks like a bug ( Yeah. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila
wrote:
> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
> wrote:
> > Hi,
> >
> > in very certain case i am getting error
> >
> > "ERROR: dsa_area could not attach to a segment that has been freed"
> >
>
> Can you try to get call stack so that we can get a better picture of
> what is happening here? One way could be you can change this ERROR to
> PANIC which will generate core dump and you can get stack trace.
>
>
Yes, i can do both - i can reproduce it on test environment also
is there somewhere manual for that?
> >
> > Here is explain for this query:
> > https://explain.depesz.com/s/slpM
> >
> > If i change any part of this query, like remove filter
> > client_tt.id_client_tt or try it on other data set - it works
> >
>
> Can you share the plan after you change the query?
>
Yes, here is query without one of conditions in WHERE section
SELECT delivery_data.id_product AS id_product, client_tt.id_client_tt
FROM delivery_data JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt =
delivery_data.id_client_tt) JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt)
WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30' AND delivery_data.id_ta IN (3911)
GROUP BY client_tt.id_client_tt, delivery_data.id_product
HAVING SUM(delivery_data.amount) > 0
ORDER BY client_tt.id_client_tt
And explain:
https://explain.depesz.com/s/9aBI
>
> > For me it's looks like a bug (
>
> Yeah.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy <young.inbox@gmail.com> wrote: > > > On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com> > wrote: >> >> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy >> <young.inbox@gmail.com> wrote: >> > Hi, >> > >> > in very certain case i am getting error >> > >> > "ERROR: dsa_area could not attach to a segment that has been freed" >> > >> >> Can you try to get call stack so that we can get a better picture of >> what is happening here? One way could be you can change this ERROR to >> PANIC which will generate core dump and you can get stack trace. >> > > Yes, i can do both - i can reproduce it on test environment also > > is there somewhere manual for that? > You can get some information from the below link: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
Here is GDB log:
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00005560250a10cf in ?? ()
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00005560250a10cf in ?? ()
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08e3 in ?? ()
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08e3 in ?? ()
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000556024ec9896 in bringetbitmap ()
Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453
On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila
wrote:
> On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
> wrote:
> >
> >
> > On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila
> > wrote:
> >>
> >> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
> >> wrote:
> >> > Hi,
> >> >
> >> > in very certain case i am getting error
> >> >
> >> > "ERROR: dsa_area could not attach to a segment that has been freed"
> >> >
> >>
> >> Can you try to get call stack so that we can get a better picture of
> >> what is happening here? One way could be you can change this ERROR to
> >> PANIC which will generate core dump and you can get stack trace.
> >>
> >
> > Yes, i can do both - i can reproduce it on test environment also
> >
> > is there somewhere manual for that?
> >
>
> You can get some information from the below link:
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_
> a_running_PostgreSQL_backend_on_Linux/BSD
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
On Tue, Nov 28, 2017 at 8:17 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Why not? Can't it just be that the workers are slow getting started? > > In the normal non-error control flow, don't we expect > ExecShutdownGather() to run ExecParallelFinish() before > ExecParallelCleanup(), meaning that the leader waits for workers to > finish completely before it detaches itself? Doesn't that need to be > case to avoid random "unable to map dynamic shared memory segment" and > "dsa_area could not attach to a segment that has been freed" errors, > and for the parallel instrumentation shown in EXPLAIN to be reliable? Oh, hmm. > Could it be that the leader thought that a worker didn't start up, but > in fact it did? Well, I don't know how that could happen, but I can't prove it didn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
I see my fault
here is GDB with correct symbols
Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=, key=8192, tb=)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
540/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
No such file or directory.
Continuing.
Program received signal SIGUSR1, User defined signal 1.
pagetable_insert (found=, key=8192, tb=)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
540 in
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60,
natts=natts@entry=29) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073
1073
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:
No such file or directory.
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60,
natts=natts@entry=29) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073
1073 in
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
84 in ../sysdeps/unix/syscall-template.S
Continuing.
Program received signal SIGUSR1, User defined signal 1.
0x00005560250a0c4f in pagetable_insert (found=,
key=14592, tb=0x556026699df0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:493
493/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
No such file or directory.
Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453
On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila
wrote:
> On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
> wrote:
> >
> >
> > On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila
> > wrote:
> >>
> >> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
> >> wrote:
> >> > Hi,
> >> >
> >> > in very certain case i am getting error
> >> >
> >> > "ERROR: dsa_area could not attach to a segment that has been freed"
> >> >
> >>
> >> Can you try to get call stack so that we can get a better picture of
> >> what is happening here? One way could be you can change this ERROR to
> >> PANIC which will generate core dump and you can get stack trace.
> >>
> >
> > Yes, i can do both - i can reproduce it on test environment also
> >
> > is there somewhere manual for that?
> >
>
> You can get some information from the below link:
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_
> a_running_PostgreSQL_backend_on_Linux/BSD
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote: > I see my fault > > here is GDB with correct symbols > > Program received signal SIGUSR1, User defined signal 1. > pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized > out>) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540 > 540 > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h: > No such file or directory. > Continuing. > I think you really need to do "handle SIGUSR1 noprint nostop" to eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue you're trying to investigate. Furthermore, this does not give us any sort of stack trace. You need to either set a breakpoint at the elog() call, which I believe is this one: https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/utils/mmgr/dsa.c#L1736 wait for it to get triggered, and then generate backtrace using 'bt'. So something like 1) gdb -p $PID 2) (gdb) handle SIGUSR1 noprint nostop 3) (gdb) break dsa.c:1736 4) run the query, gdb should interrupt at the breakpoint 5) (gdb) bt and then send us the output of that gdb command. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
thanks for detailed instructions!
Here it is:
Note: breakpoint 1 also set at pc 0x5560252c21f5.
Breakpoint 2 at 0x5560252c21f5: file
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c,
line 1736.
(gdb) bt
#0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at
../sysdeps/unix/syscall-template.S:84
#1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048
#2 WaitEventSetWait (set=0x556026359498, timeout=timeout@entry=-1,
occurred_events=occurred_events@entry=0x7ffc2849a460, nevents=nevents@entry=1,
wait_event_info=wait_event_info@entry=100663296) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1000
#3 0x0000556025089767 in secure_read (port=0x55602639ddb0,
ptr=0x5560257195a0 , len=8192) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/be-secure.c:166
#4 0x0000556025094f48 in pq_recvbuf () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:963
#5 0x0000556025095c85 in pq_getbyte () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:1006
#6 0x000055602518d350 in SocketBackend (inBuf=0x7ffc2849a620) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:328
#7 ReadCommand (inBuf=0x7ffc2849a620) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:501
#8 PostgresMain (argc=, argv=argv@entry=0x5560263a06e0,
dbname=0x5560263a05d8 "ddc_pvmrussia", username=) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4030
#9 0x0000556024ec7fe9 in BackendRun (port=0x55602639ddb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357
#10 BackendStartup (port=0x55602639ddb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029
#11 ServerLoop () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753
#12 0x000055602511d36b in PostmasterMain (argc=5, argv=) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361
#13 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228
(gdb)
On Wed, Nov 29, 2017 at 6:02 PM, Tomas Vondra
wrote:
>
> On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote:
> > I see my fault
> >
> > here is GDB with correct symbols
> >
> > Program received signal SIGUSR1, User defined signal 1.
> > pagetable_insert (found=, key=8192, tb= > out>) at
> > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../
> src/include/lib/simplehash.h:540
> > 540
> > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../
> src/include/lib/simplehash.h:
> > No such file or directory.
> > Continuing.
> >
>
> I think you really need to do "handle SIGUSR1 noprint nostop" to
> eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue
> you're trying to investigate.
>
> Furthermore, this does not give us any sort of stack trace. You need to
> either set a breakpoint at the elog() call, which I believe is this one:
>
> https://github.com/postgres/postgres/blob/REL_10_STABLE/
> src/backend/utils/mmgr/dsa.c#L1736
>
> wait for it to get triggered, and then generate backtrace using 'bt'.
>
>
> So something like
>
> 1) gdb -p $PID
>
> 2) (gdb) handle SIGUSR1 noprint nostop
>
> 3) (gdb) break dsa.c:1736
>
> 4) run the query, gdb should interrupt at the breakpoint
>
> 5) (gdb) bt
>
> and then send us the output of that gdb command.
>
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote: > thanks for detailed instructions! > > Here it is: > > Note: breakpoint 1 also set at pc 0x5560252c21f5. > Breakpoint 2 at 0x5560252c21f5: file > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c, > line 1736. > (gdb) bt > #0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at > ../sysdeps/unix/syscall-template.S:84 > #1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1, > occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048 Unfortunately, that's still not it. This shows the backtrace at the moment when gdb attached to the process (because it interrupts the exectution). You need to do "c" first, to continue the execution. So it should be: 1) gdb -p $PID 2) (gdb) handle SIGUSR1 noprint nostop 3) (gdb) break dsa.c:1736 4) (gdb) c 5) run the query, gdb should interrupt at the breakpoint 6) (gdb) bt regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
Thanks for helping, here is one more try
#0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
#1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
dp=dp@entry=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
#2 0x00005560250a2c2b in tbm_attach_shared_iterate
(dsa=dsa@entry=0x556026700be8,
dp=1099511685280) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
#3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710)
at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176
#4 0x000055602505b36d in ExecScanFetch (recheckMtd=0x556025066470
, accessMtd=0x556025066500 ,
node=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:97
#5 ExecScan (node=0x556026460710, accessMtd=0x556025066500
, recheckMtd=0x556025066470 ) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:164
#6 0x0000556025061033 in ExecProcNode (node=0x556026460710) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#7 ExecAppend (pstate=0x556026582ff8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAppend.c:222
#8 0x0000556025068ce6 in ExecProcNode (node=0x556026582ff8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#9 gather_getnext (gatherstate=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:281
#10 ExecGather (pstate=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:215
#11 0x000055602506b6d8 in ExecProcNode (node=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#12 ExecHashJoinOuterGetTuple (hashvalue=0x7ffc2849a0dc,
hjstate=0x5560263b0220, outerNode=0x5560263b0ae0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:612
#13 ExecHashJoin (pstate=0x5560263b0220) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:201
#14 0x0000556025077af6 in ExecProcNode (node=0x5560263b0220) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#15 ExecSort (pstate=0x5560263affb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeSort.c:106
#16 0x000055602506138c in ExecProcNode (node=0x5560263affb0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#17 fetch_input_tuple (aggstate=aggstate@entry=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:695
#18 0x000055602506363b in agg_retrieve_direct (aggstate=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2336
#19 ExecAgg (pstate=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2147
#20 0x000055602505528b in ExecProcNode (node=0x5560263aef38) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250
#21 ExecutePlan (execute_once=, dest=0x5560266986c0,
direction=, numberTuples=0, sendTuples=,
operation=CMD_SELECT, use_parallel_mode=,
planstate=0x5560263aef38, estate=0x5560263aecd8) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:1722
#22 standard_ExecutorRun (queryDesc=0x556026567f28, direction=, count=0, execute_once=) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:363
#23 0x000055602518f95b in PortalRunSelect (portal=portal@entry=0x556026377d48,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x5560266986c0) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:932
#24 0x0000556025190f58 in PortalRun (portal=portal@entry=0x556026377d48,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x5560266986c0,
altdest=altdest@entry=0x5560266986c0, completionTag=0x7ffc2849a470 "") at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:773
#25 0x000055602518c8dc in exec_simple_query (query_string=0x55602640eda8
"SELECT\n delivery_data.id_product AS id_product,\n
client_tt.id_client_tt\nFROM delivery_data\n JOIN client_tt AS
client_tt_tmp ON (client_tt_tmp.id_client_tt
=\ndelivery_data.id_client_tt)\n JOIN clien"...) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:1099
#26 0x000055602518ddf6 in PostgresMain (argc=,
argv=argv@entry=0x5560263a08e0, dbname=0x5560263a07d8 "ddc_pvmrussia",
username=) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4088
#27 0x0000556024ec7fe9 in BackendRun (port=0x55602639e570) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357
#28 BackendStartup (port=0x55602639e570) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029
#29 ServerLoop () at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753
#30 0x000055602511d36b in PostmasterMain (argc=5, argv=) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361
#31 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at
/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228
On Wed, Nov 29, 2017 at 8:41 PM, Tomas Vondra
wrote:
> On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote:
> > thanks for detailed instructions!
> >
> > Here it is:
> >
> > Note: breakpoint 1 also set at pc 0x5560252c21f5.
> > Breakpoint 2 at 0x5560252c21f5: file
> > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../
> src/backend/utils/mmgr/dsa.c,
> > line 1736.
> > (gdb) bt
> > #0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at
> > ../sysdeps/unix/syscall-template.S:84
> > #1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
> > occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
> > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../
> src/backend/storage/ipc/latch.c:1048
>
> Unfortunately, that's still not it. This shows the backtrace at the
> moment when gdb attached to the process (because it interrupts the
> exectution). You need to do "c" first, to continue the execution.
>
> So it should be:
>
> 1) gdb -p $PID
> 2) (gdb) handle SIGUSR1 noprint nostop
> 3) (gdb) break dsa.c:1736
> 4) (gdb) c
> 5) run the query, gdb should interrupt at the breakpoint
> 6) (gdb) bt
>
>
> regards
>
> --
> Tomas Vondra http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy <young.inbox@gmail.com> wrote: > Thanks for helping, here is one more try > > #0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736 > #1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8, > dp=dp@entry=1099511685280) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945 > #2 0x00005560250a2c2b in tbm_attach_shared_iterate > (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503 > #3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176 Thank you for the report and the back trace. I think this might be a manifestation of the problem I just described[1] on -hackers. Depending on the shape of a multi-Gather query plan and therefore the order of control flow, you might finish up using the DSA area that belongs to a different Gather node and then find that it goes away too soon. Investigating. https://www.postgresql.org/message-id/CAEepm%3D1U6as%3DbrnVvMNixEV2tpi8NuyQoTmO8Qef0-VV%2B%3D7MDA%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy > <young.inbox@gmail.com> wrote: >> Thanks for helping, here is one more try >> >> #0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at >> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736 >> #1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8, >> dp=dp@entry=1099511685280) at >> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945 >> #2 0x00005560250a2c2b in tbm_attach_shared_iterate >> (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at >> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503 >> #3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at >> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176 > > Thank you for the report and the back trace. I think this might be a > manifestation of the problem I just described[1] on -hackers. > Depending on the shape of a multi-Gather query plan and therefore the > order of control flow, you might finish up using the DSA area that > belongs to a different Gather node and then find that it goes away too > soon. Investigating. I haven't managed to reproduce this, but I was coincidentally investigating a bug that appears to explain it. I think what happened is that a background worker was first to execute BitmapHeapNext and allocated a dsa_pointer, and then the leader process reached BitmapHeapNext and called tbm_attach_shared_iterate which tried to deference it, but it had es_query_dsa set to another gather node's DSA area (whichever Gather most recently ran ExecInitParallelPlan). That requires a certain order of execution and timing that I'm not sure how to reach. I have posted a patch that should fix it over here: https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com Are you able to provide a minimal reproducer, an anonymised partial dump, or perhaps try out the patch on a copy of your database? -- Thomas Munro http://www.enterprisedb.com
Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed
From
Alexander Voytsekhovskyy
Date:
Hi, i can confirm that this patch fix the bug for me. I can’t provide minimal reproducer, but i can share whole data set with you > On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy >> <young.inbox@gmail.com> wrote: >>> Thanks for helping, here is one more try >>> >>> #0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at >>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736 >>> #1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8, >>> dp=dp@entry=1099511685280) at >>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945 >>> #2 0x00005560250a2c2b in tbm_attach_shared_iterate >>> (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at >>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503 >>> #3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at >>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176 >> >> Thank you for the report and the back trace. I think this might be a >> manifestation of the problem I just described[1] on -hackers. >> Depending on the shape of a multi-Gather query plan and therefore the >> order of control flow, you might finish up using the DSA area that >> belongs to a different Gather node and then find that it goes away too >> soon. Investigating. > > I haven't managed to reproduce this, but I was coincidentally > investigating a bug that appears to explain it. I think what happened > is that a background worker was first to execute BitmapHeapNext and > allocated a dsa_pointer, and then the leader process reached > BitmapHeapNext and called tbm_attach_shared_iterate which tried to > deference it, but it had es_query_dsa set to another gather node's DSA > area (whichever Gather most recently ran ExecInitParallelPlan). That > requires a certain order of execution and timing that I'm not sure how > to reach. I have posted a patch that should fix it over here: > > https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com > > Are you able to provide a minimal reproducer, an anonymised partial > dump, or perhaps try out the patch on a copy of your database? > > -- > Thomas Munro > http://www.enterprisedb.com
On Fri, Dec 8, 2017 at 9:55 AM, Alexander Voytsekhovskyy <av@mobile-ua.com> wrote: > i can confirm that this patch fix the bug for me. Great news. Thanks for taking the time to test it! > I can’t provide minimal reproducer, but i can share whole data set with you It's OK, your report is enough for me (I don't want to receive any non-anonymised data). -- Thomas Munro http://www.enterprisedb.com