Thread: BUG in 10.1 - dsa_area could not attach to a segment that has been freed

BUG in 10.1 - dsa_area could not attach to a segment that has been freed

From

Alexander Voytsekhovskyy

Date:

28 November 2017, 20:17:39

Hi, in very certain case i am getting error "ERROR: dsa_area could not attach to a segment that has been freed" I am runningPostgreSQL 10.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit I am getting this error when 1) i set set enable_nestloop to off 2) run query SELECT delivery_data.id_product AS id_product, client_tt.id_client_tt FROM delivery_data JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt = delivery_data.id_client_tt) JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt) WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30' AND client_tt.id_client_tt IN (167427, 167429, 167430, 167433, 167434, 167435, 167436, 167437, 167438, 167485, 167520, 167521, 167527, 167528, 167529, 167530) AND delivery_data.id_ta IN (3911) GROUP BY client_tt.id_client_tt, delivery_data.id_product HAVING SUM(delivery_data.amount) > 0 ORDER BY client_tt.id_client_tt Here is explain for this query: https://explain.depesz.com/s/slpM If i change any part of this query, like remove filter client_tt.id_client_tt or try it on other data set - it works For me it's looks like a bug (

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Michael Paquier

Date:

29 November 2017, 02:18:10

On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:
> in very certain case i am getting error
>
> "ERROR:  dsa_area could not attach to a segment that has been freed"

You are likely seeing the same problem as discussed here:
https://www.postgresql.org/message-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com
-- 
Michael

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Thomas Munro

Date:

29 November 2017, 03:04:34

On Wed, Nov 29, 2017 at 12:18 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 2:17 AM, Alexander Voytsekhovskyy
> <young.inbox@gmail.com> wrote:
>> in very certain case i am getting error
>>
>> "ERROR:  dsa_area could not attach to a segment that has been freed"
>
> You are likely seeing the same problem as discussed here:
> https://www.postgresql.org/message-id/15e84857859.d4f2ff2b8299.3910523281971861294@zohocorp.com

Hmm.  Well, it's the same error message but I don't think it's the
same: that other thread was about C extension code using APIs
incorrectly.  This is parallel query, which shouldn't reach that error
unless all the backends detach before one backend tries to attach,
which I wouldn't normally expect to happen unless something else went
wrong first.  Then Alexander would presumably have also seen another
error message too but he didn't say that.  So I'm not yet sure how
this condition is being reached.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Robert Haas

Date:

29 November 2017, 03:33:30

On Tue, Nov 28, 2017 at 7:04 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hmm.  Well, it's the same error message but I don't think it's the
> same: that other thread was about C extension code using APIs
> incorrectly.  This is parallel query, which shouldn't reach that error
> unless all the backends detach before one backend tries to attach,
> which I wouldn't normally expect to happen unless something else went
> wrong first.

Why not?  Can't it just be that the workers are slow getting started?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Thomas Munro

Date:

29 November 2017, 04:17:55

On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Why not?  Can't it just be that the workers are slow getting started?

In the normal non-error control flow, don't we expect
ExecShutdownGather() to run ExecParallelFinish() before
ExecParallelCleanup(), meaning that the leader waits for workers to
finish completely before it detaches itself?  Doesn't that need to be
case to avoid random "unable to map dynamic shared memory segment" and
"dsa_area could not attach to a segment that has been freed" errors,
and for the parallel instrumentation shown in EXPLAIN to be reliable?

Could it be that the leader thought that a worker didn't start up, but
in fact it did?

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Amit Kapila

Date:

29 November 2017, 15:38:32

On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:
> Hi,
>
> in very certain case i am getting error
>
> "ERROR:  dsa_area could not attach to a segment that has been freed"
>

Can you try to get call stack so that we can get a better picture of
what is happening here?  One way could be you can change this ERROR to
PANIC which will generate core dump and you can get stack trace.

>
> Here is explain for this query:
> https://explain.depesz.com/s/slpM
>
> If i change any part of this query, like remove filter
> client_tt.id_client_tt or try it on other data set - it works
>

Can you share the plan after you change the query?

> For me it's looks like a bug (

Yeah.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

29 November 2017, 16:28:07

On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila wrote: > On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy > wrote: > > Hi, > > > > in very certain case i am getting error > > > > "ERROR: dsa_area could not attach to a segment that has been freed" > > > > Can you try to get call stack so that we can get a better picture of > what is happening here? One way could be you can change this ERROR to > PANIC which will generate core dump and you can get stack trace. > > Yes, i can do both - i can reproduce it on test environment also is there somewhere manual for that? > > > > Here is explain for this query: > > https://explain.depesz.com/s/slpM > > > > If i change any part of this query, like remove filter > > client_tt.id_client_tt or try it on other data set - it works > > > > Can you share the plan after you change the query? > Yes, here is query without one of conditions in WHERE section SELECT delivery_data.id_product AS id_product, client_tt.id_client_tt FROM delivery_data JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt = delivery_data.id_client_tt) JOIN client_tt ON (client_tt_tmp.id_unique = client_tt.id_client_tt) WHERE delivery_data.delivery_date BETWEEN '2017-09-01' AND '2017-11-30' AND delivery_data.id_ta IN (3911) GROUP BY client_tt.id_client_tt, delivery_data.id_product HAVING SUM(delivery_data.amount) > 0 ORDER BY client_tt.id_client_tt And explain: https://explain.depesz.com/s/9aBI > > > For me it's looks like a bug ( > > Yeah. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com >

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Amit Kapila

Date:

29 November 2017, 17:12:03

On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:
>
>
> On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy
>> <young.inbox@gmail.com> wrote:
>> > Hi,
>> >
>> > in very certain case i am getting error
>> >
>> > "ERROR:  dsa_area could not attach to a segment that has been freed"
>> >
>>
>> Can you try to get call stack so that we can get a better picture of
>> what is happening here?  One way could be you can change this ERROR to
>> PANIC which will generate core dump and you can get stack trace.
>>
>
> Yes, i can do both - i can reproduce it on test environment also
>
> is there somewhere manual for that?
>

You can get some information from the below link:
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

29 November 2017, 17:32:19

Here is GDB log: Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 in ../sysdeps/unix/syscall-template.S Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00005560250a10cf in ?? () Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00005560250a10cf in ?? () Continuing. Program received signal SIGUSR1, User defined signal 1. 0x0000556024ed08e3 in ?? () Continuing. Program received signal SIGUSR1, User defined signal 1. 0x0000556024ed08e3 in ?? () Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 in ../sysdeps/unix/syscall-template.S Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 in ../sysdeps/unix/syscall-template.S Continuing. Program received signal SIGUSR1, User defined signal 1. 0x0000556024ec9896 in bringetbitmap () Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453 On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila wrote: > On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy > wrote: > > > > > > On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila > > wrote: > >> > >> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy > >> wrote: > >> > Hi, > >> > > >> > in very certain case i am getting error > >> > > >> > "ERROR: dsa_area could not attach to a segment that has been freed" > >> > > >> > >> Can you try to get call stack so that we can get a better picture of > >> what is happening here? One way could be you can change this ERROR to > >> PANIC which will generate core dump and you can get stack trace. > >> > > > > Yes, i can do both - i can reproduce it on test environment also > > > > is there somewhere manual for that? > > > > You can get some information from the below link: > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_ > a_running_PostgreSQL_backend_on_Linux/BSD > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com >

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Robert Haas

Date:

29 November 2017, 18:36:11

On Tue, Nov 28, 2017 at 8:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Nov 29, 2017 at 1:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Why not?  Can't it just be that the workers are slow getting started?
>
> In the normal non-error control flow, don't we expect
> ExecShutdownGather() to run ExecParallelFinish() before
> ExecParallelCleanup(), meaning that the leader waits for workers to
> finish completely before it detaches itself?  Doesn't that need to be
> case to avoid random "unable to map dynamic shared memory segment" and
> "dsa_area could not attach to a segment that has been freed" errors,
> and for the parallel instrumentation shown in EXPLAIN to be reliable?

Oh, hmm.

> Could it be that the leader thought that a worker didn't start up, but
> in fact it did?

Well, I don't know how that could happen, but I can't prove it didn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

29 November 2017, 18:46:27

I see my fault here is GDB with correct symbols Program received signal SIGUSR1, User defined signal 1. pagetable_insert (found=, key=8192, tb=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540 540/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h: No such file or directory. Continuing. Program received signal SIGUSR1, User defined signal 1. pagetable_insert (found=, key=8192, tb=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540 540 in /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h Continuing. Program received signal SIGUSR1, User defined signal 1. 0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60, natts=natts@entry=29) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073 1073 /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c: No such file or directory. Continuing. Program received signal SIGUSR1, User defined signal 1. 0x0000556024ed08ee in slot_deform_tuple (slot=slot@entry=0x556026638e60, natts=natts@entry=29) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c:1073 1073 in /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/access/common/heaptuple.c Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 ../sysdeps/unix/syscall-template.S: No such file or directory. Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 84 in ../sysdeps/unix/syscall-template.S Continuing. Program received signal SIGUSR1, User defined signal 1. 0x00005560250a0c4f in pagetable_insert (found=, key=14592, tb=0x556026699df0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:493 493/build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h: No such file or directory. Detaching from program: /usr/lib/postgresql/10/bin/postgres, process 7453 On Wed, Nov 29, 2017 at 4:12 PM, Amit Kapila wrote: > On Wed, Nov 29, 2017 at 6:58 PM, Alexander Voytsekhovskyy > wrote: > > > > > > On Wed, Nov 29, 2017 at 2:38 PM, Amit Kapila > > wrote: > >> > >> On Tue, Nov 28, 2017 at 10:47 PM, Alexander Voytsekhovskyy > >> wrote: > >> > Hi, > >> > > >> > in very certain case i am getting error > >> > > >> > "ERROR: dsa_area could not attach to a segment that has been freed" > >> > > >> > >> Can you try to get call stack so that we can get a better picture of > >> what is happening here? One way could be you can change this ERROR to > >> PANIC which will generate core dump and you can get stack trace. > >> > > > > Yes, i can do both - i can reproduce it on test environment also > > > > is there somewhere manual for that? > > > > You can get some information from the below link: > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_ > a_running_PostgreSQL_backend_on_Linux/BSD > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com >

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Tomas Vondra

Date:

29 November 2017, 19:02:01

On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote:
> I see my fault
> 
> here is GDB with correct symbols
> 
> Program received signal SIGUSR1, User defined signal 1.
> pagetable_insert (found=<synthetic pointer>, key=8192, tb=<optimized
> out>) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:540
> 540   
>  /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/lib/simplehash.h:
> No such file or directory.
> Continuing.
> 

I think you really need to do "handle SIGUSR1 noprint nostop" to
eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue
you're trying to investigate.

Furthermore, this does not give us any sort of stack trace. You need to
either set a breakpoint at the elog() call, which I believe is this one:

https://github.com/postgres/postgres/blob/REL_10_STABLE/src/backend/utils/mmgr/dsa.c#L1736

wait for it to get triggered, and then generate backtrace using 'bt'.

So something like

1) gdb -p $PID

2) (gdb) handle SIGUSR1 noprint nostop

3) (gdb) break dsa.c:1736

4) run the query, gdb should interrupt at the breakpoint

5) (gdb) bt

and then send us the output of that gdb command.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

29 November 2017, 19:23:49

thanks for detailed instructions! Here it is: Note: breakpoint 1 also set at pc 0x5560252c21f5. Breakpoint 2 at 0x5560252c21f5: file /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c, line 1736. (gdb) bt #0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84 #1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048 #2 WaitEventSetWait (set=0x556026359498, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc2849a460, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=100663296) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1000 #3 0x0000556025089767 in secure_read (port=0x55602639ddb0, ptr=0x5560257195a0 , len=8192) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/be-secure.c:166 #4 0x0000556025094f48 in pq_recvbuf () at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:963 #5 0x0000556025095c85 in pq_getbyte () at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/libpq/pqcomm.c:1006 #6 0x000055602518d350 in SocketBackend (inBuf=0x7ffc2849a620) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:328 #7 ReadCommand (inBuf=0x7ffc2849a620) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:501 #8 PostgresMain (argc=, argv=argv@entry=0x5560263a06e0, dbname=0x5560263a05d8 "ddc_pvmrussia", username=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4030 #9 0x0000556024ec7fe9 in BackendRun (port=0x55602639ddb0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357 #10 BackendStartup (port=0x55602639ddb0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029 #11 ServerLoop () at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753 #12 0x000055602511d36b in PostmasterMain (argc=5, argv=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361 #13 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228 (gdb) On Wed, Nov 29, 2017 at 6:02 PM, Tomas Vondra wrote: > > On 11/29/2017 04:46 PM, Alexander Voytsekhovskyy wrote: > > I see my fault > > > > here is GDB with correct symbols > > > > Program received signal SIGUSR1, User defined signal 1. > > pagetable_insert (found=, key=8192, tb= > out>) at > > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../ > src/include/lib/simplehash.h:540 > > 540 > > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../ > src/include/lib/simplehash.h: > > No such file or directory. > > Continuing. > > > > I think you really need to do "handle SIGUSR1 noprint nostop" to > eliminate the SIGUSR1 noise. That is absolutely irrelevant to the issue > you're trying to investigate. > > Furthermore, this does not give us any sort of stack trace. You need to > either set a breakpoint at the elog() call, which I believe is this one: > > https://github.com/postgres/postgres/blob/REL_10_STABLE/ > src/backend/utils/mmgr/dsa.c#L1736 > > wait for it to get triggered, and then generate backtrace using 'bt'. > > > So something like > > 1) gdb -p $PID > > 2) (gdb) handle SIGUSR1 noprint nostop > > 3) (gdb) break dsa.c:1736 > > 4) run the query, gdb should interrupt at the breakpoint > > 5) (gdb) bt > > and then send us the output of that gdb command. > > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Tomas Vondra

Date:

29 November 2017, 21:41:39

On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote:
> thanks for detailed instructions!
> 
> Here it is:
> 
> Note: breakpoint 1 also set at pc 0x5560252c21f5.
> Breakpoint 2 at 0x5560252c21f5: file
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c,
> line 1736.
> (gdb) bt
> #0  0x00007f53e538e9b3 in __epoll_wait_nocancel () at
> ../sysdeps/unix/syscall-template.S:84
> #1  0x000055602516a791 in WaitEventSetWaitBlock (nevents=1,
> occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/storage/ipc/latch.c:1048

Unfortunately, that's still not it.  This shows the backtrace at the
moment when gdb attached to the process (because it interrupts the
exectution). You need to do "c" first, to continue the execution.

So it should be:

1) gdb -p $PID
2) (gdb) handle SIGUSR1 noprint nostop
3) (gdb) break dsa.c:1736
4) (gdb) c
5) run the query, gdb should interrupt at the breakpoint
6) (gdb) bt


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

29 November 2017, 23:34:30

Thanks for helping, here is one more try #0 get_segment_by_index (area=area@entry=0x556026700be8, index=1) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736 #1 0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8, dp=dp@entry=1099511685280) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945 #2 0x00005560250a2c2b in tbm_attach_shared_iterate (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503 #3 0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176 #4 0x000055602505b36d in ExecScanFetch (recheckMtd=0x556025066470 , accessMtd=0x556025066500 , node=0x556026460710) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:97 #5 ExecScan (node=0x556026460710, accessMtd=0x556025066500 , recheckMtd=0x556025066470 ) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execScan.c:164 #6 0x0000556025061033 in ExecProcNode (node=0x556026460710) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #7 ExecAppend (pstate=0x556026582ff8) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAppend.c:222 #8 0x0000556025068ce6 in ExecProcNode (node=0x556026582ff8) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #9 gather_getnext (gatherstate=0x5560263b0ae0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:281 #10 ExecGather (pstate=0x5560263b0ae0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeGather.c:215 #11 0x000055602506b6d8 in ExecProcNode (node=0x5560263b0ae0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #12 ExecHashJoinOuterGetTuple (hashvalue=0x7ffc2849a0dc, hjstate=0x5560263b0220, outerNode=0x5560263b0ae0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:612 #13 ExecHashJoin (pstate=0x5560263b0220) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeHashjoin.c:201 #14 0x0000556025077af6 in ExecProcNode (node=0x5560263b0220) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #15 ExecSort (pstate=0x5560263affb0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeSort.c:106 #16 0x000055602506138c in ExecProcNode (node=0x5560263affb0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #17 fetch_input_tuple (aggstate=aggstate@entry=0x5560263aef38) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:695 #18 0x000055602506363b in agg_retrieve_direct (aggstate=0x5560263aef38) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2336 #19 ExecAgg (pstate=0x5560263aef38) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeAgg.c:2147 #20 0x000055602505528b in ExecProcNode (node=0x5560263aef38) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/include/executor/executor.h:250 #21 ExecutePlan (execute_once=, dest=0x5560266986c0, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x5560263aef38, estate=0x5560263aecd8) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:1722 #22 standard_ExecutorRun (queryDesc=0x556026567f28, direction=, count=0, execute_once=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/execMain.c:363 #23 0x000055602518f95b in PortalRunSelect (portal=portal@entry=0x556026377d48, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x5560266986c0) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:932 #24 0x0000556025190f58 in PortalRun (portal=portal@entry=0x556026377d48, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x5560266986c0, altdest=altdest@entry=0x5560266986c0, completionTag=0x7ffc2849a470 "") at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/pquery.c:773 #25 0x000055602518c8dc in exec_simple_query (query_string=0x55602640eda8 "SELECT\n delivery_data.id_product AS id_product,\n client_tt.id_client_tt\nFROM delivery_data\n JOIN client_tt AS client_tt_tmp ON (client_tt_tmp.id_client_tt =\ndelivery_data.id_client_tt)\n JOIN clien"...) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:1099 #26 0x000055602518ddf6 in PostgresMain (argc=, argv=argv@entry=0x5560263a08e0, dbname=0x5560263a07d8 "ddc_pvmrussia", username=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/tcop/postgres.c:4088 #27 0x0000556024ec7fe9 in BackendRun (port=0x55602639e570) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4357 #28 BackendStartup (port=0x55602639e570) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:4029 #29 ServerLoop () at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1753 #30 0x000055602511d36b in PostmasterMain (argc=5, argv=) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/postmaster/postmaster.c:1361 #31 0x0000556024ec93e5 in main (argc=5, argv=0x556026358850) at /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/main/main.c:228 On Wed, Nov 29, 2017 at 8:41 PM, Tomas Vondra wrote: > On 11/29/2017 05:23 PM, Alexander Voytsekhovskyy wrote: > > thanks for detailed instructions! > > > > Here it is: > > > > Note: breakpoint 1 also set at pc 0x5560252c21f5. > > Breakpoint 2 at 0x5560252c21f5: file > > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../ > src/backend/utils/mmgr/dsa.c, > > line 1736. > > (gdb) bt > > #0 0x00007f53e538e9b3 in __epoll_wait_nocancel () at > > ../sysdeps/unix/syscall-template.S:84 > > #1 0x000055602516a791 in WaitEventSetWaitBlock (nevents=1, > > occurred_events=0x7ffc2849a460, cur_timeout=-1, set=0x556026359498) at > > /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../ > src/backend/storage/ipc/latch.c:1048 > > Unfortunately, that's still not it. This shows the backtrace at the > moment when gdb attached to the process (because it interrupts the > exectution). You need to do "c" first, to continue the execution. > > So it should be: > > 1) gdb -p $PID > 2) (gdb) handle SIGUSR1 noprint nostop > 3) (gdb) break dsa.c:1736 > 4) (gdb) c > 5) run the query, gdb should interrupt at the breakpoint > 6) (gdb) bt > > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Thomas Munro

Date:

30 November 2017, 00:18:48

On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
<young.inbox@gmail.com> wrote:
> Thanks for helping, here is one more try
>
> #0  get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
> #1  0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
> dp=dp@entry=1099511685280) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
> #2  0x00005560250a2c2b in tbm_attach_shared_iterate
> (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
> #3  0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176

Thank you for the report and the back trace.  I think this might be a
manifestation of the problem I just described[1] on -hackers.
Depending on the shape of a multi-Gather query plan and therefore the
order of control flow, you might finish up using the DSA area that
belongs to a different Gather node and then find that it goes away too
soon.  Investigating.

https://www.postgresql.org/message-id/CAEepm%3D1U6as%3DbrnVvMNixEV2tpi8NuyQoTmO8Qef0-VV%2B%3D7MDA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Thomas Munro

Date:

05 December 2017, 04:35:07

On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
> <young.inbox@gmail.com> wrote:
>> Thanks for helping, here is one more try
>>
>> #0  get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
>> #1  0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
>> dp=dp@entry=1099511685280) at
>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
>> #2  0x00005560250a2c2b in tbm_attach_shared_iterate
>> (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
>> #3  0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176
>
> Thank you for the report and the back trace.  I think this might be a
> manifestation of the problem I just described[1] on -hackers.
> Depending on the shape of a multi-Gather query plan and therefore the
> order of control flow, you might finish up using the DSA area that
> belongs to a different Gather node and then find that it goes away too
> soon.  Investigating.

I haven't managed to reproduce this, but I was coincidentally
investigating a bug that appears to explain it.  I think what happened
is that a background worker was first to execute BitmapHeapNext and
allocated a dsa_pointer, and then the leader process reached
BitmapHeapNext and called tbm_attach_shared_iterate which tried to
deference it, but it had es_query_dsa set to another gather node's DSA
area (whichever Gather most recently ran ExecInitParallelPlan).  That
requires a certain order of execution and timing that I'm not sure how
to reach.  I have posted a patch that should fix it over here:

https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com

Are you able to provide a minimal reproducer, an anonymised partial
dump, or perhaps try out the patch on a copy of your database?

-- 
Thomas Munro
http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Alexander Voytsekhovskyy

Date:

07 December 2017, 23:55:56

Hi,

i can confirm that this patch fix the bug for me.

I can’t provide minimal reproducer, but i can share whole data set with you

> On Thu, Nov 30, 2017 at 10:18 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Thu, Nov 30, 2017 at 9:34 AM, Alexander Voytsekhovskyy
>> <young.inbox@gmail.com> wrote:
>>> Thanks for helping, here is one more try
>>>
>>> #0  get_segment_by_index (area=area@entry=0x556026700be8, index=1) at
>>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:1736
>>> #1  0x00005560252c2b90 in dsa_get_address (area=area@entry=0x556026700be8,
>>> dp=dp@entry=1099511685280) at
>>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/utils/mmgr/dsa.c:945
>>> #2  0x00005560250a2c2b in tbm_attach_shared_iterate
>>> (dsa=dsa@entry=0x556026700be8, dp=1099511685280) at
>>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/nodes/tidbitmap.c:1503
>>> #3  0x0000556025066c7b in BitmapHeapNext (node=node@entry=0x556026460710) at
>>> /build/postgresql-10-qAeTPy/postgresql-10-10.1/build/../src/backend/executor/nodeBitmapHeapscan.c:176
>>
>> Thank you for the report and the back trace.  I think this might be a
>> manifestation of the problem I just described[1] on -hackers.
>> Depending on the shape of a multi-Gather query plan and therefore the
>> order of control flow, you might finish up using the DSA area that
>> belongs to a different Gather node and then find that it goes away too
>> soon.  Investigating.
>
> I haven't managed to reproduce this, but I was coincidentally
> investigating a bug that appears to explain it.  I think what happened
> is that a background worker was first to execute BitmapHeapNext and
> allocated a dsa_pointer, and then the leader process reached
> BitmapHeapNext and called tbm_attach_shared_iterate which tried to
> deference it, but it had es_query_dsa set to another gather node's DSA
> area (whichever Gather most recently ran ExecInitParallelPlan).  That
> requires a certain order of execution and timing that I'm not sure how
> to reach.  I have posted a patch that should fix it over here:
>
> https://www.postgresql.org/message-id/CAEepm%3D0Mv9BigJPpribGQhnHqVGYo2%2BkmzekGUVJJc9Y_ZVaYA%40mail.gmail.com
>
> Are you able to provide a minimal reproducer, an anonymised partial
> dump, or perhaps try out the patch on a copy of your database?
>
> --
> Thomas Munro
> http://www.enterprisedb.com

Re: BUG in 10.1 - dsa_area could not attach to a segment that hasbeen freed

From

Thomas Munro

Date:

08 December 2017, 00:45:43

On Fri, Dec 8, 2017 at 9:55 AM, Alexander Voytsekhovskyy
<av@mobile-ua.com> wrote:
> i can confirm that this patch fix the bug for me.

Great news.  Thanks for taking the time to test it!

> I can’t provide minimal reproducer, but i can share whole data set with you

It's OK, your report is enough for me (I don't want to receive any
non-anonymised data).

--
Thomas Munro
http://www.enterprisedb.com