Thread: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

[HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Peter Geoghegan
A customer is on 9.6.1, and complains of a segfault observed at least
3 times. In all cases, the logs look like this:

Jan 11 16:11:07 ip-10-0-118-82 kernel: [41913.530453] traps:
postgres[6561] general protection ip:55fcf08b0491 sp:7ffc17dfa650
error:0 in postgres[55fcf0557000+638000]
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-1] LOG:  server
process (PID 643) was terminated by signal 11: Segmentation fault
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [4-2] DETAIL:  Failed
process was running: SELECT COUNT(count_column) FROM (SELECT  1 AS
count_column FROM "albums" INNER JOIN "album_photo_assignments" ON
"albums"."id" = "album_photo_assignments"."album_id" WHERE
"albums"."deleted_at" IS NULL AND "album_photo_assignments"."photo_id"
= $1 LIMIT 1000 OFFSET 0) subquery_for_count
Jan 11 16:11:07 ip-10-0-118-82 postgresql[7]: [5-1] LOG:  terminating
any other active server processes
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-1] WARNING:
terminating connection because of crash of another server process
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-2] DETAIL:  The
postmaster has commanded this server process to roll back the current
transaction and exit, because another server process exited abnormally
and possibly corrupted shared memory.
Jan 11 16:11:07 ip-10-0-118-82 postgresql[80]: [6-3] HINT:  In a
moment you should be able to reconnect to the database and repeat your
command.

The "general protection ip:55fcf08b0491 sp:7ffc17dfa650 error:0 in
postgres[55fcf0557000+638000]" details are identical in all 3
segfaults observed. Note that the kernel in question is
"4.4.0-57-generic #78~14.04.1-Ubuntu". These crashes were days apart,
but I have yet to isolate the problem. The customer tells me that this
is a representative query plan:

explain SELECT COUNT(count_column)
FROM (SELECT  1 AS count_column   FROM "albums"   INNER JOIN "album_photo_assignments" ON "albums"."id" =
"album_photo_assignments"."album_id"   WHERE "albums"."deleted_at" IS NULL AND
"album_photo_assignments"."photo_id" =
'fd0bcbb3-9c4c-4036-954a-cf01935d004c'   LIMIT 1000 OFFSET 0)
subquery_for_count ;

                                                  QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Aggregate  (cost=16419.94..16419.94 rows=1 width=8)
   ->  Limit  (cost=1000.06..16419.94 rows=1 width=4)
         ->  Nested Loop  (cost=1000.06..16419.94 rows=1 width=4)
               ->  Gather  (cost=1000.00..16415.87 rows=1 width=16)
                     Workers Planned: 1
                     ->  Parallel Seq Scan on album_photo_assignments  (cost=0.00..15415.77 rows=1 width=16)
                           Filter: (photo_id = 'fd0bcbb3-9c4c-4036-954a-cf01935d004c'::uuid)
               ->  Index Scan using albums_pkey on albums  (cost=0.06..4.06 rows=1 width=16)
                     Index Cond: (id = album_photo_assignments.album_id)
                     Filter: (deleted_at IS NULL)
(10 rows)

Note that this isn't a prepared statement, unlike the problematic
query we see in the logs. I also haven't managed to verify that the
problem case involved this particular value for the $1 parameter,
since parameter values aren't displayed in the logs. So, while this
query plan may well be representative of the problem, I can't be
certain that it is.
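
If anyone with access to a similar schema wants to check whether the
parameterized form of the query gets the same parallel plan, something
like this should do it (the statement name is mine, and the parameter
value is arbitrary, since the real one isn't logged):

-- hypothetical statement name; parameter value is arbitrary
PREPARE photo_count (uuid) AS
    SELECT COUNT(count_column)
    FROM (SELECT 1 AS count_column
          FROM "albums"
          INNER JOIN "album_photo_assignments"
                  ON "albums"."id" = "album_photo_assignments"."album_id"
          WHERE "albums"."deleted_at" IS NULL
            AND "album_photo_assignments"."photo_id" = $1
          LIMIT 1000 OFFSET 0) subquery_for_count;

EXPLAIN EXECUTE photo_count('fd0bcbb3-9c4c-4036-954a-cf01935d004c');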

I can use GDB to get details of the instruction pointer that appeared
in the kernel trap error, which shows a function from the expanded
value representation infrastructure:

(gdb) info symbol 0x55fcf08b0491
EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) info symbol 0x55fcf08b0490
EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres
(gdb) disassemble 0x55fcf08b0490
Dump of assembler code for function EOH_get_flat_size:
   0x000055fcf08b0490 <+0>: push   %rbp
   0x000055fcf08b0491 <+1>: mov    0x8(%rdi),%rax
   0x000055fcf08b0495 <+5>: mov    %rsp,%rbp
   0x000055fcf08b0498 <+8>: pop    %rbp
   0x000055fcf08b0499 <+9>: mov    (%rax),%rax
   0x000055fcf08b049c <+12>: jmpq   *%rax
End of assembler dump.

(I believe that the instruction pointer Linux reports in the kernel
trap error quoted above is a virtual address within the userland
program's (Postgres') address space, so GDB does in fact display
relevant disassembly -- the documentation on this seems extremely
limited.)

Note that there is a text array column in the table "albums". There
have already been fixes in this area that are not in 9.6.1 (commit
77cd0dc, for one). I know next to nothing about this code, and would
appreciate some guidance.

--
Peter Geoghegan



Peter Geoghegan <pg@heroku.com> writes:
> A customer is on 9.6.1, and complains of a segfault observed at least
> 3 times. In all cases, the logs look like this:
> ...
> I can use GDB to get details of the instruction pointer that appeared
> in the kernel trap error, which shows a function from the expanded
> value representation infrastructure:

It would help to know the data types of the columns involved in this
query; but just eyeballing it, it doesn't look like it involves any
array operations, so it's pretty hard to believe that the expanded-object
code could have gotten invoked intentionally.  (The mere presence of
an array column somewhere in the vicinity would not do that; you'd
need to invoke an array-ish operation, or at least pass the array into
a plpgsql function.)
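
For example, reaching that code would take something more like this --
with hypothetical function and column names, of course:

-- hypothetical function; assumes a text[] column named "tags" on albums
CREATE FUNCTION tag_count(tags text[]) RETURNS int AS $$
BEGIN
    -- plpgsql keeps array variables in expanded form internally,
    -- so this is the kind of thing that would reach the EOH_* code
    RETURN coalesce(array_length(tags, 1), 0);
END;
$$ LANGUAGE plpgsql;

SELECT tag_count(a.tags) FROM albums a;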

If I had to bet on the basis of this much info, I would bet that the
parallel-query infrastructure is dropping the ball somewhere and
transmitting a corrupted datum that accidentally looks like it is
an expanded-object reference.

If $customer wants a quick fix, I'd suggest seeing whether disabling
parallel query makes the problem go away.  That might be a good first
step anyway, just to narrow down where the problem lies.
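
On 9.6 that's just a matter of zeroing out the per-Gather worker limit,
either for a session or cluster-wide:

SET max_parallel_workers_per_gather = 0;                -- this session only
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;   -- cluster-wide
SELECT pg_reload_conf();                                -- pick up the ALTER SYSTEM change
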
        regards, tom lane



On Thu, Jan 19, 2017 at 9:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If I had to bet on the basis of this much info, I would bet that the
> parallel-query infrastructure is dropping the ball somewhere and
> transmitting a corrupted datum that accidentally looks like it is
> an expanded-object reference.

There's code in datumSerialize() that is intended to do the right
thing with expanded-object references.  But it could have a bug, of
course.  That function gets called from SerializeParamList(), which
might be another place to look for problems if the crash can only be
reproduced when there are statement parameters.
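
If someone wants to poke at that path, forcing a parallel plan for a
parameterized statement should be enough to reach SerializeParamList,
if I'm remembering the execParallel code correctly.  Something like
this, using the tables from the original report:

-- tables as in the original report; settings just force a trivial parallel plan
SET force_parallel_mode = on;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;

PREPARE p (uuid) AS
    SELECT count(*) FROM album_photo_assignments WHERE photo_id = $1;
EXECUTE p('fd0bcbb3-9c4c-4036-954a-cf01935d004c');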

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Peter Geoghegan
On Thu, Jan 19, 2017 at 6:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If $customer wants a quick fix, I'd suggest seeing whether disabling
> parallel query makes the problem go away.  That might be a good first
> step anyway, just to narrow down where the problem lies.

I no longer work at Heroku, so I can do very little about it now, but
I found out privately that disabling parallel query made the problem
go away: parallel query has been disabled this whole time, and the
problem hasn't returned as of today.


-- 
Peter Geoghegan



Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Justin Workman
> It would help to know the data types of the columns involved in this
> query; but just eyeballing it, it doesn't look like it involves any
> array operations, so it's pretty hard to believe that the expanded-object
> code could have gotten invoked intentionally.  (The mere presence of
> an array column somewhere in the vicinity would not do that; you'd
> need to invoke an array-ish operation, or at least pass the array into
> a plpgsql function.)
> If I had to bet on the basis of this much info, I would bet that the
> parallel-query infrastructure is dropping the ball somewhere and
> transmitting a corrupted datum that accidentally looks like it is
> an expanded-object reference.
> If $customer wants a quick fix, I'd suggest seeing whether disabling
> parallel query makes the problem go away.  That might be a good first
> step anyway, just to narrow down where the problem lies.
> regards, tom lane

Hi Tom,

My name is Justin, and I am $customer as it were. As Peter explained, we haven't seen the segfaults anymore since disabling parallel queries. This works as a quick fix and is much appreciated! If you would still like to get to the bottom of this, I am willing to help out with more information as needed. My knowledge of PG internals is extremely limited so I don't know how much help I can be, but we'd like to see this resolved beyond the quick fix, or at least understand why it happened.

album_photo_assignments.id, album_photo_assignments.album_id, album_photo_assignments.photo_id and albums.id are all UUID columns. albums.deleted_at is a timestamp.

Thanks so much for your time,

Justin Workman
On Thu, Feb 16, 2017 at 11:45 AM, Justin Workman <justin@photolynx.com> wrote:
>> It would help to know the data types of the columns involved in this
>> query; but just eyeballing it, it doesn't look like it involves any
>> array operations, so it's pretty hard to believe that the expanded-object
>> code could have gotten invoked intentionally.  (The mere presence of
>> an array column somewhere in the vicinity would not do that; you'd
>> need to invoke an array-ish operation, or at least pass the array into
>> a plpgsql function.)
>> If I had to bet on the basis of this much info, I would bet that the
>> parallel-query infrastructure is dropping the ball somewhere and
>> transmitting a corrupted datum that accidentally looks like it is
>> an expanded-object reference.
>> If $customer wants a quick fix, I'd suggest seeing whether disabling
>> parallel query makes the problem go away.  That might be a good first
>> step anyway, just to narrow down where the problem lies.
>> regards, tom lane
>
>
> Hi Tom,
>
> My name is Justin, and I am $customer as it were. As Peter explained, we
> haven't seen the segfaults anymore since disabling parallel queries. This
> works as a quick fix and is much appreciated! If you would still like to get
> to the bottom of this, I am willing to help out with more information as
> needed. My knowledge of PG internals is extremely limited so I don't know
> how much help I can be, but we'd like to see this resolved beyond the quick
> fix, or at least understand why it happened.
>

Thanks, Justin, for the confirmation.  Is it possible for you to get a
standalone test case with which we can reproduce the problem?
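
Even a rough sketch would help.  Something along these lines -- with
the schema guessed from the query in the original report and the
parallel costs lowered so that small tables still get a Gather --
might be a starting point, though I haven't confirmed it reproduces
the crash:

-- schema guessed and simplified from the query; not a confirmed reproducer
CREATE TABLE albums (
    id uuid PRIMARY KEY,
    deleted_at timestamp
);
CREATE TABLE album_photo_assignments (
    album_id uuid,
    photo_id uuid
);

INSERT INTO albums (id, deleted_at)
SELECT md5(i::text)::uuid, NULL FROM generate_series(1, 100000) i;

INSERT INTO album_photo_assignments (album_id, photo_id)
SELECT md5(i::text)::uuid, md5('p' || i)::uuid FROM generate_series(1, 100000) i;

ANALYZE albums;
ANALYZE album_photo_assignments;

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;

PREPARE q (uuid) AS
    SELECT count(*)
    FROM (SELECT 1
          FROM albums
          JOIN album_photo_assignments
            ON albums.id = album_photo_assignments.album_id
          WHERE albums.deleted_at IS NULL
            AND album_photo_assignments.photo_id = $1
          LIMIT 1000 OFFSET 0) s;

EXECUTE q(md5('p1')::uuid);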


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Peter Geoghegan
On Thu, Jan 19, 2017 at 5:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
> A customer is on 9.6.1, and complains of a segfault observed at least
> 3 times.

> I can use GDB to get details of the instruction pointer that appeared
> in the kernel trap error, which shows a function from the expanded
> value representation infrastructure:
>
> (gdb) info symbol 0x55fcf08b0491
> EOH_get_flat_size + 1 in section .text of /usr/lib/postgresql/9.6/bin/postgres
> (gdb) info symbol 0x55fcf08b0490
> EOH_get_flat_size in section .text of /usr/lib/postgresql/9.6/bin/postgres
> (gdb) disassemble 0x55fcf08b0490
> Dump of assembler code for function EOH_get_flat_size:
>    0x000055fcf08b0490 <+0>: push   %rbp
>    0x000055fcf08b0491 <+1>: mov    0x8(%rdi),%rax
>    0x000055fcf08b0495 <+5>: mov    %rsp,%rbp
>    0x000055fcf08b0498 <+8>: pop    %rbp
>    0x000055fcf08b0499 <+9>: mov    (%rax),%rax
>    0x000055fcf08b049c <+12>: jmpq   *%rax
> End of assembler dump.

For the sake of the archives: this now looks very much like the issue
that Tom just fixed with commit
9bf4068cc321a4d44ac54089ab651a49d89bb567.

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jan 19, 2017 at 5:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> A customer is on 9.6.1, and complains of a segfault observed at least
>> 3 times.
> ...
> For the sake of the archives: this now looks very much like the issue
> that Tom just fixed with commit
> 9bf4068cc321a4d44ac54089ab651a49d89bb567.

Yeah, particularly seeing that $customer noted that some of the
columns involved were UUIDs:

https://www.postgresql.org/message-id/CAOxz3fqK9Y0GntL8MDoeZjy2Ot_6Lx1YvHAY6Bd1vYkUp-iS_A@mail.gmail.com

Good to have gotten to the bottom of that one.  Too bad it just
missed the train for 9.6.4.
        regards, tom lane



Re: [HACKERS] Possible issue with expanded object infrastructure on Postgres 9.6.1

From: Justin Workman
On Tue, Aug 8, 2017 at 8:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> On Thu, Jan 19, 2017 at 5:45 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> A customer is on 9.6.1, and complains of a segfault observed at least
>>> 3 times.
>> ...
>> For the sake of the archives: this now looks very much like the issue
>> that Tom just fixed with commit
>> 9bf4068cc321a4d44ac54089ab651a49d89bb567.
>
> Yeah, particularly seeing that $customer noted that some of the
> columns involved were UUIDs:
>
> https://www.postgresql.org/message-id/CAOxz3fqK9Y0GntL8MDoeZjy2Ot_6Lx1YvHAY6Bd1vYkUp-iS_A@mail.gmail.com
>
> Good to have gotten to the bottom of that one.  Too bad it just
> missed the train for 9.6.4.
>
>                         regards, tom lane

$customer here. I just want to thank everyone involved for getting to the bottom of this and for your support. Even though the fix missed the 9.6.4 release, I'm very grateful for your help. We haven't had much of an issue since disabling parallel workers, so nothing is harmed by waiting a little longer.

Thanks,
Justin