Thread: Seg Fault in backend after beginning to use xpath (PG 9.0, FreeBSD 8.1)

Our developers started using some xpath features, and since deploying
that code PostgreSQL has been segfaulting periodically.
Any ideas on what to look at next would be much appreciated.

FreeBSD 8.1
PostgreSQL 9.0.3 (also tried upgrading to 9.0.4), built from ports
libxml2 2.7.6 (also tried upgrading to 2.7.8), built from ports

pgsql logs show:
May  1 17:51:13 192.168.20.100 postgres[11862]: [94-1] LOG:  server process (PID 62112) was terminated by signal 11: Segmentation fault

syslog shows:
May  2 20:29:16 db3 kernel: pid 49956 (postgres), uid 70: exited on signal 11 (core dumped)
May  2 21:06:37 db3 kernel: pid 39086 (postgres), uid 70: exited on signal 10 (core dumped)

Checking out postgres.core and we see:

(gdb) bt
#0  0x00000008f5f19afd in pthread_mutex_lock () from /lib/libthr.so.3
#1  0x0000000800d22965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5
#2  0x0000000800d717e1 in xmlDictReference () from /usr/local/lib/libxml2.so.5
#3  0x0000000800d74ba5 in xmlSAX2StartDocument ()
   from /usr/local/lib/libxml2.so.5
#4  0x0000000800ccee5f in xmlParseDocument () from /usr/local/lib/libxml2.so.5
#5  0x0000000800ccef85 in xmlDoRead () from /usr/local/lib/libxml2.so.5
#6  0x000000000076b58d in xpath ()
#7  0x00000000005880e4 in GetAttributeByNum ()
#8  0x0000000000588e91 in GetAttributeByName ()
#9  0x00000000005850a3 in ExecProject ()
#10 0x000000000058c5e4 in ExecScan ()
#11 0x0000000000584a2d in ExecProcNode ()
#12 0x000000000059bfc8 in ExecLimit ()
#13 0x00000000005848f5 in ExecProcNode ()
#14 0x0000000000583049 in standard_ExecutorRun ()
#15 0x000000000067630d in PostgresMain ()
#16 0x0000000000677921 in PortalRun ()
#17 0x0000000000672ea4 in pg_parse_and_rewrite ()
#18 0x0000000000675354 in PostgresMain ()
#19 0x0000000000626afb in ClosePostmasterPorts ()
#20 0x0000000000627a8e in PostmasterMain ()
#21 0x00000000005bbea7 in main ()
(gdb)


Ideas?  Need more info?

Thanks,
Alan

alan bryan <alan.bryan@gmail.com> writes:
> Ideas?  Need more info?

Well, the first thing that you should consider is rebuilding both PG and
libxml with debug symbols enabled, so you can get a stack trace that's
worth the electrons it's written on.  That one has enough laughers in
the PG part to make me not trust the libxml part too much.  That would
also help you find out what SQL command is being executed, which'd
possibly lead to being able to create a reproducible test case.
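
A quick way to judge whether a rebuilt binary gives a usable trace is to
look for frames that gdb resolved to a source location: with debug
symbols a frame ends in "at file.c:NNN", without them you only get
"func ()" or "from /path/lib.so". A minimal Python sketch of that check
follows. The function name frames_missing_source and the sample frame #7
(including its xml.c line number) are made up for illustration; they are
not taken from a real trace.

```python
import re
from typing import List

# A gdb frame line starts with "#<number>" followed by whitespace.
FRAME_START = re.compile(r"^#(\d+)\s")
# Frames resolved with debug info end in " at <file>:<line>".
HAS_SOURCE = re.compile(r"\sat\s\S+:\d+\s*$")


def frames_missing_source(backtrace: str) -> List[int]:
    """Return the numbers of frames that carry no file:line information."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_START.match(line)
        if match:
            frames.append([int(match.group(1)), line.strip()])
        elif frames and line[:1].isspace() and line.strip():
            # Indented line: gdb wrapped the previous frame onto a new line.
            frames[-1][1] += " " + line.strip()
    return [num for num, text in frames if not HAS_SOURCE.search(text)]


# Frames #0, #3 and #6 are copied from the trace above; #7 is hypothetical
# and only shows what a frame with debug info would look like.
SAMPLE = """\
#0  0x00000008f5f19afd in pthread_mutex_lock () from /lib/libthr.so.3
#3  0x0000000800d74ba5 in xmlSAX2StartDocument ()
   from /usr/local/lib/libxml2.so.5
#6  0x000000000076b58d in xpath ()
#7  0x000000000076b58d in xpath (fcinfo=0x7fffffffd000) at xml.c:3300
(gdb)
"""

print(frames_missing_source(SAMPLE))  # frames still lacking source info
```

If every frame number comes back, the build still has no usable debug
information and the trace is no better than the one posted.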

            regards, tom lane

On Mon, May 2, 2011 at 10:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, the first thing that you should consider is rebuilding both PG and
> libxml with debug symbols enabled, so you can get a stack trace that's
> worth the electrons it's written on.  That one has enough laughers in
> the PG part to make me not trust the libxml part too much.  That would
> also help you find out what SQL command is being executed, which'd
> possibly lead to being able to create a reproducible test case.
>
>                        regards, tom lane
>

Thanks Tom - I'll see what I can do.  We just removed that new code
and did it in our PHP code instead as a workaround.  I'll try to spend
some time getting a reproducible test case and come back with a better
trace if possible.

Appreciate the quick response.

--Alan

On 03/05/2011 07:12, alan bryan wrote:
> (gdb) bt
> #0  0x00000008f5f19afd in pthread_mutex_lock () from /lib/libthr.so.3
> #1  0x0000000800d22965 in xmlRMutexLock () from /usr/local/lib/libxml2.so.5

This is unusual. There shouldn't be any need for pthreads here, since a
PostgreSQL backend is a single-threaded process. As far as I can see,
the normal build of libxml2 doesn't link against the thread library
explicitly:

$ ldd /usr/local/lib/libxml2.so
/usr/local/lib/libxml2.so:
    libz.so.5 => /lib/libz.so.5 (0x800889000)
    libiconv.so.3 => /usr/local/lib/libiconv.so.3 (0x800e50000)
    libm.so.5 => /lib/libm.so.5 (0x80104b000)
    libc.so.7 => /lib/libc.so.7 (0x800647000)

Judging by the mix of SIGBUS (signal 10) and SIGSEGV (signal 11), I'd
say the pthread locking inside libxml2 is likely what is causing you
problems.
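
For reference, the signal numbers in the kernel messages are FreeBSD
numbers: 10 is SIGBUS and 11 is SIGSEGV (on Linux x86, 10 would be
SIGUSR1, so the numbers are not portable). A small Python sketch that
pulls the pid and signal out of such syslog lines; the table is
deliberately partial and the function name crashed_processes is only
for illustration.

```python
import re
from typing import List, Tuple

# Partial FreeBSD signal table (see signal(3)); only what is needed here.
FREEBSD_SIGNALS = {6: "SIGABRT", 10: "SIGBUS", 11: "SIGSEGV"}

# \s+ lets the match survive a mail client wrapping the line after "exited on".
EXIT_RE = re.compile(
    r"pid (\d+) \(([^)]+)\), uid \d+: exited on\s+signal (\d+)"
)


def crashed_processes(log_text: str) -> List[Tuple[int, str, str]]:
    """Return (pid, program, signal name) for each kernel exit message."""
    results = []
    for pid, program, signo in EXIT_RE.findall(log_text):
        name = FREEBSD_SIGNALS.get(int(signo), "signal " + signo)
        results.append((int(pid), program, name))
    return results


# The two kernel messages from the original report.
LOG = """\
May  2 20:29:16 db3 kernel: pid 49956 (postgres), uid 70: exited on
signal 11 (core dumped)
May  2 21:06:37 db3 kernel: pid 39086 (postgres), uid 70: exited on
signal 10 (core dumped)
"""

for entry in crashed_processes(LOG):
    print(entry)
```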

To confirm, you may want to rebuild libxml2 with WITHOUT_THREADS
defined. You may also need to rebuild PostgreSQL afterwards.
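
One possible way to check the result of such a rebuild (a sketch, not
verified on this system) is to list the library's dynamic symbols with
nm -D and see whether it still references any pthread_* functions. The
Python below scans nm output for undefined or weak pthread references.
The sample output is invented, and the function name
undefined_pthread_symbols is hypothetical.

```python
from typing import List


def undefined_pthread_symbols(nm_output: str) -> List[str]:
    """Return pthread_* symbols the library references but does not define."""
    found = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined ("U") and weak ("w") references have no address column,
        # so the line splits into exactly two fields: type and name.
        if len(fields) == 2 and fields[0] in ("U", "w"):
            name = fields[1].split("@")[0]  # drop any symbol version suffix
            if name.startswith("pthread_"):
                found.add(name)
    return sorted(found)


# Invented sample in the usual "address type name" layout of nm -D.
SAMPLE_NM = """\
                 U malloc
                 U pthread_mutex_lock
                 w pthread_mutex_unlock
0000000000022940 T xmlRMutexLock
"""

print(undefined_pthread_symbols(SAMPLE_NM))
```

To feed it real data, something like
subprocess.run(["nm", "-D", path], capture_output=True, text=True).stdout
would do. An empty list after the rebuild would suggest the library no
longer calls into the thread library.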