Thread: Segmentation Fault in Postgres server when using psqlODBC

Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

24 May 2013, 19:55:30

Hi Groups,

I'm dealing with periodic backend process segmentation faults. I'm posting to both the bugs and odbc lists as it seems that my application's use of pgsqlODBC triggers a bug in the postgres backend.

The environment is:

Clients: win32 clients using various version of the psqlODBC driver (mostly using 8.04.0200). The connection string contains "UseDeclareFetch=1" which causes long idle transactions, heavy cursor and savepoint use.

Server: Dell dual Xeon x86 48GB RAM (12GB PG shared mem) RHEL system with Postgresql 9.2.4. Note that these same issues occurred on PG9.1 and PG8.4.

I've experienced these issues for over a year now. However, during that time several things have changed which may or may not be related:

* The schema has been modified with heavier use of triggers to update manually created materialized views (ie not using the 9.3 CREATE MATERIALIZED VIEW).

* The number of concurrent users has increased (from perhaps 15 concurrent users two years ago, to 30 concurrent users now).

* The PG version used has changed from 8.4, to 9.1, and finally to 9.2

* I've done recent tuning to the planner cost constants in order to favor index scans over table scans in more situations.

I've looked through past error logs and I found that I had a segfault in the server process while using PG 8.4 about a dozen times over a 12 month period. Each time one client's postgres process crashes, the backend forcefully closes all active connections due to possiby corrupted memory and then restarts. This leaves all active clients stranded as the connection is closed, and all cursors and savepoint info is lost. My app doesn't recover gracefully, and users are forced to click through some cryptic error messages from the application framework used for the app (Clarion) and then restart the app.

A few months ago, I upgraded to another server with PG 9.1, assuming that the issue with the previous server with 8.4 was due to bad RAM, as I did observe a high ECC error count on the previous system as logged by the IPMI controller. However, I very quickly had another segfault on the new system with PG9.1. The default settings of the OS and PG init script disabled core dumps, so I only started collecting core files for the past few months.

I posted details in pgsql-general April 10, 2013; here is a link to the thread: http://www.postgresql.org/message-id/CAPmZXM03MEDEn6nqqf_Phs3M1DK-EaXP5_K-LmirneOJMAQ-Hg@mail.gmail.com

The crash always is some variation of the following stack, as observed in both PG91 and PG92 crashes:

(gdb) bt

#0 ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:603

#1 0x000000000070f372 in SearchCatCache (cache=0x27fab90, v1=<value optimized out>, v2=<value optimized out>, v3=<value optimized out>, v4=<value optimized out>) at catcache.c:1136

#2 0x000000000071b1ae in getTypeOutputInfo (type=20, typOutput=0x2b3db80, typIsVarlena=0x2b3db88 "") at lsyscache.c:2482

#3 0x000000000045d127 in printtup_prepare_info (myState=0x2810290, typeinfo=0x29ad7b0, numAttrs=42) at printtup.c:263

#4 0x000000000045d4c4 in printtup (slot=0x3469650, self=0x2810290) at printtup.c:297

#5 0x000000000065a76a in RunFromStore (portal=0x285fed0, direction=<value optimized out>, count=10, dest=0x2810290) at pquery.c:1122

#6 0x000000000065a852 in PortalRunSelect (portal=0x285fed0, forward=<value optimized out>, count=10, dest=0x2810290) at pquery.c:940

#7 0x000000000065bcf8 in PortalRun (portal=0x285fed0, count=10, isTopLevel=1 '\001', dest=0x2810290, altdest=0x2810290, completionTag=0x7fffee67f7c0 "") at pquery.c:788

#8 0x0000000000659552 in exec_execute_message (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2768370 "DBNAME", username=<value optimized out>) at postgres.c:1929

#9 PostgresMain (argc=<value optimized out>, argv=<value optimized out>, dbname=0x2768370 "DBNAME", username=<value optimized out>) at postgres.c:4016

#10 0x0000000000615161 in BackendRun () at postmaster.c:3614

#11 BackendStartup () at postmaster.c:3304

#12 ServerLoop () at postmaster.c:1367

#13 0x0000000000617dcc in PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1127

#14 0x00000000005b6830 in main (argc=5, argv=0x2766480) at main.c:199

(gdb)

Andres Freund and Tom Lane spent some time tracking down possible root causes with some of the following summaries:

Tom Lane writes:

|Andres Freund <andres@2ndquadrant.com> writes:

|> Tom: It looks to me like printtup_prepare_info won't normally be called

|> in an held cursor. But if some concurrent DDL changed the number of

|> columns in typeinfo vs thaose in the the receiver that could explain the

|> issue and why its not seen all the time, right?

|It looks to me like there are probably two triggering conditions:

|1. Client is doing a direct protocol Execute on a held-cursor portal.

|2. Cache flush occurs to drop the syscache entries needed by

|getTypeOutputInfo. (Otherwise, they'd still be around from when the

|held cursor was created.)

|The first of these explains why we don't see it in the

|CLOBBER_CACHE_ALWAYS buildfarm runs, and the second one explains why

|Joshua is only seeing it intermittently and not every darn time his

|application does that.

Andres Freund writes:

|On 2013-04-10 19:06:12 -0400, Tom Lane wrote:

|> I wrote:

|> > (Wanders away wondering just how much the regression tests exercise

|> > holdable cursors.)

|> And the answer is they're not testing this code path at all, because if

|> you do

|> DECLARE c CURSOR WITH HOLD FOR ...

|> FETCH ALL FROM c;

|> then the second query executes with a portal (and resource owner)

|> created to execute the FETCH command, not directly on the held portal.

|> After a little bit of thought I'm not sure it's even possible to

|> reproduce this problem with libpq, because it doesn't expose any way to

|> issue a bare protocol Execute command against a pre-existing portal.

|> (I had thought psqlOBC went through libpq, but maybe it's playing some

|> games here.)

|Hm. PQexecPrepared() looks like it can do that.

I didn't find that any bugs were officially reported/documented, so I wanted to followup with more detail and follow the guidelines here: http://www.postgresql.org/docs/current/static/bug-reporting.html

Specifically, I decided to update to the latest stable release, currently PG9.2.4.

After updating to PG92 three days ago, I've had two of the same segfaults. Either I'm very (un)lucky or the intermittant issue is occurring more frequently with PG92.

On a test machine running a snapshort of the production database, and with the backend configured to log all queries, I have an example of the type of traffic that client apps will perform: https://dl.dropboxusercontent.com/u/4715250/pg92_psqlODBC_08.04.0200.txt

In the above logfile, the logfile was cleared, the client app was started, A window was opened, it was closed, and the app was closed. I installed pgODBC version 09.01.0200, and aside from cursor names and timestamps differing, all other lines were identical.

I do not currently have a testing environment which can emulate the production environment enough to trigger this bug. I've never had this crash outside of the production environment.

Any thoughts/suggestions?

Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Hiroshi Inoue

Date:

25 May 2013, 00:25:19

Hi,

Psqlodbc drivers send Execite requests for cursors instead of
issueing FETCH commands.

regards,
Hiroshi Inoue

(2013/05/25 1:55), Joshua Berry wrote:
> Hi Groups,
>
> I'm dealing with periodic backend process segmentation faults. I'm
> posting to both the bugs and odbc lists as it seems that my
> application's use of pgsqlODBC triggers a bug in the postgres backend.
>
> The environment is:
> Clients: win32 clients using various version of the psqlODBC driver
> (mostly using 8.04.0200). The connection string contains
> "UseDeclareFetch=1" which causes long idle transactions, heavy cursor
> and savepoint use.
> Server: Dell dual Xeon x86 48GB RAM (12GB PG shared mem) RHEL system
> with Postgresql 9.2.4. Note that these same issues occurred on PG9.1 and
> PG8.4.
>
> I've experienced these issues for over a year now. However, during that
> time several things have changed which may or may not be related:
> * The schema has been modified with heavier use of triggers to update
> manually created materialized views (ie not using the 9.3 CREATE
> MATERIALIZED VIEW).
> * The number of concurrent users has increased (from perhaps 15
> concurrent users two years ago, to 30 concurrent users now).
> * The PG version used has changed from 8.4, to 9.1, and finally to 9.2
> * I've done recent tuning to the planner cost constants in order to
> favor index scans over table scans in more situations.
>
> I've looked through past error logs and I found that I had a segfault in
> the server process while using PG 8.4 about a dozen times over a 12
> month period. Each time one client's postgres process crashes, the
> backend forcefully closes all active connections due to possiby
> corrupted memory and then restarts. This leaves all active clients
> stranded as the connection is closed, and all cursors and savepoint info
> is lost. My app doesn't recover gracefully, and users are forced to
> click through some cryptic error messages from the application framework
> used for the app (Clarion) and then restart the app.
>
> A few months ago, I upgraded to another server with PG 9.1, assuming
> that the issue with the previous server with 8.4 was due to bad RAM, as
> I did observe a high ECC error count on the previous system as logged by
> the IPMI controller. However, I very quickly had another segfault on the
> new system with PG9.1. The default settings of the OS and PG init script
> disabled core dumps, so I only started collecting core files for the
> past few months.
>
> I posted details in pgsql-general April 10, 2013; here is a link to the
> thread:
> http://www.postgresql.org/message-id/CAPmZXM03MEDEn6nqqf_Phs3M1DK-EaXP5_K-LmirneOJMAQ-Hg@mail.gmail.com
>
> The crash always is some variation of the following stack, as observed
> in both PG91 and PG92 crashes:
> (gdb) bt
> #0  ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:603
> #1  0x000000000070f372 in SearchCatCache (cache=0x27fab90, v1=<value
> optimized out>, v2=<value optimized out>, v3=<value optimized out>,
> v4=<value optimized out>) at catcache.c:1136
> #2  0x000000000071b1ae in getTypeOutputInfo (type=20,
> typOutput=0x2b3db80, typIsVarlena=0x2b3db88 "") at lsyscache.c:2482
> #3  0x000000000045d127 in printtup_prepare_info (myState=0x2810290,
> typeinfo=0x29ad7b0, numAttrs=42) at printtup.c:263
> #4  0x000000000045d4c4 in printtup (slot=0x3469650, self=0x2810290) at
> printtup.c:297
> #5  0x000000000065a76a in RunFromStore (portal=0x285fed0,
> direction=<value optimized out>, count=10, dest=0x2810290) at pquery.c:1122
> #6  0x000000000065a852 in PortalRunSelect (portal=0x285fed0,
> forward=<value optimized out>, count=10, dest=0x2810290) at pquery.c:940
> #7  0x000000000065bcf8 in PortalRun (portal=0x285fed0, count=10,
> isTopLevel=1 '\001', dest=0x2810290, altdest=0x2810290,
> completionTag=0x7fffee67f7c0 "") at pquery.c:788
> #8  0x0000000000659552 in exec_execute_message (argc=<value optimized
> out>, argv=<value optimized out>, dbname=0x2768370 "DBNAME",
> username=<value optimized out>) at postgres.c:1929
> #9  PostgresMain (argc=<value optimized out>, argv=<value optimized
> out>, dbname=0x2768370 "DBNAME", username=<value optimized out>) at
> postgres.c:4016
> #10 0x0000000000615161 in BackendRun () at postmaster.c:3614
> #11 BackendStartup () at postmaster.c:3304
> #12 ServerLoop () at postmaster.c:1367
> #13 0x0000000000617dcc in PostmasterMain (argc=<value optimized out>,
> argv=<value optimized out>) at postmaster.c:1127
> #14 0x00000000005b6830 in main (argc=5, argv=0x2766480) at main.c:199
> (gdb)
>
> Andres Freund and Tom Lane spent some time tracking down possible root
> causes with some of the following summaries:
>
> Tom Lane writes:
> |Andres Freund <andres@2ndquadrant.com <mailto:andres@2ndquadrant.com>>
> writes:
> |> Tom: It looks to me like printtup_prepare_info won't normally be called
> |> in an held cursor. But if some concurrent DDL changed the number of
> |> columns in typeinfo vs thaose in the the receiver that could explain the
> |> issue and why its not seen all the time, right?
> |
> |It looks to me like there are probably two triggering conditions:
> |
> |1. Client is doing a direct protocol Execute on a held-cursor portal.
> |
> |2. Cache flush occurs to drop the syscache entries needed by
> |getTypeOutputInfo.  (Otherwise, they'd still be around from when the
> |held cursor was created.)
> |
> |The first of these explains why we don't see it in the
> |CLOBBER_CACHE_ALWAYS buildfarm runs, and the second one explains why
> |Joshua is only seeing it intermittently and not every darn time his
> |application does that.
>
> Andres Freund writes:
> |On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> |> I wrote:
> |> > (Wanders away wondering just how much the regression tests exercise
> |> > holdable cursors.)
> |>
> |> And the answer is they're not testing this code path at all, because if
> |> you do
> |>       DECLARE c CURSOR WITH HOLD FOR ...
> |>       FETCH ALL FROM c;
> |> then the second query executes with a portal (and resource owner)
> |> created to execute the FETCH command, not directly on the held portal.
> |>
> |> After a little bit of thought I'm not sure it's even possible to
> |> reproduce this problem with libpq, because it doesn't expose any way to
> |> issue a bare protocol Execute command against a pre-existing portal.
> |> (I had thought psqlOBC went through libpq, but maybe it's playing some
> |> games here.)
> |
> |Hm. PQexecPrepared() looks like it can do that.
>
> I didn't find that any bugs were officially reported/documented, so I
> wanted to followup with more detail and follow the guidelines here:
> http://www.postgresql.org/docs/current/static/bug-reporting.html
> Specifically, I decided to update to the latest stable release,
> currently PG9.2.4.
>
> After updating to PG92 three days ago, I've had two of the same
> segfaults. Either I'm very (un)lucky or the intermittant issue is
> occurring more frequently with PG92.
>
> On a test machine running a snapshort of the production database, and
> with the backend configured to log all queries, I have an example of the
> type of traffic that client apps will perform:
> https://dl.dropboxusercontent.com/u/4715250/pg92_psqlODBC_08.04.0200.txt
> In the above logfile, the logfile was cleared, the client app was
> started, A window was opened, it was closed, and the app was closed. I
> installed pgODBC version 09.01.0200, and aside from cursor names and
> timestamps differing, all other lines were identical.
>
> I do not currently have a testing environment which can emulate the
> production environment enough to trigger this bug. I've never had this
> crash outside of the production environment.
>
> Any thoughts/suggestions?
>
> Regards,
> -Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

28 May 2013, 17:20:38

Hi Hiroshi,

On Fri, May 24, 2013 at 4:25 PM, Hiroshi Inoue <inoue@tpf.co.jp> wrote:

Psqlodbc drivers send Execite requests for cursors instead of
issueing FETCH commands.

Thanks for the reply. Have you seen or heard of psqlODBC's use of cursors intermittently causing segfaults as I'd described? If so, can these issues be avoided by enabling or disabling options in the driver or avoiding certain queries/conditions? The entire system is a blackbox to me, so when I encounter issues like this, I'm not sure how to diagnose and correct. Any advice or tips would be much appreciated.

Kind Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Hiroshi Inoue

Date:

28 May 2013, 23:19:34

Hi Joshua,

(2013/05/28 23:20), Joshua Berry wrote:
> Hi Hiroshi,
>
> On Fri, May 24, 2013 at 4:25 PM, Hiroshi Inoue <inoue@tpf.co.jp
> <mailto:inoue@tpf.co.jp>> wrote:
>
>     Psqlodbc drivers send Execite requests for cursors instead of
>     issueing FETCH commands.
>
>
> Thanks for the reply. Have you seen or heard of psqlODBC's use of
> cursors intermittently causing segfaults as I'd described?

No.

I have a question about your application.
Is your application mutilthreaded?
If so, do multiple threads share the same connection?

regards,
Hiroshi Inoue

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

29 May 2013, 06:02:10

Hi Hiroshi,

On Tue, May 28, 2013 at 3:19 PM, Hiroshi Inoue <inoue@tpf.co.jp> wrote:

Thanks for the reply. Have you seen or heard of psqlODBC's use of
cursors intermittently causing segfaults as I'd described?

No.

I have a question about your application.
Is your application mutilthreaded?
If so, do multiple threads share the same connection?

It is multithreaded; all threads use the same connection.

Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Hiroshi Inoue

Date:

30 May 2013, 16:54:13

Hi Joshua,

(2013/05/29 12:02), Joshua Berry wrote:
> Hi Hiroshi,
>
> On Tue, May 28, 2013 at 3:19 PM, Hiroshi Inoue <inoue@tpf.co.jp
> <mailto:inoue@tpf.co.jp>> wrote:
>
>         Thanks for the reply. Have you seen or heard of psqlODBC's use of
>         cursors intermittently causing segfaults as I'd described?
>
>
>     No.
>
>     I have a question about your application.
>     Is your application mutilthreaded?
>     If so, do multiple threads share the same connection?
>
> It is multithreaded; all threads use the same connection.

There's a possibility that the crash is caused by the multithreaded
application. Jack Wilson's bug report seems also related to multithread
and I sent him a patch to fix the bug hopefully. We will release the
next release when the patch fixes the bug.

regards,
Hiroshi Inoue

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

03 June 2013, 17:56:07

Hi Hiroshi,

On Thu, May 30, 2013 at 8:52 AM, Hiroshi Inoue <inoue@tpf.co.jp> wrote:

It is multithreaded; all threads use the same connection.

There's a possibility that the crash is caused by the multithreaded
application. Jack Wilson's bug report seems also related to multithread and I sent him a patch to fix the bug hopefully. We will release the next release when the patch fixes the bug.

Thanks for letting me know. I have not had another crash since my last posting to the list, so nothing more to report on this.

Kind Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

"Inoue, Hiroshi"

Date:

04 June 2013, 06:47:39

Hi Joshua,

(2013/06/03 23:55), Joshua Berry wrote:
> Hi Hiroshi,
>
> On Thu, May 30, 2013 at 8:52 AM, Hiroshi Inoue <inoue@tpf.co.jp
> <mailto:inoue@tpf.co.jp>> wrote:
>
>         It is multithreaded; all threads use the same connection.
>
>
>     There's a possibility that the crash is caused by the multithreaded
>     application. Jack Wilson's bug report seems also related to
>     multithread and I sent him a patch to fix the bug hopefully. We will
>     release the next release when the patch fixes the bug.
>
>
>
> Thanks for letting me know. I have not had another crash since my last
> posting to the list, so nothing more to report on this.

It's also preferrable to fix the crash at backend.
The crash is caused by execute commands after commit.

regards,
Hiroshi Inoue

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

04 June 2013, 17:46:46

Hi Hiroshi,

On Mon, Jun 3, 2013 at 10:47 PM, Inoue, Hiroshi <inoue@tpf.co.jp> wrote:

It's also preferrable to fix the crash at backend.
The crash is caused by execute commands after commit.

That would be the best solution. Thanks for your help.

Kind Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Joshua Berry

Date:

04 June 2013, 18:48:42

Hi Hiroshi,

On Tue, Jun 4, 2013 at 9:46 AM, Joshua Berry <yoberi@gmail.com> wrote:

On Mon, Jun 3, 2013 at 10:47 PM, Inoue, Hiroshi <inoue@tpf.co.jp> wrote:
It's also preferrable to fix the crash at backend.
The crash is caused by execute commands after commit.

That would be the best solution. Thanks for your help.

Is there any test code that I could leverage to put together a test case which can quickly invoke the backend problem that I'm seeing? Perhaps something that is used in the pgsqlODBC project or something else you or others might have sitting around? I would like to have a testapp/function that could help verify that the issue has been fixed in a future backend patch/release.

I've never explicitly used EXECUTE. Could I construct a plpgsql script which could use EXECUTE in a similar manner as psqlODBC, thus creating a test case that would have greater portability?

Kind Regards,

-Joshua

Re: Segmentation Fault in Postgres server when using psqlODBC

From

Hiroshi Inoue

Date:

06 June 2013, 06:39:40

(2013/06/05 0:48), Joshua Berry wrote:
> Hi Hiroshi,
>
> On Tue, Jun 4, 2013 at 9:46 AM, Joshua Berry <yoberi@gmail.com
> <mailto:yoberi@gmail.com>> wrote:
>
>
>     On Mon, Jun 3, 2013 at 10:47 PM, Inoue, Hiroshi <inoue@tpf.co.jp
>     <mailto:inoue@tpf.co.jp>> wrote:
>
>         It's also preferrable to fix the crash at backend.
>         The crash is caused by execute commands after commit.
>
>     That would be the best solution. Thanks for your help.
>
>
> Is there any test code that I could leverage to put together a test case
> which can quickly invoke the backend problem that I'm seeing? Perhaps
> something that is used in the pgsqlODBC project or something else you or
> others might have sitting around? I would like to have a
> testapp/function that could help verify that the issue has been fixed in
> a future backend patch/release.

It seems difficult to provide a test code. However I can reproduce
the crash by changing 1 line of psqlodbc driver source code with a
test case. For example, the crash is fixed by the attached patch.

> I've never explicitly used EXECUTE. Could I construct a plpgsql script
> which could use EXECUTE in a similar manner as psqlODBC, thus creating a
> test case that would have greater portability?

Oops it's an Execute message used in extended query protocol not a
*EXECUTE* command.

regards,
Hiroshi Inoue

Attachment

printtup_holdable_cursor.patch