Re: [ODBC] Segmentation Fault in Postgres server when using psqlODBC - Mailing list pgsql-bugs
From: Hiroshi Inoue
Subject: Re: [ODBC] Segmentation Fault in Postgres server when using psqlODBC
Msg-id: 519FDAB3.3010009@tpf.co.jp
In response to: Segmentation Fault in Postgres server when using psqlODBC (Joshua Berry <yoberi@gmail.com>)
List: pgsql-bugs
Hi,

Psqlodbc drivers send Execute requests for cursors instead of issuing
FETCH commands (a minimal sketch of the difference is interleaved
below).

regards,
Hiroshi Inoue

(2013/05/25 1:55), Joshua Berry wrote:
> Hi Groups,
>
> I'm dealing with periodic backend process segmentation faults. I'm
> posting to both the bugs and odbc lists as it seems that my
> application's use of psqlODBC triggers a bug in the postgres backend.
>
> The environment is:
> Clients: win32 clients using various versions of the psqlODBC driver
> (mostly 8.04.0200). The connection string contains "UseDeclareFetch=1",
> which causes long idle transactions and heavy cursor and savepoint use.
> Server: Dell dual Xeon x86, 48GB RAM (12GB PG shared mem), RHEL system
> with PostgreSQL 9.2.4. Note that these same issues occurred on PG9.1
> and PG8.4.
>
> I've experienced these issues for over a year now. However, during
> that time several things have changed which may or may not be related:
> * The schema has been modified with heavier use of triggers to update
> manually created materialized views (i.e. not using the 9.3 CREATE
> MATERIALIZED VIEW).
> * The number of concurrent users has increased (from perhaps 15
> concurrent users two years ago to 30 concurrent users now).
> * The PG version used has changed from 8.4, to 9.1, and finally to
> 9.2.
> * I've recently tuned the planner cost constants in order to favor
> index scans over table scans in more situations.
>
> I've looked through past error logs and found that I had a segfault
> in the server process while using PG 8.4 about a dozen times over a
> 12-month period. Each time one client's postgres process crashes, the
> backend forcefully closes all active connections due to possibly
> corrupted memory and then restarts. This leaves all active clients
> stranded as the connection is closed, and all cursor and savepoint
> info is lost. My app doesn't recover gracefully, and users are forced
> to click through some cryptic error messages from the application
> framework used for the app (Clarion) and then restart the app.
>
> A few months ago, I upgraded to another server with PG 9.1, assuming
> that the issue with the previous server with 8.4 was due to bad RAM,
> as I did observe a high ECC error count on the previous system as
> logged by the IPMI controller. However, I very quickly had another
> segfault on the new system with PG9.1. The default settings of the OS
> and PG init script disabled core dumps, so I have only been collecting
> core files for the past few months.
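To make the distinction concrete, here is a minimal libpq sketch of the
ordinary FETCH path (the one the regression tests exercise); the table
name and connection string are placeholders. With UseDeclareFetch=1,
psqlODBC instead drives the held cursor's portal directly with
protocol-level Execute messages, something plain PQexec() never
produces -- that is the code path implicated in the backtrace quoted
below.

/*
 * Minimal libpq sketch of the "safe" cursor path: each FETCH is an
 * ordinary query, executed in a portal created for the FETCH command
 * rather than directly on the held portal "c".
 * "big_table" and the connection string are placeholders.
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=testdb");    /* placeholder */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* A holdable cursor survives the transaction that created it. */
    PQclear(PQexec(conn, "BEGIN"));
    PQclear(PQexec(conn, "DECLARE c CURSOR WITH HOLD FOR "
                         "SELECT * FROM big_table"));
    PQclear(PQexec(conn, "COMMIT"));

    /* Fetch in batches; each FETCH runs with its own portal and
     * resource owner, not directly on the held portal. */
    PGresult *res = PQexec(conn, "FETCH 10 FROM c");
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("fetched %d rows\n", PQntuples(res));
    PQclear(res);

    PQclear(PQexec(conn, "CLOSE c"));
    PQfinish(conn);
    return 0;
}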
>
> I posted details in pgsql-general April 10, 2013; here is a link to
> the thread:
> http://www.postgresql.org/message-id/CAPmZXM03MEDEn6nqqf_Phs3M1DK-EaXP5_K-LmirneOJMAQ-Hg@mail.gmail.com
>
> The crash is always some variation of the following stack, as
> observed in both PG91 and PG92 crashes:
> (gdb) bt
> #0  ResourceOwnerEnlargeCatCacheRefs (owner=0x0) at resowner.c:603
> #1  0x000000000070f372 in SearchCatCache (cache=0x27fab90, v1=<value
> optimized out>, v2=<value optimized out>, v3=<value optimized out>,
> v4=<value optimized out>) at catcache.c:1136
> #2  0x000000000071b1ae in getTypeOutputInfo (type=20,
> typOutput=0x2b3db80, typIsVarlena=0x2b3db88 "") at lsyscache.c:2482
> #3  0x000000000045d127 in printtup_prepare_info (myState=0x2810290,
> typeinfo=0x29ad7b0, numAttrs=42) at printtup.c:263
> #4  0x000000000045d4c4 in printtup (slot=0x3469650, self=0x2810290)
> at printtup.c:297
> #5  0x000000000065a76a in RunFromStore (portal=0x285fed0,
> direction=<value optimized out>, count=10, dest=0x2810290) at
> pquery.c:1122
> #6  0x000000000065a852 in PortalRunSelect (portal=0x285fed0,
> forward=<value optimized out>, count=10, dest=0x2810290) at
> pquery.c:940
> #7  0x000000000065bcf8 in PortalRun (portal=0x285fed0, count=10,
> isTopLevel=1 '\001', dest=0x2810290, altdest=0x2810290,
> completionTag=0x7fffee67f7c0 "") at pquery.c:788
> #8  0x0000000000659552 in exec_execute_message (argc=<value optimized
> out>, argv=<value optimized out>, dbname=0x2768370 "DBNAME",
> username=<value optimized out>) at postgres.c:1929
> #9  PostgresMain (argc=<value optimized out>, argv=<value optimized
> out>, dbname=0x2768370 "DBNAME", username=<value optimized out>) at
> postgres.c:4016
> #10 0x0000000000615161 in BackendRun () at postmaster.c:3614
> #11 BackendStartup () at postmaster.c:3304
> #12 ServerLoop () at postmaster.c:1367
> #13 0x0000000000617dcc in PostmasterMain (argc=<value optimized out>,
> argv=<value optimized out>) at postmaster.c:1127
> #14 0x00000000005b6830 in main (argc=5, argv=0x2766480) at main.c:199
> (gdb)
>
> Andres Freund and Tom Lane spent some time tracking down possible
> root causes, with some of the following summaries:
>
> Tom Lane writes:
> |Andres Freund <andres@2ndquadrant.com> writes:
> |> Tom: It looks to me like printtup_prepare_info won't normally be
> |> called for a held cursor. But if some concurrent DDL changed the
> |> number of columns in typeinfo vs. those in the receiver, that could
> |> explain the issue and why it's not seen all the time, right?
> |
> |It looks to me like there are probably two triggering conditions:
> |
> |1. Client is doing a direct protocol Execute on a held-cursor portal.
> |
> |2. Cache flush occurs to drop the syscache entries needed by
> |getTypeOutputInfo. (Otherwise, they'd still be around from when the
> |held cursor was created.)
> |
> |The first of these explains why we don't see it in the
> |CLOBBER_CACHE_ALWAYS buildfarm runs, and the second one explains why
> |Joshua is only seeing it intermittently and not every darn time his
> |application does that.
>
> Andres Freund writes:
> |On 2013-04-10 19:06:12 -0400, Tom Lane wrote:
> |> I wrote:
> |> > (Wanders away wondering just how much the regression tests
> |> > exercise holdable cursors.)
> |>
> |> And the answer is they're not testing this code path at all,
> |> because if you do
> |> DECLARE c CURSOR WITH HOLD FOR ...
> |> FETCH ALL FROM c;
> |> then the second query executes with a portal (and resource owner)
> |> created to execute the FETCH command, not directly on the held
> |> portal.
> |>
> |> After a little bit of thought I'm not sure it's even possible to
> |> reproduce this problem with libpq, because it doesn't expose any
> |> way to issue a bare protocol Execute command against a pre-existing
> |> portal. (I had thought psqlODBC went through libpq, but maybe it's
> |> playing some games here.)
> |
> |Hm. PQexecPrepared() looks like it can do that.
>
> I didn't find that any bugs were officially reported/documented, so I
> wanted to follow up with more detail and follow the guidelines here:
> http://www.postgresql.org/docs/current/static/bug-reporting.html
> Specifically, I decided to update to the latest stable release,
> currently PG9.2.4.
>
> After updating to PG92 three days ago, I've had two of the same
> segfaults. Either I'm very (un)lucky or the intermittent issue is
> occurring more frequently with PG92.
>
> On a test machine running a snapshot of the production database, and
> with the backend configured to log all queries, I have an example of
> the type of traffic that client apps will perform:
> https://dl.dropboxusercontent.com/u/4715250/pg92_psqlODBC_08.04.0200.txt
> For that logfile, the log was cleared, the client app was started, a
> window was opened, it was closed, and the app was closed. I installed
> psqlODBC version 09.01.0200, and aside from cursor names and
> timestamps differing, all other lines were identical.
>
> I do not currently have a testing environment which can emulate the
> production environment closely enough to trigger this bug. I've never
> had this crash outside of the production environment.
>
> Any thoughts/suggestions?
>
> Regards,
> -Joshua
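For reference, here is a minimal sketch of the PQexecPrepared() route
Andres points at above: unlike PQexec(), it drives the extended
protocol (Bind/Execute messages) rather than sending a plain textual
query. The statement name, table, and connection string are
placeholders; whether this route can reach the same
bare-Execute-on-a-held-portal path as psqlODBC is the open question in
the quoted exchange.

/*
 * Sketch of libpq's extended-protocol route: PQprepare() sends Parse,
 * and each PQexecPrepared() call sends Bind + Execute against the
 * prepared statement without re-sending the query text.
 * "stmt1", "big_table", and the connection string are placeholders.
 */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=testdb");    /* placeholder */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Parse the statement once on the server side... */
    PGresult *res = PQprepare(conn, "stmt1",
                              "SELECT * FROM big_table WHERE id > $1",
                              1, NULL);
    PQclear(res);

    /* ...then each execution goes over the wire as protocol-level
     * Bind/Execute messages. */
    const char *params[1] = { "100" };
    res = PQexecPrepared(conn, "stmt1", 1, params, NULL, NULL, 0);
    if (PQresultStatus(res) == PGRES_TUPLES_OK)
        printf("got %d rows\n", PQntuples(res));
    PQclear(res);

    PQfinish(conn);
    return 0;
}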