Thread: pg_dump causes postgres crash

pg_dump causes postgres crash

From
Jeff Amiel
Date:
Fairly new (less than a week) install.
"PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by
GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)"

database size around 43 gigabytes.

2 attempts at a pg_dump across the network caused the
database to go down...

The first time I thought it was because of a mismatched
pg_dump (it was version 8.0.X)...but the second time it
was definitely the 8.2.4 version of pg_dump.

My first thought was corruption...but this database
has successfully seeded 2 slony subscriber nodes from
scratch as well as running flawlessly under heavy load
for the past week.

Even more odd is that a LOCAL pg_dump (from on the
box) succeeded just fine tonight (after the second
crash).

Thoughts?

----First Crash-------

backup-srv2 prod_backup # time /usr/bin/pg_dump
--format=c --compress=9 --ignore-version
--username=backup --host=prod_server prod > x

pg_dump: server version: 8.2.4; pg_dump version:
8.0.13
pg_dump: proceeding despite version mismatch
pg_dump: WARNING:  terminating connection because of
crash of another server process
DETAIL:  The postmaster has commanded this server
process to roll back the current transaction and exit,
because another server process exited abnormally and
possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to
the database and repeat your command.
pg_dump: server closed the connection unexpectedly
       This probably means the server terminated
abnormally
       before or while processing the request.
pg_dump: SQL command to dump the contents of table
"access_logs" failed: PQendcopy() failed.
pg_dump: Error message from server: server closed the
connection unexpectedly
       This probably means the server terminated
abnormally
       before or while processing the request.
pg_dump: The command was: COPY public.access_logs (ip,
username, "action", date, params) TO stdout;
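
The client and server versions can be confirmed up
front before dumping; the user and host below are
simply the ones from the command above:

/usr/bin/pg_dump --version
psql --username=backup --host=prod_server -c "SELECT version();" prod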


------Second Crash--------

backup-srv2 ~ # time /usr/bin/pg_dump --format=c
--compress=9  --username=backup --host=prod_server
prod | wc -l
pg_dump: Dumping the contents of table "audit" failed:
PQgetCopyData() failed.
pg_dump: Error message from server: server closed the
connection unexpectedly
       This probably means the server terminated
abnormally
       before or while processing the request.
pg_dump: The command was: COPY public.audit (audit_id,
entity_id, table_name, serial_id, audit_action,
when_ts, user_id, user_ip) TO stdout;










Re: pg_dump causes postgres crash

From
Jeff Amiel
Date:
From the logs tonight when the second crash occurred..

Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
local0.info] [6-1] 2007-08-22 20:45:12 CDT   LOG:
received smart shutdown request
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
local0.info] [7-1] 2007-08-22 20:45:12 CDT   LOG:
server process (PID 20188) was terminated by signal 11
Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
local0.info] [8-1] 2007-08-22 20:45:12 CDT   LOG:
terminating any other active server processes

There was a core file created...but I believe I do not
have postgresql compiled with debug info (well, a
pstack provided nothing useful):
pstack core  |more
core 'core' of 20188:   /usr/local/pgsql/bin/postgres
-D /db
 fee8ec23 sk_value (10023d, 105d8b00, d2840f,
1c7f0000, f20f883, 10584) + 33
 0c458b51 ???????? (0, 0, 511f1600, 2000400, ff001c09,
467f71ea)
 00000000 ???????? ()
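
If gdb is installed on the box, it can sometimes pull a
more readable backtrace out of that same core than
pstack manages, even without debug symbols; the binary
path here is just the one from the pstack output:

gdb /usr/local/pgsql/bin/postgres core
(gdb) bt
(gdb) quit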

Once again...a local pg_dump worked just fine 30
minutes later.

We have introduced some new network architecture which
has been acting odd lately (Dell managed switches,
NetScreen SSGs, etc.), and the database itself resides
on a ZFS partition on a Pillar SAN (connected via Fibre
Channel).

Any thoughts would be appreciated.




Re: pg_dump causes postgres crash

From
Tom Lane
Date:
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Even more odd is that a LOCAL pg_dump (from on the
> box) succeeded just fine tonight (after the second
> crash).

That seems to eliminate the theory of a crash due to data corruption
... unless the corruption somehow repaired itself in the intervening
30 minutes, which hardly seems likely.

> ----First Crash-------

> backup-srv2 prod_backup # time /usr/bin/pg_dump
> --format=c --compress=9 --ignore-version
> --username=backup --host=prod_server prod > x
> pg_dump: server version: 8.2.4; pg_dump version:
> 8.0.13
> pg_dump: proceeding despite version mismatch
> pg_dump: WARNING:  terminating connection because of
> crash of another server process
> DETAIL:  The postmaster has commanded this server
> process to roll back the current transaction and exit,
> because another server process exited abnormally and
> possibly corrupted shared memory.

Notice that pg_dump is showing that the crash was in some OTHER server
process, not the one it was attached to.

> ------Second Crash--------

> backup-srv2 ~ # time /usr/bin/pg_dump --format=c
> --compress=9  --username=backup --host=prod_server
> prod | wc -l
> pg_dump: Dumping the contents of table "audit" failed:
> PQgetCopyData() failed.
> pg_dump: Error message from server: server closed the
> connection unexpectedly
>        This probably means the server terminated
> abnormally
>        before or while processing the request.
> pg_dump: The command was: COPY public.audit (audit_id,

This one looks more like it might have been the directly connected
server process that crashed.  However, your postmaster log from
the other message:

> From the logs tonight when the second crash occurred..
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [6-1] 2007-08-22 20:45:12 CDT   LOG:
> received smart shutdown request
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [7-1] 2007-08-22 20:45:12 CDT   LOG:
> server process (PID 20188) was terminated by signal 11
> Aug 22 20:45:12 db-1 postgres[5805]: [ID 748848
> local0.info] [8-1] 2007-08-22 20:45:12 CDT   LOG:
> terminating any other active server processes

raises still more questions: where the heck did the "smart shutdown
request" (that is to say, a SIGTERM interrupt to the postmaster) come
from?  It's far too much of a coincidence for that to have occurred
within a second of detecting the server process crash.
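
If DTrace is usable on that Solaris 10 box, a one-liner
along these lines would at least show which process sends
the next SIGTERM (15 is hard-coded as the signal number;
this is only a sketch):

dtrace -qn 'proc:::signal-send /args[2] == 15/ {
    printf("%Y %s (pid %d) sent SIGTERM to pid %d\n",
        walltimestamp, execname, pid, args[1]->pr_pid); }'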

> We have introduced some new network architecture which
> has been acting odd lately (Dell managed switches,
> NetScreen SSGs, etc.), and the database itself resides
> on a ZFS partition on a Pillar SAN (connected via Fibre
> Channel).

I can't help thinking you are looking at generalized system
instability.  Maybe someone knocked a few cables loose while
installing new network hardware?

            regards, tom lane

Re: pg_dump causes postgres crash

From
Jeff Amiel
Date:
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I can't help thinking you are looking at generalized
> system
> instability.  Maybe someone knocked a few cables
> loose while
> installing new network hardware?

Database server/storage instability or network
instability?

There is no doubt that there is something flaky about
the networking between the db server and the box(es)
trying to do the pg_dump.  We have indeed had issues
(timeouts, halts, etc.) moving large quantities of data
across various segments to and from these boxes...like
the db server...but how would this affect something
like a pg_dump?

Would a good stack trace (assuming I want to crash my
database again) help here?






Re: pg_dump causes postgres crash

From
Tom Lane
Date:
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> Would a good stack trace (assuming I want to crash my
> database again) help here?

Well, it'd be more information than we have now ...

            regards, tom lane
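
Rebuilding 8.2.4 with debug symbols is the usual way to
make the next core file yield a usable backtrace; a
minimal sketch, assuming the 8.2.4 source tree is at hand
and the install prefix matches the pstack output above:

# from the 8.2.4 source tree; --enable-debug adds -g
./configure --prefix=/usr/local/pgsql --enable-debug
gmake
gmake install

# after restarting the postmaster and reproducing the
# remote pg_dump crash:
gdb /usr/local/pgsql/bin/postgres core
(gdb) bt full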