Thread: failure with pg_dump

failure with pg_dump

From
Vivek Khera
Date:
This morning I came in to discover that my nightly pg_dump backup to a
remote server had failed.  Both the server and the client have
postgres 8.0.1 installed.  Figuring it was just a fluke I ran the dump
again by hand and got the same error:

% pg_dump -h d01-prv -Fc -f mm.22-Mar-2005.dump vkmlm
pg_dump: socket not open
pg_dump: SQL command to dump the contents of table "user_list" failed:
PQendcopy() failed.
pg_dump: Error message from server: socket not open
pg_dump: The command was: COPY public.user_list ( ... list elided for
privacy concerns ... )

The postgres log on the server has this to say:

Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG:  could not receive data
from client: Operation timed out
Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG:  unexpected EOF on
client connection


The overnight dump had this error:

pg_dump: socket not open
pg_dump: SQL command to dump the contents of table "user_list" failed:
PQendcopy() failed.
pg_dump: Error message from server: socket not open
pg_dump: The command was: COPY public.user_list ( ... )


Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG:  could not send data to
client: Broken pipe
Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG:  duration: 473083.960
ms  statement:
  [ the COPY query ]
Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG:  could not receive data
from client: Operation timed out
Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG:  unexpected EOF on
client connection


While poking through the logs for these errors, I'm finding a *lot* of
broken pipe/unexpected EOF errors on this server, not just for the dump
but for connections from other hosts running reports as well.  Those
hosts still have the 7.4 client libraries.
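
To get a feel for how widespread the dropped connections are, the matching log lines can be tallied per backend PID. The following is only a minimal sketch: it assumes the syslog prefix format shown in the excerpts above, and the names `count_connection_errors`, `LINE_RE`, `DROP_PREFIXES` and `SAMPLE` are made up for illustration. The sample lines are copied from this message as they appear, wrapped by the mail client.

```python
import re
from collections import Counter

# Syslog-style prefix as it appears in the log excerpts above, e.g.
#   Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG:  could not send data to
LINE_RE = re.compile(
    r"^\w{3}\s+\d+ \d{2}:\d{2}:\d{2} \S+ "
    r"postgres\[(?P<pid>\d+)\]: \[\d+-\d+\] LOG:\s+(?P<msg>.*)$"
)

# Message prefixes that indicate a dropped client connection.
# Prefix matching works whether or not the line was wrapped by the mailer.
DROP_PREFIXES = (
    "could not send data to",
    "could not receive data",
    "unexpected EOF on",
)


def count_connection_errors(lines):
    """Return a Counter mapping backend PID -> number of dropped-connection lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Continuation lines (no syslog prefix) and other messages are skipped.
        if m and m.group("msg").startswith(DROP_PREFIXES):
            counts[int(m.group("pid"))] += 1
    return counts


# Lines taken from the excerpts in this message.
SAMPLE = [
    "Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG:  could not send data to",
    "client: Broken pipe",
    "Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG:  duration: 473083.960",
    "Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG:  could not receive data",
    "Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG:  unexpected EOF on",
    "Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG:  could not receive data",
    "Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG:  unexpected EOF on",
]

if __name__ == "__main__":
    for pid, n in count_connection_errors(SAMPLE).most_common():
        print(pid, n)
```

Run against the full server log instead of `SAMPLE`, a tally like this would show whether the drops are confined to the dump backends or spread across the report connections too.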

This is a brand new box (rushed into production after minimal testing
since one other died, so I'm not 100% certain it is stable) running
FreeBSD 5.4-PRERELEASE amd64 on a dual Opteron with 4GB ram and
MegaRAID RAID.

The same config on the old box with Pg 7.4.7 worked flawlessly for
running reports and dumps.  Another issue is that the 8.0 server is
noticeably slower than the 7.4 one with identical (translated to 8.0
style) configs.

Any ideas on where to poke around for the pipe problems?


Vivek Khera, Ph.D.
+1-301-869-4449 x806


Re: failure with pg_dump

From
Scott Marlowe
Date:
On Tue, 2005-03-22 at 10:36, Vivek Khera wrote:
> This morning I came in to discover that my nightly pg_dump backup to a
> remote server had failed.  Both the server and the client have
> postgres 8.0.1 installed.  Figuring it was just a fluke I ran the dump
> again by hand and got the same error:
>
> % pg_dump -h d01-prv -Fc -f mm.22-Mar-2005.dump vkmlm
> pg_dump: socket not open
> pg_dump: SQL command to dump the contents of table "user_list" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: socket not open
> pg_dump: The command was: COPY public.user_list ( ... list elided for
> privacy concerns ... )
>
> the postgres log on the server has this to say:
>
> Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG:  could not receive data
> from client: Operation timed out
> Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG:  unexpected EOF on
> client connection
>
>
> the overnight dump had this error:
>
> pg_dump: socket not open
> pg_dump: SQL command to dump the contents of table "user_list" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: socket not open
> pg_dump: The command was: COPY public.user_list ( ... )
>
>
> Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG:  could not send data to
> client: Broken pipe
> Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG:  duration: 473083.960
> ms  statement:
>   [ the COPY query ]
> Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG:  could not receive data
> from client: Operation timed out
> Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG:  unexpected EOF on
> client connection
>
>
> While poking through the logs for these errors, I'm finding a *lot* of
> broken pipe/unexpected EOF errors for this server but for connections
> from other hosts as well, running reports.  those hosts still have the
> 7.4 client libraries.
>
> This is a brand new box (rushed into production after minimal testing
> since one other died, so I'm not 100% certain it is stable) running
> FreeBSD 5.4-PRERELEASE amd64 on a dual Opteron with 4GB ram and
> MegaRAID RAID.
>
> The same config on the old box with Pg 7.4.7 worked flawlessly for
> running reports and dumps.  Another issue is that the 8.0 server is
> noticeably slower than the 7.4 with identically (translated to 8.0
> style) configs.

Is there a difference in the infrastructure for this server, such as
firewalls or routing?

Also, is it vacuumed / analyzed often?  Poor stats will cause the server
to run slower.

Are you getting any output from the postgresql logs that would point to
backends dying or anything like that?  This sounds like a networking /
client problem to me.

Re: failure with pg_dump

From
Vivek Khera
Date:
On Mar 22, 2005, at 11:47 AM, Scott Marlowe wrote:

>> The same config on the old box with Pg 7.4.7 worked flawlessly for
>> running reports and dumps.  Another issue is that the 8.0 server is
>> noticeably slower than the 7.4 with identically (translated to 8.0
>> style) configs.
>
> IS there a difference in the infrastructure for this server?  Like
> firewalls and routing?
>

Nope.  I actually pulled the ethernet wire from the dead box and
plugged it into this one :-)  The IP number is different by 1 bit.
That's pretty much the only difference in the old and new boxes other
than the move to Pg 8.0.1.


> Also, is it vacuum / analyzed often?  Poor stats will cause the server
> to run slower.
>

Yes, I vacuum analyze regularly.  The query plans seem identical except
that the cost estimates differ a bit in the number of rows returned.  The
choice of plan remains the same...

>
>
> Are you getting any output from the postgresql logs that would point to
> backends dying or anything like that?  This sounds like a networking /
> client problem to me.

My only guess is a bug in gcc when optimizing for Opteron, so I'm
rebuilding the kernel and libraries of the base OS without those
optimizations to see what happens.  I know the clients are not having
problems since they've been stable for a long time.

I'm guessing nobody else is seeing arbitrary connection drops in 8.0.1,
particularly on FreeBSD 5.4-PRERELEASE :-(

Funny thing is that the pg_dump worked yesterday...

Vivek Khera, Ph.D.
+1-301-869-4449 x806


Re: failure with pg_dump

From
"Ed L."
Date:
On Tuesday March 22 2005 9:36, Vivek Khera wrote:
> While poking through the logs for these errors, I'm finding a
> *lot* of broken pipe/unexpected EOF errors for this server but
> for connections from other hosts as well, running reports.
>  those hosts still have the 7.4 client libraries.

You might look into possible network-related issues.  Maybe check
what is happening on the NICs (tx/rx errors?  collisions?), see
if that gives any clues...

Ed


Re: failure with pg_dump

From
Vivek Khera
Date:
On Mar 22, 2005, at 11:53 AM, Ed L. wrote:

> You might look into possible network-related issues.  Maybe check
> what is happening on the NICs (tx/rx errors?  collisions?), see
> if that gives any clues...
>

Well, whaddya know, there are kernel log messages at the relevant times:

Mar 22 03:42:22 d01 kernel: bge0: watchdog timeout -- resetting
Mar 22 10:28:24 d01 kernel: bge0: watchdog timeout -- resetting

Now the question is: is it the OS or is it the hardware.... :-(
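
The match between the interface resets and the dropped connections can be checked mechanically by pairing timestamps from the two logs. This is a rough sketch only: the function names, the ten-minute window and the year 2005 (syslog lines carry no year) are assumptions for illustration, and `%b` parsing assumes an English locale. The sample lines are the kernel messages above and the postgres messages from earlier in the thread, with the mail wrapping undone.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year=2005):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (first 15 chars).

    Syslog omits the year, so the caller supplies it (assumed here).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def errors_near_resets(kernel_lines, postgres_lines, window=timedelta(minutes=10)):
    """Map each NIC watchdog reset time to the postgres lines within `window` of it."""
    resets = [parse_stamp(l) for l in kernel_lines if "watchdog timeout" in l]
    return {
        reset: [l for l in postgres_lines if abs(parse_stamp(l) - reset) <= window]
        for reset in resets
    }


# Lines taken from this thread (postgres lines joined back into single lines).
KERNEL_LINES = [
    "Mar 22 03:42:22 d01 kernel: bge0: watchdog timeout -- resetting",
    "Mar 22 10:28:24 d01 kernel: bge0: watchdog timeout -- resetting",
]
POSTGRES_LINES = [
    "Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG:  could not send data to client: Broken pipe",
    "Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG:  duration: 473083.960 ms  statement:",
    "Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG:  could not receive data from client: Operation timed out",
    "Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG:  unexpected EOF on client connection",
    "Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG:  could not receive data from client: Operation timed out",
    "Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG:  unexpected EOF on client connection",
]

if __name__ == "__main__":
    for reset, lines in errors_near_resets(KERNEL_LINES, POSTGRES_LINES).items():
        print(reset, len(lines))
```

The comparison uses an absolute difference because the first broken pipe (03:42:10) is logged a few seconds before the watchdog message (03:42:22); presumably the watchdog only fires once the interface has already been stuck for a while.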

Vivek Khera, Ph.D.
+1-301-869-4449 x806


Re: failure with pg_dump

From
Vivek Khera
Date:
>>>>> "SM" == Scott Marlowe <smarlowe@g2switchworks.com> writes:

SM> Are you getting any output from the postgresql logs that would point to
SM> backends dying or anything like that?  This sounds like a networking /
SM> client problem to me.

Ok... well, I rebuilt the entire OS and postgres *without* the gcc
Opteron optimizations, and it seems much more stable so far (24 hours)
and somewhat faster.  My daily reports went faster too (as fast as
with the Pg 7.4 box).

So I'm chalking this one up to a bad compiler.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/