Thread: failure with pg_dump
This morning I came in to discover that my nightly pg_dump backup to a remote server had failed. Both the server and the client have postgres 8.0.1 installed. Figuring it was just a fluke I ran the dump again by hand and got the same error:

% pg_dump -h d01-prv -Fc -f mm.22-Mar-2005.dump vkmlm
pg_dump: socket not open
pg_dump: SQL command to dump the contents of table "user_list" failed: PQendcopy() failed.
pg_dump: Error message from server: socket not open
pg_dump: The command was: COPY public.user_list ( ... list elided for privacy concerns ... )

the postgres log on the server has this to say:

Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG: could not receive data from client: Operation timed out
Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG: unexpected EOF on client connection

the overnight dump had this error:

pg_dump: socket not open
pg_dump: SQL command to dump the contents of table "user_list" failed: PQendcopy() failed.
pg_dump: Error message from server: socket not open
pg_dump: The command was: COPY public.user_list ( ... )

Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG: could not send data to client: Broken pipe
Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG: duration: 473083.960 ms statement: [ the COPY query ]
Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG: could not receive data from client: Operation timed out
Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG: unexpected EOF on client connection

While poking through the logs for these errors, I'm finding a *lot* of broken pipe/unexpected EOF errors for this server but for connections from other hosts as well, running reports. those hosts still have the 7.4 client libraries.

This is a brand new box (rushed into production after minimal testing since one other died, so I'm not 100% certain it is stable) running FreeBSD 5.4-PRERELEASE amd64 on a dual Opteron with 4GB ram and MegaRAID RAID.

The same config on the old box with Pg 7.4.7 worked flawlessly for running reports and dumps. Another issue is that the 8.0 server is noticeably slower than the 7.4 with identical (translated to 8.0 style) configs.

Any ideas on where to poke around for the pipe problems?

Vivek Khera, Ph.D. +1-301-869-4449 x806
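[Editor's sketch] Since the message above describes digging through the server log for broken pipe/unexpected EOF errors across many connections, here is a minimal Python sketch of one way to group such lines by backend PID. It assumes the syslog line shape quoted in the message; the names LINE_RE, group_by_backend and SAMPLE are hypothetical and not part of PostgreSQL or of the poster's setup.

```python
import re
from collections import defaultdict

# Matches syslog lines shaped like the ones quoted in this thread, e.g.
# "Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG: could not send data to client: Broken pipe"
# Assumption: other syslog configurations may format these lines differently.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+"
    r"postgres\[(?P<pid>\d+)\]:\s+\[\d+-\d+\]\s+LOG:\s+(?P<msg>.*)$"
)


def group_by_backend(lines):
    """Return {pid: [(timestamp, message), ...]} for postgres LOG lines."""
    events = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # silently skip lines that are not postgres LOG entries
            events[int(m.group("pid"))].append((m.group("stamp"), m.group("msg")))
    return dict(events)


# Log lines copied from the message above.
SAMPLE = [
    "Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG: could not send data to client: Broken pipe",
    "Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG: duration: 473083.960 ms statement: [ the COPY query ]",
    "Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG: could not receive data from client: Operation timed out",
    "Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG: unexpected EOF on client connection",
    "Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG: could not receive data from client: Operation timed out",
    "Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG: unexpected EOF on client connection",
]

if __name__ == "__main__":
    for pid, events in group_by_backend(SAMPLE).items():
        print(pid)
        for stamp, msg in events:
            print("   ", stamp, msg)
```

Grouping by PID keeps each connection's sequence of events together, which makes it easier to see whether a backend reported a send failure before the receive timeout, as backend 33589 did here.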
On Tue, 2005-03-22 at 10:36, Vivek Khera wrote:
> This morning I came in to discover that my nightly pg_dump backup to a
> remote server had failed. Both the server and the client have
> postgres 8.0.1 installed. Figuring it was just a fluke I ran the dump
> again by hand and got the same error:
>
> % pg_dump -h d01-prv -Fc -f mm.22-Mar-2005.dump vkmlm
> pg_dump: socket not open
> pg_dump: SQL command to dump the contents of table "user_list" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: socket not open
> pg_dump: The command was: COPY public.user_list ( ... list elided for
> privacy concerns ... )
>
> the postgres log on the server has this to say:
>
> Mar 22 10:35:00 d01 postgres[35190]: [8-1] LOG: could not receive data
> from client: Operation timed out
> Mar 22 10:35:00 d01 postgres[35190]: [9-1] LOG: unexpected EOF on
> client connection
>
>
> the overnight dump had this error:
>
> pg_dump: socket not open
> pg_dump: SQL command to dump the contents of table "user_list" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: socket not open
> pg_dump: The command was: COPY public.user_list ( ... )
>
>
> Mar 22 03:42:10 d01 postgres[33589]: [6-1] LOG: could not send data to
> client: Broken pipe
> Mar 22 03:47:10 d01 postgres[33589]: [7-1] LOG: duration: 473083.960
> ms statement:
> [ the COPY query ]
> Mar 22 03:47:10 d01 postgres[33589]: [8-1] LOG: could not receive data
> from client: Operation timed out
> Mar 22 03:47:10 d01 postgres[33589]: [9-1] LOG: unexpected EOF on
> client connection
>
>
> While poking through the logs for these errors, I'm finding a *lot* of
> broken pipe/unexpected EOF errors for this server but for connections
> from other hosts as well, running reports. those hosts still have the
> 7.4 client libraries.
>
> This is a brand new box (rushed into production after minimal testing
> since one other died, so I'm not 100% certain it is stable) running
> FreeBSD 5.4-PRERELEASE amd64 on a dual Opteron with 4GB ram and
> MegaRAID RAID.
>
> The same config on the old box with Pg 7.4.7 worked flawlessly for
> running reports and dumps. Another issue is that the 8.0 server is
> noticeably slower than the 7.4 with identical (translated to 8.0
> style) configs.

Is there a difference in the infrastructure for this server? Like firewalls and routing?

Also, is it vacuum / analyzed often? Poor stats will cause the server to run slower.

Are you getting any output from the postgresql logs that would point to backends dying or anything like that? This sounds like a networking / client problem to me.
On Mar 22, 2005, at 11:47 AM, Scott Marlowe wrote:

>> The same config on the old box with Pg 7.4.7 worked flawlessly for
>> running reports and dumps. Another issue is that the 8.0 server is
>> noticeably slower than the 7.4 with identical (translated to 8.0
>> style) configs.
>
> Is there a difference in the infrastructure for this server? Like
> firewalls and routing?
>

Nope. I actually pulled the ethernet wire from the dead box and plugged it into this one :-) The IP number is different by 1 bit. That's pretty much the only difference between the old and new boxes other than the move to Pg 8.0.1.

> Also, is it vacuum / analyzed often? Poor stats will cause the server
> to run slower.
>

Yes, vacuum analyze regularly. The query plans seem identical except the cost estimates are a bit different in number of rows returned. The choice of plan remains the same...

>
>
> Are you getting any output from the postgresql logs that would point to
> backends dying or anything like that? This sounds like a networking /
> client problem to me.

My only guess is a bug in gcc when optimizing for Opteron, so I'm rebuilding the kernel and libraries of the base OS without those optimizations to see what happens.

I know the clients are not having problems since they've been stable for a long time.

I'm guessing nobody else is seeing arbitrary connection drops in 8.0.1, particularly on FreeBSD 5.4-PRERELEASE :-(

Funny thing is that the pg_dump worked yesterday...

Vivek Khera, Ph.D. +1-301-869-4449 x806
On Tuesday March 22 2005 9:36, Vivek Khera wrote:
> While poking through the logs for these errors, I'm finding a
> *lot* of broken pipe/unexpected EOF errors for this server but
> for connections from other hosts as well, running reports.
> those hosts still have the 7.4 client libraries.

You might look into possible network-related issues. Maybe check what is happening on the NICs (tx/rx errors? collisions?), see if that gives any clues...

Ed
On Mar 22, 2005, at 11:53 AM, Ed L. wrote:

> You might look into possible network-related issues. Maybe check
> what is happening on the NICs (tx/rx errors? collisions?), see
> if that gives any clues...
>

well, whaddya know: kernel log messages at the relevant times:

Mar 22 03:42:22 d01 kernel: bge0: watchdog timeout -- resetting
Mar 22 10:28:24 d01 kernel: bge0: watchdog timeout -- resetting

now, the question is: is it OS or is it hardware.... :-(

Vivek Khera, Ph.D. +1-301-869-4449 x806
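[Editor's sketch] To make "at the relevant times" concrete, the following minimal Python sketch lines up the two bge0 watchdog resets with the postgres errors quoted earlier in the thread, and estimates when the overnight COPY started from its logged duration. Assumptions: the year 2005 (syslog omits it), a 10-minute matching window chosen only for illustration, and an English/C locale for the month abbreviation. The names parse_stamp, errors_near, KERNEL_RESETS and PG_ERRORS are hypothetical.

```python
from datetime import datetime, timedelta

YEAR = 2005  # assumption: syslog lines carry no year; the thread is dated 2005


def parse_stamp(stamp):
    """Parse a syslog-style stamp such as 'Mar 22 03:42:22' (English month names)."""
    return datetime.strptime(f"{YEAR} {stamp}", "%Y %b %d %H:%M:%S")


# Timestamps copied from the log excerpts in this thread.
KERNEL_RESETS = ["Mar 22 03:42:22", "Mar 22 10:28:24"]
PG_ERRORS = [
    ("Mar 22 03:42:10", "could not send data to client: Broken pipe"),
    ("Mar 22 03:47:10", "could not receive data from client: Operation timed out"),
    ("Mar 22 10:35:00", "could not receive data from client: Operation timed out"),
]


def errors_near(reset_stamp, errors, window=timedelta(minutes=10)):
    """Return (offset_seconds, message) for errors within `window` of a reset.

    A negative offset means the error was logged before the reset message.
    """
    reset = parse_stamp(reset_stamp)
    hits = []
    for stamp, msg in errors:
        offset = parse_stamp(stamp) - reset
        if abs(offset) <= window:
            hits.append((int(offset.total_seconds()), msg))
    return hits


# Approximate start of the overnight COPY: completion time minus logged duration.
copy_start = parse_stamp("Mar 22 03:47:10") - timedelta(milliseconds=473083.960)

if __name__ == "__main__":
    for reset in KERNEL_RESETS:
        print(reset, errors_near(reset, PG_ERRORS))
    print("COPY started around", copy_start.strftime("%H:%M:%S"))
```

With these inputs, every quoted postgres error falls within a few minutes of one of the resets, and the COPY appears to have started roughly three minutes before the first one. The Broken pipe entry is logged 12 seconds before the first watchdog message, which would be consistent with the interface stalling before the watchdog fired, though the logs alone do not establish that.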
>>>>> "SM" == Scott Marlowe <smarlowe@g2switchworks.com> writes:

SM> Are you getting any output from the postgresql logs that would point to
SM> backends dying or anything like that? This sounds like a networking /
SM> client problem to me.

Ok... well, I rebuilt the entire OS and postgres *without* the gcc Opteron optimizations, and it seems much more stable so far (24 hours) and somewhat faster. My daily reports went faster too (as fast as with the Pg 7.4 box).

So I'm chalking this one up to a bad compiler.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/