Thread: Problem with multi-job pg_restore
I think I've discovered an issue with multi-job pg_restore on a 700 GB data file created with pg_dump. Before anyone points out that the preferred procedure is to use the newest pg_dump to backup a database before doing pg_restore let just say, "Yes I'm aware of that advice and unfortunately it just isn't an option." Here is the dump file information (acquired via pg_restore -l) ; Archive created at Wed Mar 7 10:51:40 2012 ; dbname: raritan ; TOC Entries: 756 ; Compression: -1 ; Dump Version: 1.11-0 ; Format: CUSTOM ; Integer: 4 bytes ; Offset: 8 bytes ; Dumped from database version: 8.4.9 ; Dumped by pg_dump version: 8.4.9 The problem occurs during the restore when one of the bulk loads (COPY) seems to get disconnected from the restore process. I captured stdout and stderr from the pg_restore execution and there isn't a single hint of a problem. When I look at the log file in the $PGDATA/pg_log directory I found the following errors: LOG: could not send data to client: Connection reset by peer STATEMENT: COPY public.outlet_readings_rollup (id, outlet_id, rollup_interval, reading_time, min_current, max_current, average_current, min_active_power, max_active_power, average_active_power, min_apparent_power, max_apparent_power, average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage, average_voltage) TO stdout; I'm running PostgreSQL 9.1.3 on a CentOS 6 x86-64 build. I'm a developer by trade so I'm good with building from the latest source and using debugging tools as necessary. What I'm really looking for is advice on how to maximize the information I get so that I can minimize the number of times I have to run the restore. The restore process takes at least a day to complete (discounting the disconnected COPY process) and I don't have weeks to figure out what's going on. Thanks -- Brian -- /* insert witty comment here */
Brian Weaver <cmdrclueless@gmail.com> writes: > I think I've discovered an issue with multi-job pg_restore on a 700 GB > data file created with pg_dump. Just to clarify, you mean parallel restore, right? Are you using any options beyond -j, that is any sort of selective restore? > The problem occurs during the restore when one of the bulk loads > (COPY) seems to get disconnected from the restore process. I captured > stdout and stderr from the pg_restore execution and there isn't a > single hint of a problem. When I look at the log file in the > $PGDATA/pg_log directory I found the following errors: > LOG: could not send data to client: Connection reset by peer > STATEMENT: COPY public.outlet_readings_rollup (id, outlet_id, > rollup_interval, reading_time, min_current, max_current, > average_current, min_active_power, max_active_power, > average_active_power, min_apparent_power, max_apparent_power, > average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage, > average_voltage) TO stdout; I'm confused. A copy-to-stdout ought to be something that pg_dump would do, not pg_restore. Are you sure this is related at all? regards, tom lane
Here's the steps I'm taking # pg_restore -l database.dump > database.list Edit the database.list file and comment out the 'PROCEDURAL LANGUAGE - plpgsql' item. Because the dump comes from an 8.4 system without PL/PgSQL installed by default the create language emitted during pg_restore causes the restore to fail; PL/PgSQL is installed by default in 9.1. # pg_restore --jobs=24 --create --dbname=template1 --use-list=database.list --verbose database.dump After letting the restore run overnight I find the error message in the log file and still one running 'postmaster' process with the arguments 'postgresql dbname [local] COPY' still active. The pg_restore command exited with a zero status as if all went well. I checked and no other clients are connected to the system so my only thought is that this is an orphaned instance from the pg_restore. I'm hoping it's all related because I stopped all the running process that would normally access the database. I'm in the phase of testing the restore process because we're new to dealing with multi-terabyte data and we're trying to figure out our options on supporting with such large data sets going forward. I guess I could add the PID of the process generating the error in the log; if it's the pid of the leftover postgresql process that would be a bit more definitive. It will probably be tomorrow afternoon before I have any answer that include the PID given the runtime issues. -- Brian On Tue, May 1, 2012 at 12:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brian Weaver <cmdrclueless@gmail.com> writes: >> I think I've discovered an issue with multi-job pg_restore on a 700 GB >> data file created with pg_dump. > > Just to clarify, you mean parallel restore, right? Are you using any > options beyond -j, that is any sort of selective restore? > >> The problem occurs during the restore when one of the bulk loads >> (COPY) seems to get disconnected from the restore process. I captured >> stdout and stderr from the pg_restore execution and there isn't a >> single hint of a problem. When I look at the log file in the >> $PGDATA/pg_log directory I found the following errors: > >> LOG: could not send data to client: Connection reset by peer >> STATEMENT: COPY public.outlet_readings_rollup (id, outlet_id, >> rollup_interval, reading_time, min_current, max_current, >> average_current, min_active_power, max_active_power, >> average_active_power, min_apparent_power, max_apparent_power, >> average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage, >> average_voltage) TO stdout; > > I'm confused. A copy-to-stdout ought to be something that pg_dump > would do, not pg_restore. Are you sure this is related at all? > > regards, tom lane -- /* insert witty comment here */
Doh! I missed a script that was run by cron that does a nightly backup. That's the likely offender for the 'copy-to-stdout' I've removed it from the nightly run. I'll see if have any better luck with this run. Still not sure about the best way to debug the issue though. Any pointers would be appreciated. On Tue, May 1, 2012 at 12:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brian Weaver <cmdrclueless@gmail.com> writes: > > I'm confused. A copy-to-stdout ought to be something that pg_dump > would do, not pg_restore. Are you sure this is related at all? > > regards, tom lane -- /* insert witty comment here */
Brian Weaver <cmdrclueless@gmail.com> writes: > Doh! I missed a script that was run by cron that does a nightly > backup. That's the likely offender for the 'copy-to-stdout' > I've removed it from the nightly run. I'll see if have any better luck > with this run. Still not sure about the best way to debug the issue > though. Any pointers would be appreciated. Well, given that this takes so long, adding PID and time of day to your log_line_prefix would probably be a smart idea. But at this point it sounds like what failed was the cron job. Are you sure that the pg_restore didn't finish just fine? If it didn't, what's the evidence? regards, tom lane
Tom, The restore appears to have finished without a problem. The issue I have is a running instance of postgres is still active in a COPY state after the restore. The process is running full tilt, almost like it's in a tight loop condition. -- Brian On Tue, May 1, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brian Weaver <cmdrclueless@gmail.com> writes: >> Doh! I missed a script that was run by cron that does a nightly >> backup. That's the likely offender for the 'copy-to-stdout' > >> I've removed it from the nightly run. I'll see if have any better luck >> with this run. Still not sure about the best way to debug the issue >> though. Any pointers would be appreciated. > > Well, given that this takes so long, adding PID and time of day to your > log_line_prefix would probably be a smart idea. But at this point it > sounds like what failed was the cron job. Are you sure that the > pg_restore didn't finish just fine? If it didn't, what's the evidence? > > regards, tom lane -- /* insert witty comment here */
Brian Weaver <cmdrclueless@gmail.com> wrote: > The restore appears to have finished without a problem. The issue > I have is a running instance of postgres is still active in a COPY > state after the restore. The process is running full tilt, almost > like it's in a tight loop condition. It's probably dumping all that new data you just restored to ... somewhere. You'll need to review the crontab job (or maybe ps output) to figure out where. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Brian Weaver <cmdrclueless@gmail.com> wrote: >> The restore appears to have finished without a problem. The issue >> I have is a running instance of postgres is still active in a COPY >> state after the restore. The process is running full tilt, almost >> like it's in a tight loop condition. > It's probably dumping all that new data you just restored to ... > somewhere. You'll need to review the crontab job (or maybe ps > output) to figure out where. Or even more likely, it's reading all that data and deciding that it doesn't need to output it because it's not visible to its snapshot; that could look like a pretty tight loop from outside. If the connected cron job has gone away, I'd suggest just SIGINT'ing the COPY job. regards, tom lane