Thread: Problem with multi-job pg_restore

Problem with multi-job pg_restore

From
Brian Weaver
Date:
I think I've discovered an issue with multi-job pg_restore on a 700 GB
data file created with pg_dump. Before anyone points out that the
preferred procedure is to use the newest pg_dump to back up a database
before doing pg_restore, let me just say, "Yes, I'm aware of that advice
and unfortunately it just isn't an option."

Here is the dump file information (acquired via pg_restore -l)

; Archive created at Wed Mar  7 10:51:40 2012
;     dbname: raritan
;     TOC Entries: 756
;     Compression: -1
;     Dump Version: 1.11-0
;     Format: CUSTOM
;     Integer: 4 bytes
;     Offset: 8 bytes
;     Dumped from database version: 8.4.9
;     Dumped by pg_dump version: 8.4.9

The problem occurs during the restore when one of the bulk loads
(COPY) seems to get disconnected from the restore process. I captured
stdout and stderr from the pg_restore execution and there isn't a
single hint of a problem. When I looked at the log file in the
$PGDATA/pg_log directory I found the following error:

LOG:  could not send data to client: Connection reset by peer
STATEMENT:  COPY public.outlet_readings_rollup (id, outlet_id,
rollup_interval, reading_time, min_current, max_current,
average_current, min_active_power, max_active_power,
average_active_power, min_apparent_power, max_apparent_power,
average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage,
average_voltage) TO stdout;

I'm running PostgreSQL 9.1.3 on a CentOS 6 x86-64 build. I'm a
developer by trade so I'm good with building from the latest source
and using debugging tools as necessary. What I'm really looking for is
advice on how to maximize the information I get so that I can minimize
the number of times I have to run the restore. The restore process
takes at least a day to complete (discounting the disconnected COPY
process) and I don't have weeks to figure out what's going on.

Thanks

-- Brian
-- 

/* insert witty comment here */


Re: Problem with multi-job pg_restore

From
Tom Lane
Date:
Brian Weaver <cmdrclueless@gmail.com> writes:
> I think I've discovered an issue with multi-job pg_restore on a 700 GB
> data file created with pg_dump.

Just to clarify, you mean parallel restore, right?  Are you using any
options beyond -j, that is any sort of selective restore?

> The problem occurs during the restore when one of the bulk loads
> (COPY) seems to get disconnected from the restore process. I captured
> stdout and stderr from the pg_restore execution and there isn't a
> single hint of a problem. When I looked at the log file in the
> $PGDATA/pg_log directory I found the following error:

> LOG:  could not send data to client: Connection reset by peer
> STATEMENT:  COPY public.outlet_readings_rollup (id, outlet_id,
> rollup_interval, reading_time, min_current, max_current,
> average_current, min_active_power, max_active_power,
> average_active_power, min_apparent_power, max_apparent_power,
> average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage,
> average_voltage) TO stdout;

I'm confused.  A copy-to-stdout ought to be something that pg_dump
would do, not pg_restore.  Are you sure this is related at all?
        regards, tom lane


Re: Problem with multi-job pg_restore

From
Brian Weaver
Date:
Here are the steps I'm taking:

# pg_restore -l database.dump > database.list

Edit the database.list file and comment out the 'PROCEDURAL LANGUAGE -
plpgsql' item. The dump comes from an 8.4 system, where PL/pgSQL is
not installed by default, so it contains a CREATE LANGUAGE entry;
PL/pgSQL is installed by default in 9.1, so replaying that entry
causes the restore to fail.

# pg_restore --jobs=24 --create --dbname=template1
--use-list=database.list --verbose database.dump
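
In case it's useful, that list edit can be scripted rather than done
by hand; this is only a sketch and assumes the TOC entry is a single
line containing the text 'PROCEDURAL LANGUAGE - plpgsql':

# sed -i '/PROCEDURAL LANGUAGE - plpgsql/ s/^/;/' database.list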

After letting the restore run overnight I found the error message in
the log file and one 'postmaster' process, with the arguments
'postgresql dbname [local] COPY', still active. The pg_restore command
exited with a zero status as if all went well. I checked and no other
clients are connected to the system, so my only thought is that this
is an orphaned instance from the pg_restore.

I'm hoping it's all related, because I stopped all the running
processes that would normally access the database. I'm in the phase of
testing the restore process because we're new to dealing with
multi-terabyte data and we're trying to figure out our options for
supporting such large data sets going forward.

I guess I could add the PID of the process generating the error to
the log; if it's the PID of the leftover postgresql process, that
would be a bit more definitive. It will probably be tomorrow afternoon
before I have any answer that includes the PID, given how long each
run takes.
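
In the meantime I can probably cross-check the leftover backend from
SQL; on 9.1 pg_stat_activity still uses the procpid/current_query
column names, so something like this (untested) should show its PID
and how long it has been running:

SELECT procpid, usename, query_start, current_query
  FROM pg_stat_activity
 WHERE current_query LIKE 'COPY%';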

-- Brian

On Tue, May 1, 2012 at 12:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brian Weaver <cmdrclueless@gmail.com> writes:
>> I think I've discovered an issue with multi-job pg_restore on a 700 GB
>> data file created with pg_dump.
>
> Just to clarify, you mean parallel restore, right?  Are you using any
> options beyond -j, that is any sort of selective restore?
>
>> The problem occurs during the restore when one of the bulk loads
>> (COPY) seems to get disconnected from the restore process. I captured
>> stdout and stderr from the pg_restore execution and there isn't a
>> single hint of a problem. When I looked at the log file in the
>> $PGDATA/pg_log directory I found the following error:
>
>> LOG:  could not send data to client: Connection reset by peer
>> STATEMENT:  COPY public.outlet_readings_rollup (id, outlet_id,
>> rollup_interval, reading_time, min_current, max_current,
>> average_current, min_active_power, max_active_power,
>> average_active_power, min_apparent_power, max_apparent_power,
>> average_apparent_power, watt_hour, pdu_id, min_voltage, max_voltage,
>> average_voltage) TO stdout;
>
> I'm confused.  A copy-to-stdout ought to be something that pg_dump
> would do, not pg_restore.  Are you sure this is related at all?
>
>                        regards, tom lane



--

/* insert witty comment here */


Re: Problem with multi-job pg_restore

From
Brian Weaver
Date:
Doh! I missed a script that was run by cron that does a nightly
backup. That's the likely offender for the 'copy-to-stdout'.

I've removed it from the nightly run. I'll see if I have any better
luck with this run. I'm still not sure about the best way to debug the
issue, though. Any pointers would be appreciated.

On Tue, May 1, 2012 at 12:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brian Weaver <cmdrclueless@gmail.com> writes:
>
> I'm confused.  A copy-to-stdout ought to be something that pg_dump
> would do, not pg_restore.  Are you sure this is related at all?
>
>                        regards, tom lane



--

/* insert witty comment here */


Re: Problem with multi-job pg_restore

From
Tom Lane
Date:
Brian Weaver <cmdrclueless@gmail.com> writes:
> Doh! I missed a script that was run by cron that does a nightly
> backup. That's the likely offender for the 'copy-to-stdout'.

> I've removed it from the nightly run. I'll see if I have any better
> luck with this run. I'm still not sure about the best way to debug the
> issue, though. Any pointers would be appreciated.

Well, given that this takes so long, adding PID and time of day to your
log_line_prefix would probably be a smart idea.  But at this point it
sounds like what failed was the cron job.  Are you sure that the
pg_restore didn't finish just fine?  If it didn't, what's the evidence?
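
For instance, log_line_prefix = '%m [%p] ' in postgresql.conf would do
it: %m is a timestamp with milliseconds and %p is the backend PID, and
the setting only needs a reload (not a restart) to take effect.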
        regards, tom lane


Re: Problem with multi-job pg_restore

From
Brian Weaver
Date:
Tom,

The restore appears to have finished without a problem. The issue I
have is that an instance of postgres is still active in a COPY state
after the restore. The process is running full tilt, almost as if it's
stuck in a tight loop.

-- Brian

On Tue, May 1, 2012 at 1:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brian Weaver <cmdrclueless@gmail.com> writes:
>> Doh! I missed a script that was run by cron that does a nightly
>> backup. That's the likely offender for the 'copy-to-stdout'.
>
>> I've removed it from the nightly run. I'll see if I have any better
>> luck with this run. I'm still not sure about the best way to debug the
>> issue, though. Any pointers would be appreciated.
>
> Well, given that this takes so long, adding PID and time of day to your
> log_line_prefix would probably be a smart idea.  But at this point it
> sounds like what failed was the cron job.  Are you sure that the
> pg_restore didn't finish just fine?  If it didn't, what's the evidence?
>
>                        regards, tom lane



--

/* insert witty comment here */


Re: Problem with multi-job pg_restore

From
"Kevin Grittner"
Date:
Brian Weaver <cmdrclueless@gmail.com> wrote:
> The restore appears to have finished without a problem. The issue
> I have is that an instance of postgres is still active in a COPY
> state after the restore. The process is running full tilt, almost
> as if it's stuck in a tight loop.
It's probably dumping all that new data you just restored to ...
somewhere.  You'll need to review the crontab job (or maybe ps
output) to figure out where.
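
For instance, assuming the nightly script runs pg_dump, something like

# ps -o pid,ppid,etime,args -C pg_dump

should show whether the dump client is still attached and how it was
invoked.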
-Kevin


Re: Problem with multi-job pg_restore

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Brian Weaver <cmdrclueless@gmail.com> wrote:
>> The restore appears to have finished without a problem. The issue
>> I have is that an instance of postgres is still active in a COPY
>> state after the restore. The process is running full tilt, almost
>> as if it's stuck in a tight loop.
> It's probably dumping all that new data you just restored to ...
> somewhere.  You'll need to review the crontab job (or maybe ps
> output) to figure out where.

Or even more likely, it's reading all that data and deciding that
it doesn't need to output it because it's not visible to its snapshot;
that could look like a pretty tight loop from outside.  If the connected
cron job has gone away, I'd suggest just SIGINT'ing the COPY job.
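
If it's easier, pg_cancel_backend() does the same thing (sends SIGINT)
from SQL; on 9.1 something like this should work:

SELECT pg_cancel_backend(procpid)
  FROM pg_stat_activity
 WHERE current_query LIKE 'COPY%';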
        regards, tom lane