Thread: pg_restore takes ages
Hi,

I use pg_restore to load a previously dumped database (10 million records). The load of the data runs quite fast, but creating the triggers for the foreign keys takes forever.

Isn't there a faster way? After all, the triggers in the source database already made sure the data was clean.

Thanks,
Alex
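For context, the round trip Alex is describing typically looks something like the commands below (the database and file names are only illustrative). The data sections load quickly; the slow part is the index and foreign-key work pg_restore does at the end:

    # on the source machine: custom-format dump
    pg_dump -Fc alexdb > alexdb.dump

    # on the target machine: create an empty database and restore into it
    createdb alexdb
    pg_restore -d alexdb alexdb.dump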
Alex wrote:

> I use pg_restore to load a previously dumped database (10 million
> records). The load of the data runs quite fast, but creating the
> triggers for the foreign keys takes forever.
>
> Isn't there a faster way? After all, the triggers in the source
> database already made sure the data was clean.

You can try creating the indexes/triggers first and then loading the data. Either way it will take a while before you get a usable database, so see what works faster for you.

Personally, I was in a situation where postgresql was hogging space while creating an index on a table that had 81M rows and a 3GB disk footprint. I dropped the table and recreated it, this time creating the index before loading the data. The loading was slow with this approach, but it finished in 3 hours, and I had an up-to-date index as well. I just had to run vacuum over it.

Take your pick.

Shridhar
>>>>> "SD" == Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes: SD> You can try creating index/triggers first and load the data. At the SD> end it will take a while before you get a usable database with either SD> approach but see what works faster for you. The triggers and FK's don't do much at the time they are created. They work upon update/insert/delete of data. SD> footprint. I dropped the table and recreated it. Also created index SD> before loading data. The loading was slow with this approach but it SD> finished in 3 hours. And I had an updated index as well. Just had to SD> run vacuum over it. I cannot believe that this was faster than load data followed by create index. Perhaps you needed to bump sort_mem so the index could be created more efficiently. I also find that bumping up checkpoint_segments to a high number speeds things up considerably. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Vivek Khera, Ph.D. Khera Communications, Inc. Internet: khera@kciLink.com Rockville, MD +1-240-453-8497 AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/
Vivek Khera wrote:

> SD> ... a 3GB disk footprint. I dropped the table and recreated it, this
> SD> time creating the index before loading the data. The loading was slow
> SD> with this approach, but it finished in 3 hours, and I had an
> SD> up-to-date index as well. I just had to run vacuum over it.
>
> I cannot believe that this was faster than loading the data followed by
> CREATE INDEX. Perhaps you needed to bump sort_mem so the index could be
> created more efficiently. I also find that bumping checkpoint_segments
> up to a high number speeds things up considerably.

Well, in my case speed wasn't the issue. I put $PGDATA on a 12/13GB partition and loaded 3GB of table. When I went to create the index, it ran out of the rest of the free space, which was close to 9GB. Actually, I killed it: when it started there was 9GB free, and when I killed it only 150MB was left. Oracle had the same problem; with tablespaces set to auto-extend it ate a huge amount of space. I posted this earlier and Tom remarked much the same, saying that it should be the same one way or the other.

Anyway, the project abandoned the database altogether and went to in-memory structures. :-)

Shridhar
On Fri, 3 Oct 2003, Vivek Khera wrote:

> >>>>> "SD" == Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
>
> SD> You can try creating the indexes/triggers first and then loading the
> SD> data. Either way it will take a while before you get a usable
> SD> database, so see what works faster for you.
>
> The triggers and FKs don't do much at the time they are created; they
> do their work upon update/insert/delete of data.
>
> SD> ... I dropped the table and recreated it, this time creating the
> SD> index before loading the data. The loading was slow with this
> SD> approach, but it finished in 3 hours, and I had an up-to-date index
> SD> as well. I just had to run vacuum over it.
>
> I cannot believe that this was faster than loading the data followed by
> CREATE INDEX. Perhaps you needed to bump sort_mem so the index could be
> created more efficiently. I also find that bumping checkpoint_segments
> up to a high number speeds things up considerably.

Also, turning off fsync during the load helps a lot. Be sure to turn it back on when you're done, of course.
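For anyone following along, fsync is a postgresql.conf setting; a minimal sketch of the toggle scott describes, assuming you can afford to re-run the load if the machine crashes in the middle (the data directory path is illustrative, and depending on the version a full restart may be needed instead of a reload):

    # postgresql.conf, before the bulk load
    fsync = false

    # tell the postmaster to pick up the change
    pg_ctl reload -D /var/lib/pgsql/data

    # ... run the restore, then set fsync = true and reload again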
On Fri, Oct 03, 2003 at 01:06:26PM -0600, scott.marlowe wrote:

> Also, turning off fsync during the load helps a lot. Be sure to turn it
> back on when you're done, of course.

I'm not sure I understand why this is so. If I turn fsync off, it means that I won't force the kernel to write the WAL to disk, but it will have to be written eventually. If you have tons of RAM it may well be that the kernel will just keep the dirty buffers in RAM, but if not, there should not be any difference. Am I missing something?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Research is what I'm doing when I don't know what I'm doing."
(Wernher von Braun)
>>>>> "sm" == scott marlowe <scott.marlowe> writes: sm> also, turning off fsync during the load helps a lot. Be sure to turn it sm> back on when you're done of course. Only if you don't have a battery-backed cache on your RAID. If you do, it won't likely make a big difference. For me it was about 2 seconds over a 4-hour restore.
On Fri, 3 Oct 2003, Alvaro Herrera wrote:

> On Fri, Oct 03, 2003 at 01:06:26PM -0600, scott.marlowe wrote:
>
> > Also, turning off fsync during the load helps a lot. Be sure to turn
> > it back on when you're done, of course.
>
> I'm not sure I understand why this is so. If I turn fsync off, it means
> that I won't force the kernel to write the WAL to disk, but it will
> have to be written eventually. If you have tons of RAM it may well be
> that the kernel will just keep the dirty buffers in RAM, but if not,
> there should not be any difference. Am I missing something?

Yes, you are. Basically, with fsync on, things have to happen in order, i.e.:

  write to the WAL what you're gonna do; WAIT for confirmation of the write
  write the tuples out; wait for confirmation
  checkpoint the WAL; wait for confirmation

Notice the "wait for confirmation" above. Without fsync there's no wait: you just write it all out at once and hope the machine / database doesn't ever crash in the middle of a transaction.

Give it a try: turn off fsync, run pgbench -c 4 -t 1000, then turn it back on and see how much it slows down. Pull the plug while transactions are running with fsync on, and your machine, assuming it has a metadata-journaling file system, will come right back; postgresql will replay the WAL files and you'll have a nice, consistent database. Turn off fsync, initiate a bunch of transactions, pull the plug, and watch your corrupted database refuse to start back up.

Note that if you're running on IDE drives, you probably already ARE running with fsync effectively off if write caching is enabled, so you'll need to turn the caching off (hdparm -W0 /dev/hdx on Linux) to ensure fsync actually works.
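The experiment scott describes can be reproduced with a handful of commands; the device and database names below are only illustrative, and the hdparm step assumes Linux with an IDE drive:

    # make sure the drive's write cache isn't silently absorbing the fsyncs
    hdparm -W0 /dev/hda

    # set up pgbench's tables once, then take a baseline with fsync on
    pgbench -i testdb
    pgbench -c 4 -t 1000 testdb

    # set fsync = false in postgresql.conf, reload/restart, and rerun
    pgbench -c 4 -t 1000 testdb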
On Fri, 3 Oct 2003, Vivek Khera wrote:

> sm> Also, turning off fsync during the load helps a lot. Be sure to
> sm> turn it back on when you're done, of course.
>
> Only if you don't have a battery-backed cache on your RAID. If you do,
> it won't likely make a big difference. For me it was about 2 seconds
> over a 4-hour restore.

True, very true. Have you done the "pull the plug" test on it to make sure it really works, by the way?
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Yes, you are. Basically, with fsync on, things have to happen in order. > I.e. > write to WAL what you're gonna do. WAIT for confirmation on write > write the tuples out. wait for confirmation > checkpoint the WAL. wait for confirmation Not really. With fsync on, we *only* sync the WAL writes. Data writes can happen whenever, so long as we know the corresponding WAL writes went down first. We only wait for data writes to complete before considering that a checkpoint is complete --- which is something that is not in the main line of execution and doesn't block other activity. This is one good reason for keeping WAL on a separate drive from the data files --- you are then freeing the system to schedule data I/O as optimally as it can. > Note that if you're running on IDE drives, you already ARE probably > running with fsync off if write caching is enabled, so you'll need to turn > it off (hdparm -W0 /dev/hdx in linux) to ensure fsync actually works. It'd be interesting to think about whether a write-caching IDE drive could safely be used for data storage, if WAL is elsewhere. Right offhand I think the only problem is how to know when it's safe to consider a checkpoint complete. Maybe all that would be needed is a long enough time delay after issuing sync(2) in the checkpoint code. Do these drives guarantee "data will be written within 30 seconds" or something like that? Or can the delay be indefinite when load is heavy? regards, tom lane
On Fri, 3 Oct 2003, Tom Lane wrote:

> Not really. With fsync on, we *only* sync the WAL writes. Data writes
> can happen whenever, so long as we know the corresponding WAL writes
> went down first. We only wait for data writes to complete before
> considering that a checkpoint is complete --- which is something that
> is not in the main line of execution and doesn't block other activity.
>
> This is one good reason for keeping the WAL on a separate drive from
> the data files --- you are then freeing the system to schedule data
> I/O as optimally as it can.

Oh, OK. That's why the WAL is such a choke point.

> It'd be interesting to think about whether a write-caching IDE drive
> could safely be used for data storage, if the WAL is elsewhere.

Well, I just so happen to have a machine with two drives in it. I'll get back to you on that.

> Right offhand I think the only problem is how to know when it's safe
> to consider a checkpoint complete. Maybe all that would be needed is
> a long enough time delay after issuing sync(2) in the checkpoint code.
> Do these drives guarantee "data will be written within 30 seconds" or
> something like that? Or can the delay be indefinite when the load is
> heavy?

I don't know. My guess is that they probably don't wait too long; they just wait until they have enough to write to make it "worth their while", i.e. they don't write 8k at a time, they write 64k or 512k at a time. Now that many drives have 8 MB of cache, it would make sense to group writes to one area together.

I'll test this out. Do you think "pgbench -c 50 -t 100000000" is enough thrash, or do I need more connections at once? The machine I'm on has 2 gigs of RAM, so it can probably open several hundred connections, but the time to bring the database back up by replaying the WAL with hundreds of concurrent transactions is gonna be a bit.
On Fri, 3 Oct 2003, scott.marlowe wrote:

> On Fri, 3 Oct 2003, Tom Lane wrote:
>
> > It'd be interesting to think about whether a write-caching IDE drive
> > could safely be used for data storage, if the WAL is elsewhere.
>
> Well, I just so happen to have a machine with two drives in it. I'll
> get back to you on that.

OK, I just tested it. I put pg_xlog and pg_clog on a drive that was set to write-cache disabled, and left the data on a drive where caching was enabled. The tps for a pgbench -c 5 -t 500 on the single drive was 45 to 55. With pg_[xc]log moved to the other drive and all, I got up to 108 tps. About double the performance, as you'd expect. I didn't test the data drive with write caching disabled, but my guess is it wouldn't be any slower, since pgsql doesn't wait on the rest.

I pulled the plug three times, and all three times the database came up in recovery mode and successfully recovered. I didn't bother testing whether write caching would corrupt it, as I'm pretty sure it would; it certainly did when everything was on one drive.

Would you like to try some kind of WAL patch out on it while I've got it for testing? I'd be glad to torture that poor little box some more if you're in the mood and the beta period is winding down. It's running 7.4 beta3, by the way.
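For reference, the per-drive cache settings used in this test can be toggled with hdparm; the device names below are only a guess at scott's two-drive layout:

    # WAL drive (pg_xlog, pg_clog): write cache off so fsync is honored
    hdparm -W0 /dev/hdb

    # data drive: write cache left on, since only the WAL is fsync'd
    hdparm -W1 /dev/hda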
scott.marlowe wrote:

> OK, I just tested it. I put pg_xlog and pg_clog on a drive that was set
> to write-cache disabled, and left the data on a drive where caching was
> enabled. The tps for a pgbench -c 5 -t 500 on the single drive was 45
> to 55. With pg_[xc]log moved to the other drive and all, I got up to
> 108 tps. About double the performance, as you'd expect. I didn't test
> the data drive with write caching disabled, but my guess is it wouldn't
> be any slower, since pgsql doesn't wait on the rest.
>
> I pulled the plug three times, and all three times the database came up
> in recovery mode and successfully recovered. I didn't bother testing
> whether write caching would corrupt it, as I'm pretty sure it would; it
> certainly did when everything was on one drive.

You would have had to pull the plug between the time the system did a checkpoint (and wrote to the write cache) and before it flushed the write cache to disk --- no idea how you would find that window, but my guess is that if you pulled the plug right after the checkpoint completed, the WAL recovery would fail.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Sat, 4 Oct 2003, Bruce Momjian wrote:

> scott.marlowe wrote:
>
> > OK, I just tested it. I put pg_xlog and pg_clog on a drive that was
> > set to write-cache disabled, and left the data on a drive where
> > caching was enabled. The tps for a pgbench -c 5 -t 500 on the single
> > drive was 45 to 55. With pg_[xc]log moved to the other drive and all,
> > I got up to 108 tps. About double the performance, as you'd expect.
> > I didn't test the data drive with write caching disabled, but my
> > guess is it wouldn't be any slower, since pgsql doesn't wait on the
> > rest.
> >
> > I pulled the plug three times, and all three times the database came
> > up in recovery mode and successfully recovered. I didn't bother
> > testing whether write caching would corrupt it, as I'm pretty sure it
> > would; it certainly did when everything was on one drive.
>
> You would have had to pull the plug between the time the system did a
> checkpoint (and wrote to the write cache) and before it flushed the
> write cache to disk --- no idea how you would find that window, but my
> guess is that if you pulled the plug right after the checkpoint
> completed, the WAL recovery would fail.

I'm not sure what you mean. Are you talking about the failure mode with the write cache enabled? That always failed when I tested it. I was testing with 80 parallel transactions, by the way. I'll try it anyway, just to be sure that it causes the problem I'm expecting it to (i.e. that a write cache enabled on pg_xlog causes database corruption under heavy parallel load when the plug is pulled).
scott.marlowe wrote:

> > You would have had to pull the plug between the time the system did a
> > checkpoint (and wrote to the write cache) and before it flushed the
> > write cache to disk --- no idea how you would find that window, but
> > my guess is that if you pulled the plug right after the checkpoint
> > completed, the WAL recovery would fail.
>
> I'm not sure what you mean. Are you talking about the failure mode with
> the write cache enabled? That always failed when I tested it. I was
> testing with 80 parallel transactions, by the way. I'll try it anyway,
> just to be sure that it causes the problem I'm expecting it to (i.e.
> that a write cache enabled on pg_xlog causes database corruption under
> heavy parallel load when the plug is pulled).

The issue is that a forced write to disk is required for the sync() that is run before the WAL files are recycled. You can get by with the write cache enabled on the data drive as long as the crash doesn't happen in the window between the sync (when the old WAL files are removed) and the data actually making it to the platters. If it does, I don't think the system will recover, or if it does, it will not be consistent.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073