Thread: Backup/dump of huge tables and performance

Backup/dump of huge tables and performance

From: brianb-pggeneral@edsamail.com

Hello,

I have a large table whose datafiles are already several GB in size. It is
no longer being used in transactions. I would like to back it up and migrate
it to another server, but there are several considerations:

1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
considering the size of the tables, but the impact on the overall
production system's performance is not acceptable.

2. I don't know how to estimate the size of the resulting dump files from
the database files. Even running a SELECT COUNT(*) on the table slows the
entire system down considerably, so I don't know exactly how many rows there
are (the last time I checked it was a couple of million, and it's definitely
in the tens of millions by now). What happens when I hit Linux's 2GB filesize
limit?

I would very much prefer to have the backup files in little 10-20MB chunks,
rather than one humongous dumpfile. I also want to be able to run the
backup without shutting down the service that uses the database.

As noted above, this particular table is no longer being used in
transactions. I figure I could write a C program to declare a cursor and
write it out in n-MB or m-thousand row chunks, with rest pauses in between.
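
Something like the following is what I have in mind - sketched here as a shell
script driving psql rather than the C program I'd actually write, with made-up
table name, chunk size and pause:

    #!/bin/sh
    # Rough sketch: dump a big table through a cursor in fixed-size row
    # batches, resting between batches to spread the I/O out over time.
    # Table, database, batch size, chunk count and pause are all made up.
    TABLE=big_table
    DB=mydb
    ROWS_PER_CHUNK=100000   # rows per output file
    CHUNKS=500              # upper bound on number of chunks
    PAUSE=60                # seconds to rest between batches
    SCRIPT=/tmp/chunked_dump.sql

    # Generate a psql script: one transaction, one cursor, many FETCHes,
    # each FETCH redirected to its own output file with \o.
    {
      echo "BEGIN;"
      echo "DECLARE dump_cur CURSOR FOR SELECT * FROM $TABLE;"
      i=1
      while [ "$i" -le "$CHUNKS" ]; do
        printf '\\o chunk_%04d.out\n' "$i"
        echo "FETCH $ROWS_PER_CHUNK FROM dump_cur;"
        printf '%s\n' '\o'
        printf '%s\n' "\\! sleep $PAUSE"
        i=`expr $i + 1`
      done
      echo "CLOSE dump_cur;"
      echo "COMMIT;"
    } > "$SCRIPT"

    # -A -t: unaligned, tuples-only output, so the chunk files look like
    # '|'-separated data rather than formatted query results.
    psql -A -t -d "$DB" -f "$SCRIPT"

It's crude - it holds a single transaction open for the whole run, the output
isn't COPY format, and it keeps sleeping even after the cursor runs dry - but
it shows the shape of what I mean.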

Any better ideas? Would this be a good feature to incorporate into future
versions of pg_dump?

Brian
--
Brian Baquiran <brianb@edsamail.com>
http://www.baquiran.com/  AIM: bbaquiran
Work: +63(2)7182222       Home: +63(2) 9227123

Re: Backup/dump of huge tables and performance

From: Philip Warner

At 02:26 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
>
>1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
>considering the size of the tables, but the impact on the overall
>production system's performance is not acceptable.

Firstly, you are not using '--inserts', I hope. It is very slow for both
backup & restore. Also, do you know if pg_dump is the IO bottleneck, or the
backend?

Other than that, I'm not sure what can be done about it - the I/O has to be
done some time. Possibly causing pg_dump to (optionally) pause between
records, but that seems like a bad idea, especially with 10s of millions of
records.

Maybe someone who knows more about backend processing can suggest whether
using a different kind of retrieval in the COPY command would help.

>2. I don't know how to estimate the size of the resulting dump files from
>the database files.

Not very easy, unless you have very similar data in each tuple...and in the
future pg_dump will support compression, so the size will be even harder to
estimate.


>I would very much prefer to have the backup files in little 10-20MB chunks,
>rather than one humongous dumpfile.

Maybe: pg_dump | split --bytes=10m
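
For example, giving the pieces a prefix so they can be put back together with
cat later (database name and prefix are just placeholders):

    pg_dump mydb | split --bytes=10m - mydb_dump_

    # and later, to load it elsewhere:
    cat mydb_dump_* | psql newdb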

> I also want to be able to run the
>backup without shutting down the service that uses the database.

AFAIK, you don't need to shut it down, or are you referring to the
performance problems?


>As noted above, this particular table is no longer being used in
>transactions. I figure I could write a C program to declare a cursor and
>write it out in n-MB or m-thousand row chunks, with rest pauses in between.

You could always do a file-based backup of the database, and restore it
somewhere else, and drop the tables you don't need. Not very elegant, I
realize.
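
By "file-based" I just mean copying the database directory itself, roughly
along these lines - the path is a guess for a typical Linux install, and the
postmaster has to be stopped while the copy is taken, which may rule it out
for you:

    # on the source machine, with the postmaster shut down
    tar cf - /var/lib/pgsql/data | gzip > pgdata.tar.gz

    # a copy like this can only be restored under the same PostgreSQL
    # version and platform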


>Any better ideas? Would this be a good feature to incorporate into future
>versions of pg_dump?

I'm not sure what should be changed in pg_dump; delaying between records
seems like a bad idea since it does the dump in a single TX, and besides,
sleeping while a TX is open seems evil to me.

I think making multiple files can be done by 'split', so the real issue is
where the IO problem comes from, and how to reduce it. If pg_dump is the
source of the I/O, then I can try to address it, but if the COPY command is
the problem, that needs to be done by someone else...


----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

Re: Backup/dump of huge tables and performance

From: brianb-pggeneral@edsamail.com

Hi Philip,

Philip Warner writes:
> At 02:26 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
> >1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
> >considering the size of the tables, but the impact on the overall
> >production system's performance is not acceptable.
>
> Firstly, you are not using '--inserts', I hope. It is very slow for both
> backup & restore. Also, do you know if pg_dump is the IO bottleneck, or the
> backend?

Nope, I don't use --inserts. The bottleneck seems to be at the backend,
since it also happens with COPY.

> Other than that, I'm not sure what can be done about it - the I/O has to be
> done some time. Possibly causing pg_dump to (optionally) pause between
> records, but that seems like a bad idea, especially with 10s of millions of
> records.

> You could always do a file-based backup of the database, and restore it
> somewhere else, and drop the tables you don't need. Not very elegant, I
> realize.

What file-based backup are you referring to? From reading previous posts,
I'd inferred that I couldn't just copy my /var/lib/pgsql over to another
server with Postgres and start postmaster.

> >Any better ideas? Would this be a good feature to incorporate into future
> >versions of pg_dump?
>
> I'm not sure what should be changed in pg_dump; delaying between records
> seems like a bad idea since it does the dump in a single TX, and besides,
> sleeping while a TX is open seems evil to me.

I suppose it's a special case, then. I just need to move the table, and I'd
rather have the I/O load spread over a longer period of time to soften the
impact on the rest of the system.

> I think making multiple files can be done by 'split', so the real issue is
> where the IO problem comes from, and how to reduce it. If pg_dump is the
> source of the I/O, then I can try to address it, but if the COPY command is
> the problem, that needs to be done by someone else...

And thanks for suggesting "split". I had forgotten it existed and was about
to write it from scratch.

Brian
--
Brian Baquiran <brianb@edsamail.com>
http://www.baquiran.com/  AIM: bbaquiran
Work: +63(2)7182222       Home: +63(2) 9227123

I'm smarter than average. Therefore, average, to me, seems kind of stupid.
People weren't purposely being stupid. It just came naturally.
                              -- Bruce "Tog" Tognazzini

Re: Backup/dump of huge tables and performance

From: Philip Warner

At 10:11 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
>
>Philip Warner writes:
>> At 02:26 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
>> >1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
>> >considering the size of the tables, but the impact on the overall
>> >production system's performance is not acceptable.
>>
>> Firstly, you are not using '--inserts', I hope. It is very slow for both
>> backup & restore. Also, do you know if pg_dump is the IO bottleneck, or the
>> backend?
>
>Nope, I don't use --inserts. The bottleneck seems to be at the backend,
>since it also happens with COPY.

I'm not sure you can conclude that, but you are probably right. If possible,
can you use whatever tools you have to see which process *is* doing the I/O?
Just to be sure.
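
Nothing fancy - something along these lines while the dump is running, just
to see who is actually busy:

    vmstat 5    # overall block I/O in the bi/bo columns
    top         # which process is working hardest / stuck in "D" (disk wait)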


>> Other than that, I'm not sure what can be done about it - the I/O has to be
>> done some time. Possibly causing pg_dump to (optionally) pause between
>> records, but that seems like a bad idea, especially with 10s of millions of
>> records.
>
>> You could always do a file-based backup of the database, and restore it
>> somewhere else, and drop the tables you don't need. Not very elegant, I
>> realize.
>
>What file-based backup are you referring to? From reading previous posts,
>I'd inferred that I couldn't just copy my /var/lib/pgsql over to another
>server with Postgres and start postmaster.

You probably can't/shouldn't. To make it work, I suspect you'd have to kick
all the users off the DB, which is not going to help.


>>
>> I'm not sure what should be changed in pg_dump; delaying between records
>> seems like a bad idea since it does the dump in a single TX, and besides,
>> sleeping while a TX is open seems evil to me.
>
>I suppose it's a special case, then. I just need to move the table, and I'd
>rather have the I/O load spread over a longer period of time to soften the
>impact on the rest of the system.

It might bear thinking about...especially if anyone else has an opinion on
the options.

----------------------------------------------------------------
Philip Warner                    |     __---_____
Albatross Consulting Pty. Ltd.   |----/       -  \
(A.C.N. 008 659 498)             |          /(@)   ______---_
Tel: (+61) 0500 83 82 81         |                 _________  \
Fax: (+61) 0500 83 82 82         |                 ___________ |
Http://www.rhyme.com.au          |                /           \|
                                 |    --________--
PGP key available upon request,  |  /
and from pgp5.ai.mit.edu:11371   |/

RE: Backup/dump of huge tables and performance

From: Matthew

Have you tried playing with the priority of the process? If you gave it a
high nice value, that would lower its impact on the other processes and not
drag the system down as much.
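
Something along these lines, maybe (database name, file prefix and PID are
placeholders):

    nice -n 19 pg_dump mydb | split --bytes=10m - mydb_dump_

    # nice only changes pg_dump's own CPU priority; the backend doing the
    # actual reads is a separate process, so it may need renicing too once
    # you know its PID:
    renice +19 -p 12345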

my 2¢

Matt

> -----Original Message-----
> From:    Philip Warner [SMTP:pjw@rhyme.com.au]
> Sent:    Friday, July 28, 2000 5:51 AM
> To:    brianb-pggeneral@edsamail.com
> Cc:    pgsql-general@postgresql.org; paul@edsamail.com
> Subject:    Re: [GENERAL] Backup/dump of huge tables and performance
>