Thread: Backup/dump of huge tables and performance
Hello,

I have a large table whose datafiles are already several GB in size. It is no
longer being used in transactions. I would like to back it up and migrate it
to another server, but there are several considerations:

1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
   considering the size of the table, but the impact on the overall production
   system's performance is not acceptable.

2. I don't know how to estimate the size of the resulting dump files from the
   database files. Even running a SELECT COUNT(*) on the table slows the
   entire system down considerably, so I don't know how many rows there are
   (last I checked it was a couple of million, and by now it's definitely in
   the tens of millions). What happens when I hit Linux's 2GB filesize limit?

I would very much prefer to have the backup files in little 10-20MB chunks,
rather than one humongous dumpfile. I also want to be able to run the backup
without shutting down the service that uses the database. As noted above, this
particular table is no longer being used in transactions.

I figure I could write a C program that declares a cursor and writes the table
out in n-MB or m-thousand-row chunks, with rest pauses in between. Any better
ideas? Would this be a good feature to incorporate into future versions of
pg_dump?

Brian

--
Brian Baquiran <brianb@edsamail.com>
http://www.baquiran.com/  AIM: bbaquiran
Work: +63(2)7182222       Home: +63(2) 9227123
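As an illustration of the chunked-export-with-pauses idea above, here is a
rough shell sketch that uses repeated keyed SELECTs through psql rather than
a single cursor in a C program. Everything in it is an assumption for the
sake of the example: the database name (mydb), the table name (bigtable), a
unique indexed key column (id) that is the first column of the table, the
batch size, the pause length, and the idea that plain unaligned SELECT output
is an acceptable dump format for this data.

    #!/bin/sh
    # Sketch only: dump a big table in fixed-size row batches with rest
    # pauses in between, so the I/O load is spread over a longer period.
    # Assumes a unique, indexed key column "id" that is the first column
    # of the table; each batch starts where the previous one left off, so
    # no transaction has to be held open across the pauses.
    DB=mydb
    TABLE=bigtable
    BATCH=100000        # rows per chunk
    PAUSE=60            # seconds of rest between chunks
    LAST=0              # highest id dumped so far
    CHUNK=0
    while :
    do
        FILE=`printf '%s.%05d.out' $TABLE $CHUNK`
        psql -A -t -q -d $DB -c \
            "SELECT * FROM $TABLE WHERE id > $LAST ORDER BY id LIMIT $BATCH" \
            > $FILE
        test -s $FILE || { rm -f $FILE; break; }   # empty result: all done
        LAST=`tail -1 $FILE | cut -d'|' -f1`       # id is the first '|' field
        CHUNK=`expr $CHUNK + 1`
        sleep $PAUSE
    done

Because each batch restarts from the last key already written out, nothing
holds a transaction open while sleeping, and each chunk file stays small.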
At 02:26 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
> 1. pg_dump or COPY take up a lot of I/O resources. That's not surprising
> considering the size of the table, but the impact on the overall production
> system's performance is not acceptable.

Firstly, you are not using '--inserts', I hope. It is very slow for both
backup and restore. Also, do you know whether the I/O bottleneck is pg_dump
or the backend?

Other than that, I'm not sure what can be done about it; the I/O has to be
done some time. One possibility would be to make pg_dump (optionally) pause
between records, but that seems like a bad idea, especially with tens of
millions of records. Maybe someone who knows more about backend processing
can suggest whether a different kind of retrieval in the COPY command would
help.

> 2. I don't know how to estimate the size of the resulting dump files from
> the database files.

Not very easy, unless you have very similar data in each tuple... and in the
future pg_dump will support compression, so the size will be even harder to
estimate.

> I would very much prefer to have the backup files in little 10-20MB chunks,
> rather than one humongous dumpfile.

Maybe:

    pg_dump | split --bytes=10m

> I also want to be able to run the backup without shutting down the service
> that uses the database.

AFAIK, you don't need to shut it down. Or are you referring to the
performance problems?

> As noted above, this particular table is no longer being used in
> transactions. I figure I could write a C program that declares a cursor and
> writes the table out in n-MB or m-thousand-row chunks, with rest pauses in
> between.

You could always do a file-based backup of the database, restore it somewhere
else, and drop the tables you don't need. Not very elegant, I realize.

> Any better ideas? Would this be a good feature to incorporate into future
> versions of pg_dump?

I'm not sure what should be changed in pg_dump; delaying between records
seems like a bad idea, since the dump runs in a single transaction, and
besides, sleeping while a transaction is open seems evil to me.

I think making multiple files can be done with 'split', so the real issue is
where the I/O problem comes from and how to reduce it. If pg_dump is the
source of the I/O, then I can try to address it, but if the COPY command is
the problem, that needs to be done by someone else...

----------------------------------------------------------------
Philip Warner | Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
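Spelled out as commands, the 'split' suggestion above might look like the
following; the database and table names are placeholders, and the restore
step assumes a plain-text dump that psql can read back:

    # Dump a single table and cut the output into ~10MB pieces; none of
    # the pieces will ever come near the 2GB filesize limit.
    pg_dump -t bigtable mydb | split --bytes=10m - bigtable.dump.

    # split writes bigtable.dump.aa, bigtable.dump.ab, ...  To restore,
    # concatenate the pieces in order and feed them back to psql:
    cat bigtable.dump.* | psql newdb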
Hi Philip,

Philip Warner writes:
> Firstly, you are not using '--inserts', I hope. It is very slow for both
> backup and restore. Also, do you know whether the I/O bottleneck is pg_dump
> or the backend?

Nope, I don't use --inserts. The bottleneck seems to be at the backend, since
it also happens with plain COPY.

> You could always do a file-based backup of the database, restore it
> somewhere else, and drop the tables you don't need. Not very elegant, I
> realize.

What file-based backup are you referring to? From reading previous posts, I'd
inferred that I couldn't just copy my /var/lib/pgsql over to another server
running Postgres and start postmaster.

> I'm not sure what should be changed in pg_dump; delaying between records
> seems like a bad idea, since the dump runs in a single transaction, and
> besides, sleeping while a transaction is open seems evil to me.

I suppose it's a special case, then. I just need to move the table, and I'd
rather have the I/O load spread over a longer period of time to soften the
impact on the rest of the system.

And thanks for suggesting "split". I had forgotten it existed and was about
to write it from scratch.

Brian

--
Brian Baquiran <brianb@edsamail.com>
http://www.baquiran.com/  AIM: bbaquiran
Work: +63(2)7182222       Home: +63(2) 9227123

I'm smarter than average. Therefore, average, to me, seems kind of stupid.
People weren't purposely being stupid. It just came naturally.
                                            -- Bruce "Tog" Tognazzini
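For what it's worth, the file-based route Philip mentions would look roughly
like the sketch below. It only works if the whole cluster is shut down (or
otherwise guaranteed quiescent) for the duration of the copy and the
destination runs the same PostgreSQL version; the paths and filenames are
placeholders.

    # File-level copy of the whole cluster, as one (inelegant) way to move
    # the data.  The postmaster must be stopped first, so every user is off
    # the database while the copy runs.  Assumes PGDATA is set for pg_ctl.
    pg_ctl stop
    tar cf - /var/lib/pgsql | gzip > pgsql-files.tar.gz
    pg_ctl start                      # bring the original back up
    # Copy pgsql-files.tar.gz to the other server, unpack it into the new
    # data directory, start postmaster there, then drop the tables that
    # are not needed.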
At 10:11 28/07/00 GMT, brianb-pggeneral@edsamail.com wrote:
> Nope, I don't use --inserts. The bottleneck seems to be at the backend,
> since it also happens with plain COPY.

I'm not sure you can conclude that, but you are probably right. If possible,
can you use whatever tools you have to see which process *is* doing the I/O?
Just to be sure.

> What file-based backup are you referring to? From reading previous posts,
> I'd inferred that I couldn't just copy my /var/lib/pgsql over to another
> server running Postgres and start postmaster.

You probably can't/shouldn't. To make it work, I suspect you'd have to kick
all the users off the DB, which is not going to help.

> I suppose it's a special case, then. I just need to move the table, and I'd
> rather have the I/O load spread over a longer period of time to soften the
> impact on the rest of the system.

It might bear thinking about... especially if anyone else has an opinion on
the options.

----------------------------------------------------------------
Philip Warner | Albatross Consulting Pty. Ltd. (A.C.N. 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371
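To answer the "which process is doing the I/O" question, a couple of stock
Linux tools are usually enough. Nothing here is PostgreSQL-specific; the
process names are simply what to watch for.

    # System-wide view every 5 seconds: the 'bi'/'bo' columns are blocks
    # read from / written to disk per second.
    vmstat 5

    # Per-process hint: processes in uninterruptible sleep (state 'D') are
    # usually the ones waiting on disk.  Watch whether it is the pg_dump
    # client or the postgres backend serving it that sits in 'D'.
    ps -eo pid,stat,args | egrep 'postgres|pg_dump' | grep -v egrep

    # Or just watch them interactively:
    top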
Have you tried playing with the priority of the process? If you gave it a
high nice value, that would lower its impact on the other processes and not
drag the whole system down as much.

my 2¢
Matt

> -----Original Message-----
> From: Philip Warner [SMTP:pjw@rhyme.com.au]
> Sent: Friday, July 28, 2000 5:51 AM
> To: brianb-pggeneral@edsamail.com
> Cc: pgsql-general@postgresql.org; paul@edsamail.com
> Subject: Re: [GENERAL] Backup/dump of huge tables and performance
>
> It might bear thinking about... especially if anyone else has an opinion
> on the options.
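In concrete terms, the nice/renice suggestion might look like the commands
below; the PID and the table/database names are placeholders. Note that nice
only affects CPU scheduling, so if the bottleneck really is disk I/O rather
than CPU, it will help at most indirectly.

    # Start the dump itself at the lowest CPU priority:
    nice -n 19 pg_dump -t bigtable mydb | split --bytes=10m - bigtable.dump.

    # Or lower the priority of an already-running backend; 12345 stands in
    # for the PID of the backend doing the COPY (run this as root or as the
    # user that owns the backend process):
    renice 19 -p 12345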