Thread: pg_restore takes ages
Hi,

I use pg_restore to load a previously dumped database (10 million records). The load of the data runs quite fast, but creating the triggers for the foreign keys takes forever.

Isn't there a faster way? After all, the triggers in the source database already made sure the data was clean.

Thanks,
Alex
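For context, the round trip Alex is describing typically looks something like the commands below (the database and file names are only illustrative). The data sections load quickly; the slow part is the index and foreign-key work pg_restore does at the end:

    # on the source machine: custom-format dump
    pg_dump -Fc alexdb > alexdb.dump

    # on the target machine: create an empty database and restore into it
    createdb alexdb
    pg_restore -d alexdb alexdb.dump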
Alex wrote:

> I use pg_restore to load a previously dumped database (10 million
> records). The load of the data runs quite fast, but creating the
> triggers for the foreign keys takes forever.
>
> Isn't there a faster way? After all, the triggers in the source
> database already made sure the data was clean.

You can try creating the indexes/triggers first and then loading the data. Either way it will take a while before you get a usable database, so see what works faster for you.

Personally, I was in a situation where postgresql was hogging space while creating an index on a table that had 81M rows and a 3GB disk footprint. I dropped the table and recreated it, this time creating the index before loading the data. The loading was slow with this approach, but it finished in 3 hours, and I had an up-to-date index as well. I just had to run vacuum over it.

Take your pick.

Shridhar
>>>>> "SD" == Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes: SD> You can try creating index/triggers first and load the data. At the SD> end it will take a while before you get a usable database with either SD> approach but see what works faster for you. The triggers and FK's don't do much at the time they are created. They work upon update/insert/delete of data. SD> footprint. I dropped the table and recreated it. Also created index SD> before loading data. The loading was slow with this approach but it SD> finished in 3 hours. And I had an updated index as well. Just had to SD> run vacuum over it. I cannot believe that this was faster than load data followed by create index. Perhaps you needed to bump sort_mem so the index could be created more efficiently. I also find that bumping up checkpoint_segments to a high number speeds things up considerably. -- =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Vivek Khera, Ph.D. Khera Communications, Inc. Internet: khera@kciLink.com Rockville, MD +1-240-453-8497 AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/
Vivek Khera wrote:

> SD> ... a 3GB disk footprint. I dropped the table and recreated it, this
> SD> time creating the index before loading the data. The loading was slow
> SD> with this approach, but it finished in 3 hours, and I had an
> SD> up-to-date index as well. I just had to run vacuum over it.
>
> I cannot believe that this was faster than loading the data followed by
> CREATE INDEX. Perhaps you needed to bump sort_mem so the index could be
> created more efficiently. I also find that bumping checkpoint_segments
> up to a high number speeds things up considerably.

Well, in my case speed wasn't the issue. I put $PGDATA on a 12/13GB partition and loaded 3GB of table. When I went to create the index, it ran out of the rest of the free space, which was close to 9GB. Actually, I killed it: when it started there was 9GB free, and when I killed it only 150MB was left. Oracle had the same problem; with tablespaces set to auto-extend it ate a huge amount of space. I posted this earlier and Tom remarked much the same, saying that it should be the same one way or the other.

Anyway, the project abandoned the database altogether and went to in-memory structures. :-)

Shridhar
On Fri, 3 Oct 2003, Vivek Khera wrote:

> >>>>> "SD" == Shridhar Daithankar <shridhar_daithankar@persistent.co.in> writes:
>
> SD> You can try creating the indexes/triggers first and then loading the
> SD> data. Either way it will take a while before you get a usable
> SD> database, so see what works faster for you.
>
> The triggers and FKs don't do much at the time they are created; they
> do their work upon update/insert/delete of data.
>
> SD> ... I dropped the table and recreated it, this time creating the
> SD> index before loading the data. The loading was slow with this
> SD> approach, but it finished in 3 hours, and I had an up-to-date index
> SD> as well. I just had to run vacuum over it.
>
> I cannot believe that this was faster than loading the data followed by
> CREATE INDEX. Perhaps you needed to bump sort_mem so the index could be
> created more efficiently. I also find that bumping checkpoint_segments
> up to a high number speeds things up considerably.

Also, turning off fsync during the load helps a lot. Be sure to turn it back on when you're done, of course.
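For anyone following along, fsync is a postgresql.conf setting; a minimal sketch of the toggle scott describes, assuming you can afford to re-run the load if the machine crashes in the middle (the data directory path is illustrative, and depending on the version a full restart may be needed instead of a reload):

    # postgresql.conf, before the bulk load
    fsync = false

    # tell the postmaster to pick up the change
    pg_ctl reload -D /var/lib/pgsql/data

    # ... run the restore, then set fsync = true and reload again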
On Fri, Oct 03, 2003 at 01:06:26PM -0600, scott.marlowe wrote:

> Also, turning off fsync during the load helps a lot. Be sure to turn it
> back on when you're done, of course.

I'm not sure I understand why this is so. If I turn fsync off, it means that I won't force the kernel to write the WAL to disk, but it will have to be written eventually. If you have tons of RAM it may well be that the kernel will just keep the dirty buffers in RAM, but if not, there should not be any difference. Am I missing something?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Research is what I'm doing when I don't know what I'm doing."
(Wernher von Braun)
>>>>> "sm" == scott marlowe <scott.marlowe> writes: sm> also, turning off fsync during the load helps a lot. Be sure to turn it sm> back on when you're done of course. Only if you don't have a battery-backed cache on your RAID. If you do, it won't likely make a big difference. For me it was about 2 seconds over a 4-hour restore.
On Fri, 3 Oct 2003, Alvaro Herrera wrote:

> On Fri, Oct 03, 2003 at 01:06:26PM -0600, scott.marlowe wrote:
>
> > Also, turning off fsync during the load helps a lot. Be sure to turn
> > it back on when you're done, of course.
>
> I'm not sure I understand why this is so. If I turn fsync off, it means
> that I won't force the kernel to write the WAL to disk, but it will
> have to be written eventually. If you have tons of RAM it may well be
> that the kernel will just keep the dirty buffers in RAM, but if not,
> there should not be any difference. Am I missing something?

Yes, you are. Basically, with fsync on, things have to happen in order, i.e.:

  write to the WAL what you're gonna do; WAIT for confirmation of the write
  write the tuples out; wait for confirmation
  checkpoint the WAL; wait for confirmation

Notice the "wait for confirmation" above. Without fsync there's no wait: you just write it all out at once and hope the machine / database doesn't ever crash in the middle of a transaction.

Give it a try: turn off fsync, run pgbench -c 4 -t 1000, then turn it back on and see how much it slows down. Pull the plug while transactions are running with fsync on, and your machine, assuming it has a metadata-journaling file system, will come right back; postgresql will replay the WAL files and you'll have a nice, consistent database. Turn off fsync, initiate a bunch of transactions, pull the plug, and watch your corrupted database refuse to start back up.

Note that if you're running on IDE drives, you probably already ARE running with fsync effectively off if write caching is enabled, so you'll need to turn the caching off (hdparm -W0 /dev/hdx on Linux) to ensure fsync actually works.
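The experiment scott describes can be reproduced with a handful of commands; the device and database names below are only illustrative, and the hdparm step assumes Linux with an IDE drive:

    # make sure the drive's write cache isn't silently absorbing the fsyncs
    hdparm -W0 /dev/hda

    # set up pgbench's tables once, then take a baseline with fsync on
    pgbench -i testdb
    pgbench -c 4 -t 1000 testdb

    # set fsync = false in postgresql.conf, reload/restart, and rerun
    pgbench -c 4 -t 1000 testdb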
On Fri, 3 Oct 2003, Vivek Khera wrote:

> sm> Also, turning off fsync during the load helps a lot. Be sure to
> sm> turn it back on when you're done, of course.
>
> Only if you don't have a battery-backed cache on your RAID. If you do,
> it won't likely make a big difference. For me it was about 2 seconds
> over a 4-hour restore.

True, very true. Have you done the "pull the plug" test on it to make sure it really works, by the way?
"scott.marlowe" <scott.marlowe@ihs.com> writes: > Yes, you are. Basically, with fsync on, things have to happen in order. > I.e. > write to WAL what you're gonna do. WAIT for confirmation on write > write the tuples out. wait for confirmation > checkpoint the WAL. wait for confirmation Not really. With fsync on, we *only* sync the WAL writes. Data writes can happen whenever, so long as we know the corresponding WAL writes went down first. We only wait for data writes to complete before considering that a checkpoint is complete --- which is something that is not in the main line of execution and doesn't block other activity. This is one good reason for keeping WAL on a separate drive from the data files --- you are then freeing the system to schedule data I/O as optimally as it can. > Note that if you're running on IDE drives, you already ARE probably > running with fsync off if write caching is enabled, so you'll need to turn > it off (hdparm -W0 /dev/hdx in linux) to ensure fsync actually works. It'd be interesting to think about whether a write-caching IDE drive could safely be used for data storage, if WAL is elsewhere. Right offhand I think the only problem is how to know when it's safe to consider a checkpoint complete. Maybe all that would be needed is a long enough time delay after issuing sync(2) in the checkpoint code. Do these drives guarantee "data will be written within 30 seconds" or something like that? Or can the delay be indefinite when load is heavy? regards, tom lane
On Fri, 3 Oct 2003, Tom Lane wrote:

> Not really. With fsync on, we *only* sync the WAL writes. Data writes
> can happen whenever, so long as we know the corresponding WAL writes
> went down first. We only wait for data writes to complete before
> considering that a checkpoint is complete --- which is something that
> is not in the main line of execution and doesn't block other activity.
>
> This is one good reason for keeping the WAL on a separate drive from
> the data files --- you are then freeing the system to schedule data
> I/O as optimally as it can.

Oh, OK. That's why the WAL is such a choke point.

> It'd be interesting to think about whether a write-caching IDE drive
> could safely be used for data storage, if the WAL is elsewhere.

Well, I just so happen to have a machine with two drives in it. I'll get back to you on that.

> Right offhand I think the only problem is how to know when it's safe
> to consider a checkpoint complete. Maybe all that would be needed is
> a long enough time delay after issuing sync(2) in the checkpoint code.
> Do these drives guarantee "data will be written within 30 seconds" or
> something like that? Or can the delay be indefinite when the load is
> heavy?

I don't know. My guess is that they probably don't wait too long; they just wait until they have enough to write to make it "worth their while", i.e. they don't write 8k at a time, they write 64k or 512k at a time. Now that many drives have 8 MB of cache, it would make sense to group writes to one area together.

I'll test this out. Do you think "pgbench -c 50 -t 100000000" is enough thrash, or do I need more connections at once? The machine I'm on has 2 gigs of RAM, so it can probably open several hundred connections, but the time to bring the database back up by replaying the WAL with hundreds of concurrent transactions is gonna be a bit.
On Fri, 3 Oct 2003, scott.marlowe wrote:

> On Fri, 3 Oct 2003, Tom Lane wrote:
>
> > It'd be interesting to think about whether a write-caching IDE drive
> > could safely be used for data storage, if the WAL is elsewhere.
>
> Well, I just so happen to have a machine with two drives in it. I'll
> get back to you on that.

OK, I just tested it. I put pg_xlog and pg_clog on a drive that was set to write-cache disabled, and left the data on a drive where caching was enabled. The tps for a pgbench -c 5 -t 500 on the single drive was 45 to 55. With pg_[xc]log moved to the other drive and all, I got up to 108 tps. About double the performance, as you'd expect. I didn't test the data drive with write caching disabled, but my guess is it wouldn't be any slower, since pgsql doesn't wait on the rest.

I pulled the plug three times, and all three times the database came up in recovery mode and successfully recovered. I didn't bother testing whether write caching would corrupt it, as I'm pretty sure it would; it certainly did when everything was on one drive.

Would you like to try some kind of WAL patch out on it while I've got it for testing? I'd be glad to torture that poor little box some more if you're in the mood and the beta period is winding down. It's running 7.4 beta3, by the way.
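For reference, the per-drive cache settings used in this test can be toggled with hdparm; the device names below are only a guess at scott's two-drive layout:

    # WAL drive (pg_xlog, pg_clog): write cache off so fsync is honored
    hdparm -W0 /dev/hdb

    # data drive: write cache left on, since only the WAL is fsync'd
    hdparm -W1 /dev/hda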
scott.marlowe wrote:

> OK, I just tested it. I put pg_xlog and pg_clog on a drive that was set
> to write-cache disabled, and left the data on a drive where caching was
> enabled. The tps for a pgbench -c 5 -t 500 on the single drive was 45
> to 55. With pg_[xc]log moved to the other drive and all, I got up to
> 108 tps. About double the performance, as you'd expect. I didn't test
> the data drive with write caching disabled, but my guess is it wouldn't
> be any slower, since pgsql doesn't wait on the rest.
>
> I pulled the plug three times, and all three times the database came up
> in recovery mode and successfully recovered. I didn't bother testing
> whether write caching would corrupt it, as I'm pretty sure it would; it
> certainly did when everything was on one drive.

You would have had to pull the plug between the time the system did a checkpoint (and wrote to the write cache) and before it flushed the write cache to disk --- no idea how you would find that window, but my guess is that if you pulled the plug right after the checkpoint completed, the WAL recovery would fail.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Sat, 4 Oct 2003, Bruce Momjian wrote:

> scott.marlowe wrote:
>
> > OK, I just tested it. I put pg_xlog and pg_clog on a drive that was
> > set to write-cache disabled, and left the data on a drive where
> > caching was enabled. The tps for a pgbench -c 5 -t 500 on the single
> > drive was 45 to 55. With pg_[xc]log moved to the other drive and all,
> > I got up to 108 tps. About double the performance, as you'd expect.
> > I didn't test the data drive with write caching disabled, but my
> > guess is it wouldn't be any slower, since pgsql doesn't wait on the
> > rest.
> >
> > I pulled the plug three times, and all three times the database came
> > up in recovery mode and successfully recovered. I didn't bother
> > testing whether write caching would corrupt it, as I'm pretty sure it
> > would; it certainly did when everything was on one drive.
>
> You would have had to pull the plug between the time the system did a
> checkpoint (and wrote to the write cache) and before it flushed the
> write cache to disk --- no idea how you would find that window, but my
> guess is that if you pulled the plug right after the checkpoint
> completed, the WAL recovery would fail.

I'm not sure what you mean. Are you talking about the failure mode with the write cache enabled? That always failed when I tested it. I was testing with 80 parallel transactions, by the way. I'll try it anyway, just to be sure that it causes the problem I'm expecting it to (i.e. that a write cache enabled on pg_xlog causes database corruption under heavy parallel load when the plug is pulled).
scott.marlowe wrote:

> > You would have had to pull the plug between the time the system did a
> > checkpoint (and wrote to the write cache) and before it flushed the
> > write cache to disk --- no idea how you would find that window, but
> > my guess is that if you pulled the plug right after the checkpoint
> > completed, the WAL recovery would fail.
>
> I'm not sure what you mean. Are you talking about the failure mode with
> the write cache enabled? That always failed when I tested it. I was
> testing with 80 parallel transactions, by the way. I'll try it anyway,
> just to be sure that it causes the problem I'm expecting it to (i.e.
> that a write cache enabled on pg_xlog causes database corruption under
> heavy parallel load when the plug is pulled).

The issue is that a forced write to disk is required for the sync() that is run before the WAL files are recycled. You can get by with the write cache enabled on the data drive as long as the crash doesn't happen in the window between the sync (when the old WAL files are removed) and the data actually making it to the platters. If it does, I don't think the system will recover, or if it does, it will not be consistent.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073