Thread: hundreds of millions row dBs
Hello all,
I am interested in using PostgreSQL for a DB of hundreds of millions of rows in several tables. The COPY command seems to be way too slow. Is there any bulk import program similar to Oracle's SQL*Loader for PostgreSQL?
Doug Greer
Greer, Doug wrote:
> Hello all,
> I am interested in using PostgreSQL for a DB of hundreds of millions
> of rows in several tables. The COPY command seems to be way too slow.
> Is there any bulk import program similar to Oracle's SQL*Loader for
> PostgreSQL?
> Sincerely,
> Doug Greer

We're getting about 64 million rows inserted in about 1.5 hrs into a table with a multiple-column primary key - that's the only index. That seems pretty good to me - SQL*Loader takes about 4 hrs to do the same job.

--
Guy Rouillier
"Guy Rouillier" <guyr@masergy.com> writes: > Greer, Doug wrote: >> I am interested in using Postgresql for a dB of hundreds of >> millions of rows in several tables. The COPY command seems to be way >> too slow. Is there any bulk import program similar to Oracle's SQL >> loader for Postgresql? Sincerely, > We're getting about 64 million rows inserted in about 1.5 hrs into a > table with a multiple-column primary key - that's the only index. > That's seems pretty good to me - SQL Loader takes about 4 hrs to do the > same job. If you're talking about loading into an initially empty database, it's worth a try to load into bare tables and then create indexes and add foreign key constraints. Index build and FK checking are both significantly faster as "bulk" operations than "incremental". Don't forget to pump up sort_mem as much as you can stand in the backend doing such chores, too. I have heard of people who would actually drop and recreate indexes and/or FKs when adding a lot of data to an existing table. regards, tom lane
> We're getting about 64 million rows inserted in about 1.5 hrs into a
> table with a multiple-column primary key - that's the only index.
> That seems pretty good to me - SQL*Loader takes about 4 hrs to do the
> same job.

As I recall, the last time we rebuilt our database, it took about 3 hours to import 265 million rows of data. It then took another 16 hours to rebuild all the indexes. I think the entire pg_dumpall/reload process took about 21 hours +/-.

I wonder what it will be like with 1.5 billion rows...

Wes
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Tuesday, January 04, 2005 8:59 AM
To: Guy Rouillier; pgsql-general@postgresql.org; Greer, Doug [NTK]
Subject: Re: [GENERAL] hundreds of millions row dBs

> We're getting about 64 million rows inserted in about 1.5 hrs into a
> table with a multiple-column primary key - that's the only index.
> That seems pretty good to me - SQL*Loader takes about 4 hrs to do the
> same job.

As I recall, the last time we rebuilt our database, it took about 3 hours to import 265 million rows of data.

>> That's about 24,537 rows per second. <<

It then took another 16 hours to rebuild all the indexes. I think the entire pg_dumpall/reload process took about 21 hours +/-. I wonder what it will be like with 1.5 billion rows...

>> The load will probably scale roughly linearly, so you could just multiply by 5.66 to get about 17 hours to load. Building the indexes is likely to be at least n*log(n) and maybe even n^2 - for sure, it would take a whole weekend.

Here is an instance where a really big RAM disk might be handy. You could create a database on a big RAM disk and load it, then build the indexes. Then shut down the database and move it to hard disk. It might save a few days of effort if you have billions of rows to load. <<
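For anyone tempted by the RAM-disk route, a rough sketch follows. The tmpfs mount point, sizes, and file names are assumptions to adjust for your own platform, and (as Wes points out later in the thread) the whole cluster has to fit in RAM for this to be workable at all:

#!/bin/sh
# Rough sketch of the load-on-a-RAM-disk idea.  The mount point, sizes,
# and file names below are assumptions; adjust for your platform.

# 1. Mount a RAM-backed filesystem big enough for the data plus indexes.
mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk

# 2. Create a scratch cluster on it and start it.
initdb -D /mnt/ramdisk/pgdata
pg_ctl -D /mnt/ramdisk/pgdata -w start

# 3. Load the data and build the indexes entirely in RAM.
createdb mydb
psql -d mydb -f schema_without_indexes.sql
psql -d mydb -c "COPY orders FROM '/data/orders.tsv'"
psql -d mydb -f create_indexes.sql

# 4. Shut down cleanly and copy the whole cluster onto permanent disk.
pg_ctl -D /mnt/ramdisk/pgdata -w stop
cp -a /mnt/ramdisk/pgdata /var/lib/pgsql/data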
Wes <wespvp@syntegra.com> writes:
> As I recall, the last time we rebuilt our database, it took about 3
> hours to import 265 million rows of data. It then took another 16
> hours to rebuild all the indexes.

Out of curiosity, what value of sort_mem were you using?

(In PG 8.0, the sort memory setting used by CREATE INDEX will be maintenance_work_mem, not work_mem, which should help in getting larger values to be used. But in existing releases you usually need to think about a manual tweak.)

			regards, tom lane
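In other words, something like the following in the session doing the index builds; the values and the index name here are only examples:

# Pre-8.0: CREATE INDEX takes its sort memory from sort_mem (value in kB).
psql -d mydb -c "SET sort_mem = 655350; CREATE INDEX orders_customer_idx ON orders (customer_id);"

# 8.0 and later: CREATE INDEX uses maintenance_work_mem instead (also kB).
psql -d mydb -c "SET maintenance_work_mem = 524288; CREATE INDEX orders_customer_idx ON orders (customer_id);"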
"Dann Corbit" <DCorbit@connx.com> writes: > Here is an instance where a really big ram disk might be handy. > You could create a database on a big ram disk and load it, then build > the indexes. > Then shut down the database and move it to hard disk. Actually, if you have a RAM disk, just change the $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp directory on the RAM disk. Should get you pretty much all the win with no need to move stuff around afterwards. You have to be sure the RAM disk is bigger than your biggest index though. regards, tom lane
To speed up the load (a rough sketch of these settings follows below):

- make fewer checkpoints (raise checkpoint_segments and checkpoint_timeout in the config)
- disable fsync (not sure if it really helps)
- keep the source data, the database tables, and the WAL log on three physically different disks
- put the temporary files on a different disk too, or on a RAM disk
- gunzip while restoring, to read less data from the disk

> "Dann Corbit" <DCorbit@connx.com> writes:
>> Here is an instance where a really big RAM disk might be handy.
>> You could create a database on a big RAM disk and load it, then build
>> the indexes.
>> Then shut down the database and move it to hard disk.
>
> Actually, if you have a RAM disk, just change the
> $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp
> directory on the RAM disk. Should get you pretty much all the win with
> no need to move stuff around afterwards.
>
> You have to be sure the RAM disk is bigger than your biggest index
> though.
>
> 			regards, tom lane
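For reference, those knobs live in postgresql.conf and the gunzip-on-the-fly restore is just a pipe; the values below are only illustrative, not recommendations:

#!/bin/sh
# Illustrative postgresql.conf settings for the duration of a bulk load
# (put the old values back, and re-enable fsync, once the load is done):
#
#   checkpoint_segments = 64      # fewer, larger checkpoints
#   checkpoint_timeout  = 1800    # seconds
#   fsync = off                   # risky: only if you can reload from scratch
#   sort_mem = 655350             # kB; maintenance_work_mem on 8.0+

# Restore a compressed plain-text dump without staging the uncompressed
# copy on disk:
gunzip -c mydb.dump.gz | psql -d mydb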
> Out of curiosity, what value of sort_mem were you using?
>
> (In PG 8.0, the sort memory setting used by CREATE INDEX will be
> maintenance_work_mem, not work_mem, which should help in getting larger
> values to be used. But in existing releases you usually need to think
> about a manual tweak.)

Normally it is set to 65535. However, during the load I bump it up to 655350. The system has 2GB of ECC memory.

> Here is an instance where a really big RAM disk might be handy.
> You could create a database on a big RAM disk and load it, then build
> the indexes.

I'm afraid we don't have quite that much RAM... With just under 400 million rows right now, it is 74 GB. That will probably grow to around 300 GB or so before it stabilizes.

> Actually, if you have a RAM disk, just change the
> $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp
> directory on the RAM disk. Should get you pretty much all the win with
> no need to move stuff around afterwards.
>
> You have to be sure the RAM disk is bigger than your biggest index
> though.

Hmm. That's a thought. I expect our largest index will still be bigger than available RAM though. How can I check index sizes?

We already have pg_xlog on a dedicated mirrored disk. Would it help significantly to give pgsql_tmp its own mirrored disk? PGDATA is on an 8-disk hardware RAID 5.

Wes
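One way to check index sizes is straight from the system catalogs; relpages is counted in 8 kB blocks and is only refreshed by VACUUM or ANALYZE, so run one of those first ("mydb" is a placeholder):

# Rough index sizes from pg_class (relpages is in 8 kB blocks).
psql -d mydb -c "
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relkind = 'i'
ORDER BY relpages DESC;"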