Thread: hundreds of millions row dBs
Hello all,
I am interested in using PostgreSQL for a DB of hundreds of millions of rows in several tables. The COPY command seems to be way too slow. Is there any bulk import program similar to Oracle's SQL*Loader for PostgreSQL?
Doug Greer
Greer, Doug wrote:
> Hello all,
> I am interested in using PostgreSQL for a DB of hundreds of millions
> of rows in several tables. The COPY command seems to be way too slow.
> Is there any bulk import program similar to Oracle's SQL*Loader for
> PostgreSQL?
> Sincerely,
> Doug Greer

We're getting about 64 million rows inserted in about 1.5 hrs into a table with a multiple-column primary key - that's the only index. That seems pretty good to me - SQL*Loader takes about 4 hrs to do the same job.

--
Guy Rouillier
"Guy Rouillier" <guyr@masergy.com> writes: > Greer, Doug wrote: >> I am interested in using Postgresql for a dB of hundreds of >> millions of rows in several tables. The COPY command seems to be way >> too slow. Is there any bulk import program similar to Oracle's SQL >> loader for Postgresql? Sincerely, > We're getting about 64 million rows inserted in about 1.5 hrs into a > table with a multiple-column primary key - that's the only index. > That's seems pretty good to me - SQL Loader takes about 4 hrs to do the > same job. If you're talking about loading into an initially empty database, it's worth a try to load into bare tables and then create indexes and add foreign key constraints. Index build and FK checking are both significantly faster as "bulk" operations than "incremental". Don't forget to pump up sort_mem as much as you can stand in the backend doing such chores, too. I have heard of people who would actually drop and recreate indexes and/or FKs when adding a lot of data to an existing table. regards, tom lane
> We're getting about 64 million rows inserted in about 1.5 hrs into a
> table with a multiple-column primary key - that's the only index.
> That seems pretty good to me - SQL*Loader takes about 4 hrs to do the
> same job.

As I recall, the last time we rebuilt our database, it took about 3 hours to import 265 million rows of data. It then took another 16 hours to rebuild all the indexes. I think the entire pg_dumpall/reload process took about 21 hours +/-.

I wonder what it will be like with 1.5 billion rows...

Wes
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Wes
Sent: Tuesday, January 04, 2005 8:59 AM
To: Guy Rouillier; pgsql-general@postgresql.org; Greer, Doug [NTK]
Subject: Re: [GENERAL] hundreds of millions row dBs

> We're getting about 64 million rows inserted in about 1.5 hrs into a
> table with a multiple-column primary key - that's the only index.
> That seems pretty good to me - SQL*Loader takes about 4 hrs to do the
> same job.

As I recall, the last time we rebuilt our database, it took about 3 hours to import 265 million rows of data.

>> That's about 24,537 rows per second. <<

It then took another 16 hours to rebuild all the indexes. I think the entire pg_dumpall/reload process took about 21 hours +/-. I wonder what it will be like with 1.5 billion rows...

>> The load will probably scale roughly linearly, so you could just multiply by 5.66 to get about 17 hours to load. Building the indexes is likely to be at least n*log(n) and maybe even n^2 - for sure, it would take a whole weekend.

Here is an instance where a really big RAM disk might be handy. You could create a database on a big RAM disk and load it, then build the indexes. Then shut down the database and move it to hard disk. It might save a few days of effort if you have billions of rows to load. <<
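For anyone tempted by the RAM-disk route, a rough sketch follows. The tmpfs mount point, sizes, and file names are assumptions to adjust for your own platform, and (as Wes points out later in the thread) the whole cluster has to fit in RAM for this to be workable at all:

#!/bin/sh
# Rough sketch of the load-on-a-RAM-disk idea.  The mount point, sizes,
# and file names below are assumptions; adjust for your platform.

# 1. Mount a RAM-backed filesystem big enough for the data plus indexes.
mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk

# 2. Create a scratch cluster on it and start it.
initdb -D /mnt/ramdisk/pgdata
pg_ctl -D /mnt/ramdisk/pgdata -w start

# 3. Load the data and build the indexes entirely in RAM.
createdb mydb
psql -d mydb -f schema_without_indexes.sql
psql -d mydb -c "COPY orders FROM '/data/orders.tsv'"
psql -d mydb -f create_indexes.sql

# 4. Shut down cleanly and copy the whole cluster onto permanent disk.
pg_ctl -D /mnt/ramdisk/pgdata -w stop
cp -a /mnt/ramdisk/pgdata /var/lib/pgsql/data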
Wes <wespvp@syntegra.com> writes:
> As I recall, the last time we rebuilt our database, it took about 3
> hours to import 265 million rows of data. It then took another 16
> hours to rebuild all the indexes.

Out of curiosity, what value of sort_mem were you using?

(In PG 8.0, the sort memory setting used by CREATE INDEX will be maintenance_work_mem, not work_mem, which should help in getting larger values to be used. But in existing releases you usually need to think about a manual tweak.)

			regards, tom lane
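In other words, something like the following in the session doing the index builds; the values and the index name here are only examples:

# Pre-8.0: CREATE INDEX takes its sort memory from sort_mem (value in kB).
psql -d mydb -c "SET sort_mem = 655350; CREATE INDEX orders_customer_idx ON orders (customer_id);"

# 8.0 and later: CREATE INDEX uses maintenance_work_mem instead (also kB).
psql -d mydb -c "SET maintenance_work_mem = 524288; CREATE INDEX orders_customer_idx ON orders (customer_id);"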
"Dann Corbit" <DCorbit@connx.com> writes: > Here is an instance where a really big ram disk might be handy. > You could create a database on a big ram disk and load it, then build > the indexes. > Then shut down the database and move it to hard disk. Actually, if you have a RAM disk, just change the $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp directory on the RAM disk. Should get you pretty much all the win with no need to move stuff around afterwards. You have to be sure the RAM disk is bigger than your biggest index though. regards, tom lane
To speed up the load (a rough sketch of these settings follows below):

- make fewer checkpoints (raise checkpoint_segments and checkpoint_timeout in the config)
- disable fsync (not sure if it really helps)
- keep the source data, the database tables, and the WAL log on three physically different disks
- put the temporary files on a different disk too, or on a RAM disk
- gunzip while restoring, to read less data from the disk

> "Dann Corbit" <DCorbit@connx.com> writes:
>> Here is an instance where a really big RAM disk might be handy.
>> You could create a database on a big RAM disk and load it, then build
>> the indexes.
>> Then shut down the database and move it to hard disk.
>
> Actually, if you have a RAM disk, just change the
> $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp
> directory on the RAM disk. Should get you pretty much all the win with
> no need to move stuff around afterwards.
>
> You have to be sure the RAM disk is bigger than your biggest index
> though.
>
> 			regards, tom lane
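For reference, those knobs live in postgresql.conf and the gunzip-on-the-fly restore is just a pipe; the values below are only illustrative, not recommendations:

#!/bin/sh
# Illustrative postgresql.conf settings for the duration of a bulk load
# (put the old values back, and re-enable fsync, once the load is done):
#
#   checkpoint_segments = 64      # fewer, larger checkpoints
#   checkpoint_timeout  = 1800    # seconds
#   fsync = off                   # risky: only if you can reload from scratch
#   sort_mem = 655350             # kB; maintenance_work_mem on 8.0+

# Restore a compressed plain-text dump without staging the uncompressed
# copy on disk:
gunzip -c mydb.dump.gz | psql -d mydb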
> Out of curiosity, what value of sort_mem were you using?
>
> (In PG 8.0, the sort memory setting used by CREATE INDEX will be
> maintenance_work_mem, not work_mem, which should help in getting larger
> values to be used. But in existing releases you usually need to think
> about a manual tweak.)

Normally it is set to 65535. However, during the load I bump it up to 655350. The system has 2GB of ECC memory.

> Here is an instance where a really big RAM disk might be handy.
> You could create a database on a big RAM disk and load it, then build
> the indexes.

I'm afraid we don't have quite that much RAM... With just under 400 million rows right now, it is 74 GB. That will probably grow to around 300 GB or so before it stabilizes.

> Actually, if you have a RAM disk, just change the
> $PGDATA/base/nnn/pgsql_tmp subdirectory into a symlink to some temp
> directory on the RAM disk. Should get you pretty much all the win with
> no need to move stuff around afterwards.
>
> You have to be sure the RAM disk is bigger than your biggest index
> though.

Hmm. That's a thought. I expect our largest index will still be bigger than available RAM though. How can I check index sizes?

We already have pg_xlog on a dedicated mirrored disk. Would it help significantly to give pgsql_tmp its own mirrored disk? PGDATA is on an 8-disk hardware RAID 5.

Wes
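One way to check index sizes is straight from the system catalogs; relpages is counted in 8 kB blocks and is only refreshed by VACUUM or ANALYZE, so run one of those first ("mydb" is a placeholder):

# Rough index sizes from pg_class (relpages is in 8 kB blocks).
psql -d mydb -c "
SELECT relname, relpages, relpages * 8 / 1024 AS approx_mb
FROM pg_class
WHERE relkind = 'i'
ORDER BY relpages DESC;"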