Thread: tuning Postgres for large data import (using Copy from)


From: "Marc Mamin"
Hello,


I'd like to tune Postgres for large data import (using Copy from).


Here are a few steps already done:



1) use 3 different disks for:

    -1: source data
    -2: index tablespaces
    -3: data tablespaces


2) define all foreign keys as initially deferred

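   i.e. something like this (table and constraint names are just placeholders):

```sql
ALTER TABLE import_data
    ADD CONSTRAINT import_data_parent_fk
    FOREIGN KEY (parent_id) REFERENCES parent (id)
    DEFERRABLE INITIALLY DEFERRED;
```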

3) tune some parameters:



    max_connections = 20
    shared_buffers = 30000
    work_mem = 8192
    maintenance_work_mem = 32768
    checkpoint_segments = 12

    (I also modified the kernel accordingly)




4) run VACUUM regularly


The server runs Red Hat and has 1 GB of RAM.

In production (which may run on a better server), I plan to:

- import a few million rows per day,
- keep up to ca. 100 million rows in the db,
- delete older data.




I've seen a few postings on hash/btree indexes which say that hash indexes do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performance by using
hash indexes as well?

How does Postgres handle concurrent COPY FROM on the same table / different
tables?


I'd be glad of any further suggestions on how to increase
performance.




Marc





Re: tuning Postgres for large data import (using Copy from)

From: Tom Lane
"Marc Mamin" <m.mamin@gmx.net> writes:
> 1) use 3 different disks for:

>     -1: source data
>     -2: index tablespaces
>     -3: data tablespaces

It's probably much more important to know where you put the WAL.

            regards, tom lane

Re: tuning Postgres for large data import (using Copy from)

From: John A Meinel
Marc Mamin wrote:
> Hello,
>

I'm not an expert, but I'll give some suggestions.

>
> I'd like to tune Postgres for large data import (using Copy from).
>

I believe that COPY FROM <file> is supposed to be faster than COPY FROM
STDIN, but <file> must be available to the backend process. If you can
do it, you should think about it, as it eliminates the communication
between the client and the backend.
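To illustrate the two forms (table and path are made up):

```sql
-- runs in the server backend; the file must be readable by the postgres user
COPY mytable FROM '/data/import/mytable.dat';

-- goes through the client connection instead (this is what psql's \copy uses)
COPY mytable FROM STDIN;
```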

>
> here are a few steps already done:
>
>
>
> 1) use 3 different disks for:
>
>     -1: source data
>     -2: index tablespaces
>     -3: data tablespaces
>

Make sure pg_xlog is on its own filesystem. It contains the
write-ahead log, and putting it by itself keeps the number of seeks
down. If you are constrained, I think pg_xlog is more important than
moving the index tablespaces.

>
> 2) define all foreign keys as initially deferred
>
>
> 3) tune some parameters:
>
>
>
>     max_connections = 20
>     shared_buffers = 30000
>     work_mem = 8192
>     maintenance_work_mem = 32768
>     checkpoint_segments = 12
>
>     (I also modified the kernel accordingly)
>

Don't forget to increase your free space map if you are going to be
doing deletes frequently.
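Something along these lines in postgresql.conf; the numbers are only a
starting point and depend on how many pages each vacuum cycle frees:

```
max_fsm_pages = 500000      # slots for tracking pages freed by deletes
max_fsm_relations = 1000    # number of relations tracked in the FSM
```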

>
>
>
> 4) run VACUUM regularly
>
>
> The server runs RedHat and has 1GB RAM
>
> In the production (which may run on a better server), I plan to:
>
> - import a few million rows per day,
> - keep up to ca. 100 million rows in the db,
> - delete older data.
>
>
>
>
> I've seen a few postings on hash/btree indexes which say that hash indexes do
> not work very well in Postgres;
> currently, I only use btree indexes. Could I gain performance by using
> hash indexes as well?
>
I doubt it.

> How does Postgres handle concurrent COPY FROM on the same table / different
> tables?
>

I think it is better with different tables. If you use the same table and
there are indexes, each process has to grab a lock for updating the index,
which causes contention between two processes writing to the same table.

>
> I'd be glad of any further suggestions on how to increase
> performance.
>

Since you are deleting data often, and copying often, I might recommend
using a partition scheme with a view to bind everything together. That
way you can just drop the old table rather than doing a delete. I don't
know how this would affect foreign key references.

But basically you can create a new table, and do a copy without having
any indexes, then build the indexes, analyze, update the view.

And when deleting you can update the view, and drop the old table.

Something like this:

CREATE TABLE table_2005_05_11 (blah);
COPY table_2005_05_11 FROM ... ;
CREATE INDEX blah ON table_2005_05_11 (blah);
CREATE OR REPLACE VIEW table AS
    SELECT * FROM table_2005_05_10
    UNION ALL SELECT * FROM table_2005_05_11;
VACUUM ANALYZE table_2005_05_11;
...
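
And the deletion side, continuing the same naming scheme:

```sql
-- drop the old day from the view first, then drop its table
CREATE OR REPLACE VIEW table AS
    SELECT * FROM table_2005_05_11;
DROP TABLE table_2005_05_10;
```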

John
=:->



Re: tuning Postgres for large data import (using Copy from)

From: Josh Berkus
Marc,

> 1) use 3 different disks for:
>
>     -1: source data
>     -2: index tablespaces
>     -3: data tablespaces

Others have already told you about the importance of relocating WAL.  If you
are going to be building indexes on the imported data, you might find it
beneficial to relocate pgsql_tmp for the database in question as well.
Also, I generally find it more beneficial to separate the few largest tables
onto their own disk resources than to put all tables on one resource and all
indexes on another.  For example, for TPCH-like tests, I do:
array0: OS and pgsql_tmp
array1: LINEITEM
array2: LINEITEM Indexes
array3: all other tables and indexes
array4: pg_xlog
array5: source data

This allows me to load a 100G (actually 270G) TPCH-like database in < 2 hours,
not counting index-building.

> 2) define all foreign keys as initially deferred

It would be better to drop them before import and recreate them afterwards.
Same for indexes unless those indexes are over 2G in size.
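Roughly like this (table, constraint, and index names are placeholders):

```sql
-- before the import
ALTER TABLE import_data DROP CONSTRAINT import_data_parent_fk;
DROP INDEX import_data_ts_idx;

COPY import_data FROM '/data/import/batch.dat';

-- after the import
CREATE INDEX import_data_ts_idx ON import_data (ts);
ALTER TABLE import_data
    ADD CONSTRAINT import_data_parent_fk
    FOREIGN KEY (parent_id) REFERENCES parent (id);
```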

>     max_connections = 20
>     shared_buffers = 30000
>     work_mem = 8192

Not high enough, unless you have very little RAM.  On an 8G machine I'm using
256MB.  You might want to use 64MB or 128MB.

>     maintenance_work_mem = 32768

REALLY not high enough.  You're going to need to build big indexes and
possibly vacuum large tables.  I use the maximum of 1.98GB.  Use up to 1/3 of
your RAM for this.

>     checkpoint_segments = 12

Also way too low.  Put pg_xlog on its own disk, give it 128 to 512 segments
(up to 8G).
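
Pulling those together, a sketch for your 1GB box (values in the 8.0-era
units, i.e. KB for the *_mem settings; tune to taste):

```
shared_buffers = 30000            # ~234MB; reasonable as-is
work_mem = 65536                  # 64MB
maintenance_work_mem = 262144     # 256MB, roughly 1/4 of RAM
checkpoint_segments = 128
```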

> The server runs RedHat and has 1GB RAM

Make sure you're running a 2.6.10+ kernel.  Make sure ext3 is set noatime,
data=writeback.  Buy more RAM.  Etc.

> How does Postgres handle concurrent copy from on: same table / different
> tables ?

Same table is useless; the imports will effectively serialize (unless you use
pseudo-partitioning).  You can parallel-load multiple tables up to the
lower of your number of disk channels or number of processors.

--
Josh Berkus
Aglio Database Solutions
San Francisco