Thread: tuning Postgres for large data import (using Copy from)
Hello,

I'd like to tune Postgres for large data import (using COPY FROM).

Here are a few steps already done:

1) use 3 different disks for:

   -1: source data
   -2: index tablespaces
   -3: data tablespaces

2) define all foreign keys as initially deferred

3) tune some parameters:

   max_connections = 20
   shared_buffers = 30000
   work_mem = 8192
   maintenance_work_mem = 32768
   checkpoint_segments = 12

   (I also modified the kernel accordingly)

4) run VACUUM regularly

The server runs RedHat and has 1GB RAM.

In production (which may run on a better server), I plan to:

- import a few million rows per day,
- keep up to ca. 100 million rows in the db,
- delete older data.

I've seen a few postings on hash/btree indexes which say that hash indexes do not work very well in Postgres; currently, I only use btree indexes. Could I gain performance by using hash indexes as well?

How does Postgres handle concurrent COPY FROM on: same table / different tables?

I'd be glad of any further suggestions on how to increase my performance.

Marc
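Step 2 above (foreign keys defined as initially deferred) looks roughly like this in DDL; the table and column names here are made up for illustration:

```sql
-- Hypothetical parent/child tables. With DEFERRABLE INITIALLY DEFERRED,
-- the FK check is postponed until COMMIT, so a bulk COPY inside one
-- transaction can load rows in any order without tripping the constraint.
CREATE TABLE parent (
    id integer PRIMARY KEY
);

CREATE TABLE child (
    id        integer PRIMARY KEY,
    parent_id integer REFERENCES parent(id)
              DEFERRABLE INITIALLY DEFERRED
);
```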
"Marc Mamin" <m.mamin@gmx.net> writes:
> 1) use 3 different disks for:
>    -1: source data
>    -2: index tablespaces
>    -3: data tablespaces

It's probably much more important to know where you put the WAL.

			regards, tom lane
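The usual way to put the WAL on its own disk is to move pg_xlog and leave a symlink behind. A minimal sketch, with the server stopped; the paths are illustrative (temporary directories stand in for $PGDATA and the dedicated WAL mount):

```shell
# Sketch: relocate pg_xlog to a dedicated disk via a symlink.
# PGDATA and WALDISK are stand-in paths; substitute your real ones,
# and only do this while the server is STOPPED (pg_ctl stop).
PGDATA=$(mktemp -d)/data
WALDISK=$(mktemp -d)/wal
mkdir -p "$PGDATA/pg_xlog" "$WALDISK"

# Move the WAL directory to the dedicated disk, leave a symlink in place:
mv "$PGDATA/pg_xlog" "$WALDISK/pg_xlog"
ln -s "$WALDISK/pg_xlog" "$PGDATA/pg_xlog"

ls -ld "$PGDATA/pg_xlog"   # shows a symlink pointing at the WAL disk
```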
Marc Mamin wrote:
> Hello,

I'm not an expert, but I'll give some suggestions.

> I'd like to tune Postgres for large data import (using Copy from).

I believe that COPY FROM <file> is supposed to be faster than COPY FROM STDIN, but <file> must be available to the backend process. If you can do it, you should think about it, as it eliminates the communication between the client and the backend.

> here are a few steps already done:
>
> 1) use 3 different disks for:
>    -1: source data
>    -2: index tablespaces
>    -3: data tablespaces

Make sure pg_xlog is on its own filesystem. It contains the write-ahead log, and putting it by itself keeps the number of seeks down. If you are constrained, I think pg_xlog is more important than moving the index tablespaces.

> 2) define all foreign keys as initially deferred
>
> 3) tune some parameters:
>
>    max_connections = 20
>    shared_buffers = 30000
>    work_mem = 8192
>    maintenance_work_mem = 32768
>    checkpoint_segments = 12
>
>    (I also modified the kernel accordingly)

Don't forget to increase your free space map if you are going to be doing deletes frequently.

> 4) run VACUUM regularly
>
> The server runs RedHat and has 1GB RAM.
>
> In production (which may run on a better server), I plan to:
>
> - import a few million rows per day,
> - keep up to ca. 100 million rows in the db,
> - delete older data.
>
> I've seen a few postings on hash/btree indexes which say that hash indexes
> do not work very well in Postgres; currently, I only use btree indexes.
> Could I gain performance by using hash indexes as well?

I doubt it.

> How does Postgres handle concurrent copy from on: same table / different
> tables?

I think it is better with different tables. If using the same table, and there are indexes, it has to grab a lock for updating the index, which causes contention between 2 processes writing to the same table.
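The free-space-map advice maps to two postgresql.conf settings; the values below are guesses for a workload deleting a few million rows per day (check VACUUM VERBOSE output to size them properly):

```
# postgresql.conf (8.0-era parameters; sizes are illustrative)
max_fsm_pages = 500000      # disk pages with reusable free space to track
max_fsm_relations = 1000    # tables and indexes tracked in the FSM
```

If the FSM is too small, space freed by your daily deletes can't be reused and the tables bloat despite regular VACUUM.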
> I'd be glad of any further suggestions on how to further increase my
> performance.

Since you are deleting data often, and copying often, I might recommend using a partition scheme with a view to bind everything together. That way you can just drop the old table rather than doing a delete. I don't know how this would affect foreign key references.

But basically you can create a new table, do the copy without having any indexes, then build the indexes, analyze, and update the view. And when deleting you can update the view and drop the old table.

Something like this:

  CREATE TABLE table_2005_05_11 (blah);
  COPY table_2005_05_11 FROM ... ;
  CREATE INDEX blah ON table_2005_05_11(blah);
  CREATE OR REPLACE VIEW table AS
      SELECT * FROM table_2005_05_10
      UNION ALL
      SELECT * FROM table_2005_05_11;
  VACUUM ANALYZE table_2005_05_11;
  ...

John
=:->
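The deletion side of this scheme could look like the sketch below, following the same date-suffixed table names. The view is called alldata here as a hypothetical name (note that "table", used in the example above, is a reserved word and would need quoting):

```sql
-- Rotate the oldest day out: repoint the view at the surviving tables
-- first, then drop the old table. No row-by-row DELETE is needed, and
-- there are no dead tuples for VACUUM to reclaim.
CREATE OR REPLACE VIEW alldata AS
    SELECT * FROM table_2005_05_11
    UNION ALL
    SELECT * FROM table_2005_05_12;

DROP TABLE table_2005_05_10;
```

Dropping a table is effectively instantaneous regardless of its size, which is the whole appeal compared with deleting millions of rows.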
Marc,

> 1) use 3 different disks for:
>    -1: source data
>    -2: index tablespaces
>    -3: data tablespaces

Others have already told you about the importance of relocating WAL. If you are going to be building indexes on the imported data, you might find it beneficial to relocate pgsql_tmp for the database in question as well. Also, I generally find it more beneficial to separate the few largest tables onto their own disk resources than to put all tables on one resource and all indexes on another. For example, for TPCH-like tests, I do:

  array0: OS and pgsql_tmp
  array1: LINEITEM
  array2: LINEITEM indexes
  array3: all other tables and indexes
  array4: pg_xlog
  array5: source data

This allows me to load a 100G (actually 270G) TPCH-like database in < 2 hours, not counting index building.

> 2) define all foreign keys as initially deferred

It would be better to drop them before import and recreate them afterwards. Same for indexes, unless those indexes are over 2G in size.

> max_connections = 20
> shared_buffers = 30000
> work_mem = 8192

Not high enough, unless you have very little RAM. On an 8G machine I'm using 256MB. You might want to use 64MB or 128MB.

> maintenance_work_mem = 32768

REALLY not high enough. You're going to need to build big indexes and possibly vacuum large tables. I use the maximum of 1.98GB. Use up to 1/3 of your RAM for this.

> checkpoint_segments = 12

Also way too low. Put pg_xlog on its own disk, give it 128 to 512 segments (up to 8G).

> The server runs RedHat and has 1GB RAM

Make sure you're running a 2.6.10+ kernel. Make sure ext3 is set noatime, data=writeback. Buy more RAM. Etc.

> How does Postgres handle concurrent copy from on: same table / different
> tables?

Same table is useless; the imports will effectively serialize (unless you use pseudo-partitioning). You can parallel-load into multiple tables up to the lower of your number of disk channels or number of processors.

--
Josh Berkus
Aglio Database Solutions
San Francisco
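Pulling Josh's numbers together into a postgresql.conf fragment for a bulk-load box (values are one possible reading of his advice, sized for roughly 1GB of RAM; 8.0-era parameters, where work_mem and maintenance_work_mem are in KB and shared_buffers in 8KB pages):

```
# postgresql.conf -- bulk-load oriented sketch, scale to your hardware
work_mem = 65536              # 64MB per sort/hash; multiply by max_connections
maintenance_work_mem = 262144 # 256MB for index builds and VACUUM (up to 1/3 RAM)
checkpoint_segments = 128     # 128-512 with pg_xlog on its own disk
                              # (each segment is 16MB, so 512 ~ 8G of WAL)
```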