Re: Large Database Design Help - Mailing list pgsql-performance
From: PFC
Subject: Re: Large Database Design Help
Msg-id: op.s4rvzrricigqcu@apollo13
In response to: Large Database Design Help (Orion Henry <lab@orangekids.org>)
List: pgsql-performance
> was originally designed for Postgres 7.0 on a PIII 500MHz and some

Argh.

> 1) The database is very large, the largest table has 40 million tuples.

Are these simple types (a few ints, text...)? How much space does the table use on disk? Can it fit in RAM?

> 2) The database needs to import 10's of thousands of tuples each night
> quickly. The current method is VERY slow.

You bet: COMMITting after each insert or update is about the worst thing that can be done. It works fine on MySQL/MyISAM (which doesn't know about commit...), so I'd guess the system designer had previous experience with MySQL.

My advice would be:
- get a decent machine with some RAM (I guess you already knew this)...

Now, the update. I would tend to do this:
- Generate a text file with your update data, using whatever tool you like best (perl, php, python, java...)
- CREATE TEMPORARY TABLE blah ...
- COPY blah FROM your update file.

COPY is super fast. I think temporary tables don't write to the xlog, so they are also very fast. This should not take more than a few seconds for a few tens of thousands of simple rows on modern hardware. It actually takes a fraction of a second on my PC for about 9K rows with 5 INTEGERs each.

You can also add constraints on your temporary table to sanitize your data, so you can be reasonably sure the following updates will work. The data you feed to COPY should be correct, or the load will roll back; it is your script's job to escape everything.

Now you've got your data in the database. You have several options (rough sketches at the end of this mail):

- You are confident that the UPDATE will work without being rolled back by some constraint violation. Therefore, you issue a big joined UPDATE to update all the rows in your main table which are also in your temp table. Then you issue an INSERT INTO ... SELECT ... to insert the ones which were not already in the big table.

Joined updates can be slow if your RAM is too small and the server has to thrash the disk looking for every tuple. You can cheat and CLUSTER your main table (say, once a week), so it is all in index order. Then you arrange your update data in the same order (for instance, SELECT INTO another temp table with an ORDER BY matching the CLUSTER on the main table). Having both in the same order helps reduce random disk accesses.

- If you don't like this method, you might want to use the same strategy as before (i.e. a zillion queries), but write it in PL/pgSQL instead. PL/pgSQL is a lot faster, because everything is already parsed and planned beforehand. So you could do the following:

- for each row in the temporary update table:
  - UPDATE the corresponding row in the main table
  - IF FOUND, then cool, it was updated, nothing more to do. You don't need to SELECT to know whether the row is there; UPDATE tells you, without the race condition.
  - IF NOT FOUND, then INSERT. This has a race condition. You know your application, so you'll know whether it matters or not.

What do you think?

> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTS) and some are altered records (UPDATES)
> and the only way I can think of to identify these records is to perform
> a select for each record.

Yes and no; if you must do this, then use PL/pgSQL, it's a lot faster. And skip the SELECT.

Also, use the latest version. It really rocks.

Like many said on the list, put pg_xlog on its own physical disk, with ext2fs.

> 3) Wrap each load into a transaction ( tens of thousands of records per
> load )

That's the idea.
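For instance, the first option could look something like this. A rough sketch only: I'm assuming a made-up main table main(id INTEGER PRIMARY KEY, val INTEGER) and a tab-separated update file; adjust the names and columns to your schema.

    -- load the nightly update file into a temporary table
    CREATE TEMPORARY TABLE upd (id INTEGER PRIMARY KEY, val INTEGER);
    COPY upd FROM '/path/to/update_file';  -- server-side path; use \copy from psql for a client-side file

    BEGIN;
    -- update the rows that already exist in the main table
    UPDATE main SET val = upd.val FROM upd WHERE main.id = upd.id;
    -- insert the rows that are not there yet
    INSERT INTO main (id, val)
        SELECT upd.id, upd.val
        FROM upd LEFT JOIN main ON main.id = upd.id
        WHERE main.id IS NULL;
    COMMIT;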
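And the PL/pgSQL variant, with the same made-up table and column names (again an untested sketch; also note that since PL/pgSQL caches plans, you may need EXECUTE inside the function if the temp table is dropped and recreated each night):

    CREATE OR REPLACE FUNCTION apply_updates() RETURNS void AS $$
    DECLARE
        r RECORD;
    BEGIN
        FOR r IN SELECT * FROM upd LOOP
            -- try the update first
            UPDATE main SET val = r.val WHERE id = r.id;
            IF NOT FOUND THEN
                -- row wasn't there: insert it (race condition, see above)
                INSERT INTO main (id, val) VALUES (r.id, r.val);
            END IF;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;

    SELECT apply_updates();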
The first strategy above (the big joined UPDATE) uses one transaction anyway. For the other one, it's your choice: you can do it all in one transaction, or in batches of 1000 rows... but one row per transaction would be horrendously slow.

Regards,
P.F.C