Thread: Large Database Design Help
Hello All,

I've inherited a PostgreSQL database that I would like to refactor. It was
originally designed for Postgres 7.0 on a PIII 500MHz, and some design
decisions were made that don't make sense any more. Here's the problem:

1) The database is very large; the largest table has 40 million tuples.

2) The database needs to import tens of thousands of tuples each night,
quickly. The current method is VERY slow.

3) I can't import new records with a COPY or drop my indexes, because some of
them are new records (INSERTs) and some are altered records (UPDATEs), and the
only way I can think of to identify these records is to perform a select for
each record.

Here is how the database is currently laid out, and you'll see why I have a
problem with it:

1) The data is easily partitionable by client ID. In an attempt to keep the
indexes small and the inserts fast, one table was made per client ID. Thus the
primary table in the database (the one with 40 million tuples) is really 133
tables, each ending with a three-digit suffix. The largest of these client
tables has 8 million of the 40 million tuples. The system started with around
half a dozen clients and is now a huge pain to manage with so many tables. I
was hoping new hardware and new Postgres features would allow this data to be
merged safely into a single table.

2) The imports are not done inside of transactions. I'm assuming the system
designers excluded this for a reason. Will I run into problems performing tens
of thousands of inserts and updates inside a single transaction?

3) The current code that bulk loads data into the database is a loop that
looks like this:

    $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
                   WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
    if ($result == 0)
    {
        $result = exe("UPDATE $table SET $non_keys WHERE $keys");
    }

Is there a faster way to bulk load data when it's not known ahead of time
whether it's a new record or an updated record?

What I would LIKE to do, but am afraid I will hit a serious performance wall
(or am missing an obvious / better way to do it):

1) Merge all 133 client tables into a single new table, add a client_id
column, and do the data partitioning on the indexes, not the tables, as seen
here:

    CREATE INDEX actioninfo_order_number_XXX_idx
        ON actioninfo ( order_number ) WHERE client_id = XXX;
    CREATE INDEX actioninfo_trans_date_XXX_idx
        ON actioninfo ( transaction_date ) WHERE client_id = XXX;

(Aside question: if I were to find a way to use COPY and I were loading data
for a single client_id, would dropping just the indexes for that client_id
accelerate the load?)

2) Find some way to make the bulk loads faster or more efficient (help!)

3) Wrap each load into a transaction (tens of thousands of records per load).

Is this a good plan? Is there a better way? Am I walking into a trap? Should I
leave well enough alone and not try to fix something that's not broken?
FWIW, here's the hardware and the profile of the current uber table:

       Column       |  Type   | Modifiers
 -------------------+---------+-----------
  order_number      | integer | not null
  order_line_number | integer | not null
  action_number     | integer | not null
  transaction_date  | date    |
  code              | text    |
  trans_group_code  | text    |
  quantity          | integer |
  extension         | money   |
  sales_tax         | money   |
  shipping          | money   |
  discount          | money   |

Dual Opteron 246, 4-disk SCSI RAID5, 4GB of RAM

    # du -sh /var/lib/postgres/data/
    16G     /var/lib/postgres/data/

( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and when I
do this refactoring )
( the current OS is Debian Unstable, but I intend to be running RHEL 4.0 if
and when I do this refactoring )
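[For reference, a minimal sketch of the merged table proposed in plan (1)
above, built from the column profile just posted. The client_id type, the lack
of a primary key, and the use of client 42 in the partial indexes are
illustrative assumptions, not something taken from the existing schema.]

    -- Hypothetical merged version of the 133 per-client tables;
    -- client_id is new, everything else mirrors the profile above.
    CREATE TABLE actioninfo (
        client_id          integer NOT NULL,
        order_number       integer NOT NULL,
        order_line_number  integer NOT NULL,
        action_number      integer NOT NULL,
        transaction_date   date,
        code               text,
        trans_group_code   text,
        quantity           integer,
        extension          money,
        sales_tax          money,
        shipping           money,
        discount           money
    );

    -- One pair of partial indexes per client, as proposed (client 42 shown):
    CREATE INDEX actioninfo_order_number_042_idx
        ON actioninfo ( order_number ) WHERE client_id = 42;
    CREATE INDEX actioninfo_trans_date_042_idx
        ON actioninfo ( transaction_date ) WHERE client_id = 42;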
On Thu, Feb 09, 2006 at 11:07:06AM -0800, Orion Henry wrote:
> Hello All,
>
> I've inherited a PostgreSQL database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz, and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large; the largest table has 40 million tuples.
>
> 2) The database needs to import tens of thousands of tuples each night,
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes, because some
> of them are new records (INSERTs) and some are altered records (UPDATEs),
> and the only way I can think of to identify these records is to perform
> a select for each record.
>
> Here is how the database is currently laid out, and you'll see why I have
> a problem with it:
>
> 1) The data is easily partitionable by client ID. In an attempt to keep
> the indexes small and the inserts fast, one table was made per client ID.
> Thus the primary table in the database (the one with 40 million tuples) is
> really 133 tables, each ending with a three-digit suffix. The largest of
> these client tables has 8 million of the 40 million tuples. The system
> started with around half a dozen clients and is now a huge pain to manage
> with so many tables. I was hoping new hardware and new Postgres features
> would allow this data to be merged safely into a single table.

If all the clients are equally active then partitioning by client probably
makes little sense. If some clients are much more active than others then
leaving this as-is could be a pretty big win. If the partitioning is done with
either a view and rules, or inherited tables and rules, it shouldn't be too
hard to manage.

> 2) The imports are not done inside of transactions. I'm assuming the
> system designers excluded this for a reason. Will I run into problems
> performing tens of thousands of inserts and updates inside a single
> transaction?

Never attribute to thoughtful design that which can be fully explained by
ignorance. :) I'd bet they just didn't know any better.

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

Uuuugly. :) Instead, load everything into a temp table using COPY, then do an

    UPDATE real_table ... FROM temp_table t WHERE real_table.key = t.key

followed by an INSERT ... SELECT ... WHERE NOT EXISTS. But take note that this
is a race condition, so you can only do it if you know nothing else will be
inserting into the real table at the same time.

You might want to look at the stats-proc code at http://cvs.distributed.net;
it does exactly this type of thing.
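[A minimal sketch of the COPY-into-a-temp-table load described above, assuming
the merged actioninfo table and a hypothetical key of (client_id, order_number,
order_line_number, action_number); the real key columns, the full column list,
and the file path would come from the actual schema and loader.]

    -- Staging table with the same shape as the target (names are assumed).
    CREATE TEMP TABLE actioninfo_load AS
        SELECT * FROM actioninfo LIMIT 0;

    -- Server-side path shown; from psql you could use \copy instead.
    COPY actioninfo_load FROM '/path/to/nightly_load.dat';

    -- Update rows that already exist in the target.
    UPDATE actioninfo
       SET quantity  = l.quantity,
           extension = l.extension      -- remaining non-key columns likewise
      FROM actioninfo_load l
     WHERE actioninfo.client_id         = l.client_id
       AND actioninfo.order_number      = l.order_number
       AND actioninfo.order_line_number = l.order_line_number
       AND actioninfo.action_number     = l.action_number;

    -- Insert rows that are genuinely new.
    INSERT INTO actioninfo
    SELECT l.*
      FROM actioninfo_load l
     WHERE NOT EXISTS (SELECT 1 FROM actioninfo a
                        WHERE a.client_id         = l.client_id
                          AND a.order_number      = l.order_number
                          AND a.order_line_number = l.order_line_number
                          AND a.action_number     = l.action_number);

[As noted above, this is only safe if nothing else inserts into the real table
while the load runs.]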
> What I would LIKE to do, but am afraid I will hit a serious performance
> wall (or am missing an obvious / better way to do it):
>
> 1) Merge all 133 client tables into a single new table, add a client_id
> column, and do the data partitioning on the indexes, not the tables, as
> seen here:
>
>     CREATE INDEX actioninfo_order_number_XXX_idx
>         ON actioninfo ( order_number ) WHERE client_id = XXX;
>     CREATE INDEX actioninfo_trans_date_XXX_idx
>         ON actioninfo ( transaction_date ) WHERE client_id = XXX;
>
> (Aside question: if I were to find a way to use COPY and I were loading
> data for a single client_id, would dropping just the indexes for that
> client_id accelerate the load?)

Hrm, I believe it would...

> 2) Find some way to make the bulk loads faster or more efficient (help!)

Don't do things row-by-row. If you can't ensure that there will be only one
process inserting (to eliminate the race condition I mentioned above), then
reply back and I'll point you at code that should still be much faster than
what you're doing now.

> 3) Wrap each load into a transaction (tens of thousands of records per
> load).

Getting rid of row-by-row processing will be your biggest win. If you do have
to go row-by-row, at least wrap it in a transaction. As long as the
transaction doesn't take *too* long it won't be an issue.

> Is this a good plan? Is there a better way? Am I walking into a trap?
> Should I leave well enough alone and not try to fix something that's
> not broken?
>
> FWIW, here's the hardware and the profile of the current uber table:
>
>        Column       |  Type   | Modifiers
>  -------------------+---------+-----------
>   order_number      | integer | not null
>   order_line_number | integer | not null
>   action_number     | integer | not null
>   transaction_date  | date    |
>   code              | text    |
>   trans_group_code  | text    |
>   quantity          | integer |
>   extension         | money   |
>   sales_tax         | money   |
>   shipping          | money   |
>   discount          | money   |
>
> Dual Opteron 246, 4-disk SCSI RAID5, 4GB of RAM

Remember that the write performance of RAID5 normally stinks.

> # du -sh /var/lib/postgres/data/
> 16G /var/lib/postgres/data/
>
> ( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
> when I do this refactoring )

Going to 8.1 would help in a large number of ways even if you don't refactor.
The stats-proc code I mentioned runs 2x faster under 8.1 than it does under
7.4.

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
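[For the row-by-row fallback, "wrap it in a transaction" just means bracketing
the existing statements with BEGIN/COMMIT, so the WAL is flushed once per load
instead of once per row. A schematic example using the merged actioninfo
layout and made-up key values; in the real loader the UPDATE is only issued
when the INSERT reports 0 rows.]

    BEGIN;

    -- One pair of statements per incoming record, as in the current loader.
    INSERT INTO actioninfo (client_id, order_number, order_line_number,
                            action_number, quantity)
        SELECT 42, 1001, 1, 1, 3
         WHERE NOT EXISTS (SELECT 1 FROM actioninfo
                            WHERE client_id = 42 AND order_number = 1001
                              AND order_line_number = 1 AND action_number = 1);

    UPDATE actioninfo
       SET quantity = 3
     WHERE client_id = 42 AND order_number = 1001
       AND order_line_number = 1 AND action_number = 1;

    -- ... thousands more record pairs ...

    COMMIT;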
Orion Henry <lab@orangekids.org> writes:

> What I would LIKE to do, but am afraid I will hit a serious performance wall
> (or am missing an obvious / better way to do it):
>
> 1) Merge all 133 client tables into a single new table, add a client_id
> column, and do the data partitioning on the indexes, not the tables, as
> seen here:
>
>     CREATE INDEX actioninfo_order_number_XXX_idx
>         ON actioninfo ( order_number ) WHERE client_id = XXX;
>     CREATE INDEX actioninfo_trans_date_XXX_idx
>         ON actioninfo ( transaction_date ) WHERE client_id = XXX;

The advantages of the partitioned scheme are: a) you can drop a client quickly
in a single operation, b) the indexes are only half as wide since they don't
include client_id, and c) you can do a sequential scan of an entire client
without using the index at all.

Unless any of these are overwhelming, I would say to go ahead and merge them.
If you frequently scan all the records of a single client, or frequently drop
entire clients, then the current scheme may be helpful.

> (Aside question: if I were to find a way to use COPY and I were loading
> data for a single client_id, would dropping just the indexes for that
> client_id accelerate the load?)

Dropping indexes would accelerate the load, but unless you're loading a large
number of records relative to the current size, I'm not sure it would be a
win, since you would then have to rebuild the index for the entire segment.

> 2) Find some way to make the bulk loads faster or more efficient (help!)

If your existing data isn't changing while you're doing the load (and if it
is, then your existing load process has a race condition, btw) then you could
do it in a couple of big queries:

    COPY ${table}_new FROM '...';

    CREATE TABLE ${table}_exists AS
        SELECT * FROM ${table}_new
         WHERE EXISTS (SELECT 1 FROM ${table}
                        WHERE ${table}_new.key = ${table}.key);

    CREATE TABLE ${table}_insert AS
        SELECT * FROM ${table}_new
         WHERE NOT EXISTS (SELECT 1 FROM ${table}
                            WHERE ${table}_new.key = ${table}.key);

    UPDATE ${table} SET ... FROM ${table}_exists
     WHERE ${table}_exists.key = ${table}.key;

    INSERT INTO ${table} (SELECT * FROM ${table}_insert);

Actually, you could skip the whole ${table}_insert step there and just do the
insert directly, I guess. There are also other approaches you could use, like
adding a new column to ${table}_new instead of creating new tables, etc.

> 3) Wrap each load into a transaction (tens of thousands of records per load)

Yes, Postgres is faster if you do more operations in a single transaction.
Every COMMIT means waiting for an fsync.

The only disadvantage to batching them into a large transaction is that if it
lasts a *long* time, it could create problems with your vacuum strategy. Any
vacuum that runs while the transaction is still running won't be able to
vacuum anything.

You might consider running VACUUM FULL or CLUSTER on the table when you're
done with the loading process. It will lock the table while it runs, though.

--
greg
On 2/9/06, Orion Henry <lab@orangekids.org> wrote:
> Hello All,
>
> I've inherited a PostgreSQL database that I would like to refactor. It
> was originally designed for Postgres 7.0 on a PIII 500MHz, and some
> design decisions were made that don't make sense any more. Here's the
> problem:
>
> 1) The database is very large; the largest table has 40 million tuples.
>
> 2) The database needs to import tens of thousands of tuples each night,
> quickly. The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes, because some
> of them are new records (INSERTs) and some are altered records (UPDATEs),
> and the only way I can think of to identify these records is to perform
> a select for each record.

[snip]

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

I experimented with something like this and was able to progressively decrease
the amount of time needed for an import. The final solution, which took my
import down from approximately 24 hours to about 30 minutes, was to use a
C#/Java hashtable or a Python dictionary. For example, the unique data in one
particular table was "User_Agent", so I made it the key in my hashtable. I
added a method to the hashtable so that when I added a new record to it, it
would do the insert into the db.

The downside to this is that it used *GOBS* of RAM. Using Python, I was able
to dramatically decrease the RAM usage by switching to a GDB-based dictionary
instead of the standard dictionary. It only increased the time by about 50%,
so the total processing time was about 45 minutes vs. the previous 30 minutes.

I only had about 35 million records, and my technique was getting to the point
where it was unwieldy, so with your 40 million and counting records you would
probably want to start with the GDB technique unless you have a ton of
available RAM.

You might interpret this as a knock against PostgreSQL since I pulled the data
out of the db, but it's not; you'd be hard pressed to find anything as fast as
an in-memory hashtable or an on-disk GDB. However, its usefulness is very
limited, and for anything more complex than key=>value lookups, moving to
PostgreSQL is likely a big win.

--
Matthew Nuzum
www.bearfruit.org
Hi, Greg,

Greg Stark wrote:

>> (Aside question: if I were to find a way to use COPY and I were loading
>> data for a single client_id, would dropping just the indexes for that
>> client_id accelerate the load?)

> Dropping indexes would accelerate the load, but unless you're loading a
> large number of records relative to the current size, I'm not sure it
> would be a win, since you would then have to rebuild the index for the
> entire segment.

And, additionally, rebuilding a partial index with "WHERE client_id=42" needs
a full table scan, which is very slow, so temporarily dropping the indices
will not be useful if you merge the tables.

Btw, I don't know whether PostgreSQL can make use of partial indices when
building other partial indices. If yes, you could temporarily drop all but one
of the partial indices for a specific client.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
Hi, Henry,

Orion Henry wrote:

> 1) The database is very large; the largest table has 40 million tuples.

I'm afraid this doesn't qualify as '_very_ large' yet, but it definitely is
large enough to have some deep thoughts about it. :-)

> 1) The data is easily partitionable by client ID. In an attempt to keep
> the indexes small and the inserts fast, one table was made per client ID.
> Thus the primary table in the database (the one with 40 million tuples) is
> really 133 tables, each ending with a three-digit suffix. The largest of
> these client tables has 8 million of the 40 million tuples. The system
> started with around half a dozen clients and is now a huge pain to manage
> with so many tables. I was hoping new hardware and new Postgres features
> would allow this data to be merged safely into a single table.

It possibly is a good idea to merge them.

If you decide to keep them separated for whatever reason, you might want to
use schemas instead of three-digit suffixes. Together with appropriately named
users or 'set search_path', this may help you to simplify your software.

In case you want to keep separate tables, but need some reports touching all
tables from time to time, table inheritance may help you. Just create a base
table, and then inherit all user-specific tables from that base table. Of
course, this can be combined with the schema approach by having the child
tables in their appropriate schemas.

> 2) The imports are not done inside of transactions. I'm assuming the
> system designers excluded this for a reason. Will I run into problems
> performing tens of thousands of inserts and updates inside a single
> transaction?

No, you shouldn't; quite the opposite, it should give you a huge boost. Every
commit has to flush the WAL out to disk, which takes at least one disk spin.
So on a simple 7200 RPM disk, you cannot have more than 120
transactions/second.

It may make sense to split such a bulk load into transactions of some tens of
thousands of rows each, but that depends on how easy it is for your
application to resume in the middle of the bulk if the connection aborts, and
how much concurrent access you have on the backend.

> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
>     $result = exe("INSERT INTO $table ($name_str) SELECT $val_str
>                    WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
>     if ($result == 0)
>     {
>         $result = exe("UPDATE $table SET $non_keys WHERE $keys");
>     }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

Perhaps the easiest way might be to issue the update first. UPDATE returns a
row count of the updated rows. If it is 0, you have to insert the row.

This can even be encapsulated into a "before insert" trigger on the table,
which tries the update and ignores the insert if the update succeeded (a
minimal sketch follows below). This way, you can even use COPY on the client
side. We're using this approach for one of our databases, where a client-side
crash can result in occasional duplicates being COPYed to the table.

> Dual Opteron 246, 4-disk SCSI RAID5, 4GB of RAM

For lots of non-read-only database workloads, RAID5 is a performance killer.
RAID 1/0 might be better, or having two mirrors of two disks each: the first
mirror holding system, swap, and the PostgreSQL WAL files, the second one
holding the data.

Don't forget to tune the PostgreSQL settings appropriately. :-)
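[A minimal sketch of such an update-first trigger, assuming the merged
actioninfo table, a hypothetical key of (client_id, order_number,
order_line_number, action_number), 8.x dollar-quoting, and that plpgsql is
installed; only two payload columns are shown to keep it short.]

    CREATE OR REPLACE FUNCTION actioninfo_upsert() RETURNS trigger AS $$
    BEGIN
        -- Try the update first; if a row was touched, suppress the insert.
        UPDATE actioninfo
           SET quantity  = NEW.quantity,
               extension = NEW.extension   -- remaining non-key columns likewise
         WHERE client_id         = NEW.client_id
           AND order_number      = NEW.order_number
           AND order_line_number = NEW.order_line_number
           AND action_number     = NEW.action_number;
        IF FOUND THEN
            RETURN NULL;   -- row existed and was updated; skip the INSERT
        END IF;
        RETURN NEW;        -- genuinely new row; let the INSERT proceed
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER actioninfo_upsert_trg
        BEFORE INSERT ON actioninfo
        FOR EACH ROW EXECUTE PROCEDURE actioninfo_upsert();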
> # du -sh /var/lib/postgres/data/
> 16G /var/lib/postgres/data/

Your database seems to be small enough to fit on a single disk, so the
two-mirrors approach I described above will be the best, IMHO.

> ( the current database is PG 7.4 - I intend to upgrade it to 8.1 if and
> when I do this refactoring )

This is a very good idea; 8.1 is miles ahead of 7.4 in many aspects.

> ( the current OS is Debian Unstable, but I intend to be running RHEL 4.0
> if and when I do this refactoring )

This should not make too much difference.

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org
> was originally designed for Postgres 7.0 on a PIII 500MHz and some

Argh.

> 1) The database is very large; the largest table has 40 million tuples.

Is this simple types (like a few ints, text...)? How much space does it use on
disk? Can it fit in RAM?

> 2) The database needs to import tens of thousands of tuples each night,
> quickly. The current method is VERY slow.

You bet, COMMITing after each insert or update is about the worst that can be
done. It works fine on MySQL/MyISAM (which doesn't know about commit...), so
I'd guess the system designer had previous experience with MySQL.

My advice would be:

- get a decent machine with some RAM (I guess you already knew this)...

Now, the update. I would tend to do this:

- Generate a text file with your update data, using whatever tool you like
  best (perl, php, python, java...)
- CREATE TEMPORARY TABLE blah ...
- COPY blah FROM your update file.

COPY is super fast. I think temporary tables don't write to the xlog, so they
are also very fast. This should not take more than a few seconds for a few
tens of thousands of simple rows on modern hardware. It actually takes a
fraction of a second on my PC for about 9K rows with 5 INTEGERs in them.

You can also add constraints on your temporary table, to sanitize your data,
in order to be reasonably sure that the following updates will work. The data
you feed to COPY should be correct, or it will roll back; it is your script's
job to escape everything.

Now you've got your data in the database. You have several options:

- You are confident that the UPDATE will work without being rolled back by
  some constraint violation. Therefore, you issue a big joined UPDATE to
  update all the rows in your main table which are also in your temp table.
  Then you issue an INSERT INTO ... SELECT ... to insert the ones which were
  not already in the big table.

  Joined updates can be slow if your RAM is too small and the machine has to
  thrash the disk looking for every tuple. You can cheat and CLUSTER your main
  table (say, once a week), so it is all in index order. Then you arrange your
  update data so it is in the same order (for instance, you SELECT INTO
  another temp table, with an ORDER BY corresponding to the CLUSTER on the
  main table). Having both in the same order will help reduce random disk
  accesses.

- If you don't like this method, then you might want to use the same strategy
  as before (i.e. a zillion queries), but write it in PL/pgSQL instead.
  PL/pgSQL is a lot faster, because everything is already parsed and planned
  beforehand. So you could do the following, for each row in the temporary
  update table (see the sketch below):

  - UPDATE the corresponding row in the main table.
  - IF FOUND, then cool, it was updated, nothing more to do. You don't need to
    SELECT in order to know if the row is there; UPDATE does it for you,
    without the race condition.
  - IF NOT FOUND, then insert. This has a race condition. You know your
    application, so you'll know if it matters or not.

What do you think?

> 3) I can't import new records with a COPY or drop my indexes, because some
> of them are new records (INSERTs) and some are altered records (UPDATEs),
> and the only way I can think of to identify these records is to perform
> a select for each record.

Yes and no; if you must do this, then use PL/pgSQL, it's a lot faster. And
skip the SELECT.

Also, use the latest version. It really rocks. Like many said on the list, put
pg_xlog on its own physical disk, with ext2fs.

> 3) Wrap each load into a transaction (tens of thousands of records per
> load).

That's the idea.
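[A minimal sketch of that second, row-at-a-time strategy as a PL/pgSQL
function walking a staging table. The names (actioninfo, actioninfo_load), the
key columns, and the two payload columns are assumptions; 8.x dollar-quoting
and an installed plpgsql are also assumed.]

    CREATE OR REPLACE FUNCTION merge_actioninfo_load() RETURNS integer AS $$
    DECLARE
        r       RECORD;
        n_rows  integer := 0;
    BEGIN
        FOR r IN SELECT * FROM actioninfo_load LOOP
            UPDATE actioninfo
               SET quantity  = r.quantity,
                   extension = r.extension     -- other non-key columns likewise
             WHERE client_id         = r.client_id
               AND order_number      = r.order_number
               AND order_line_number = r.order_line_number
               AND action_number     = r.action_number;
            IF NOT FOUND THEN
                INSERT INTO actioninfo (client_id, order_number,
                                        order_line_number, action_number,
                                        quantity, extension)
                VALUES (r.client_id, r.order_number, r.order_line_number,
                        r.action_number, r.quantity, r.extension);
            END IF;
            n_rows := n_rows + 1;
        END LOOP;
        RETURN n_rows;   -- number of staged rows processed
    END;
    $$ LANGUAGE plpgsql;

    -- The whole merge then runs as a single statement (one transaction):
    --   SELECT merge_actioninfo_load();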
The first strategy here (the big joined update) uses one transaction anyway.
For the other one, it's your choice. You can either do it all in one
transaction, or in bunches of 1000 rows... but one row at a time would be
horrendously slow.

Regards,
P.F.C
On Fri, 2006-02-10 at 11:24 +0100, Markus Schaber wrote:

> For lots of non-read-only database workloads, RAID5 is a performance
> killer. RAID 1/0 might be better, or having two mirrors of two disks
> each: the first mirror holding system, swap, and the PostgreSQL WAL
> files, the second one holding the data.

I was under the impression that it is preferable to keep the WAL on its own
spindles with no other activity there, to take full advantage of the
sequential nature of the WAL writes.

That would mean one mirror for the WAL, and one for the rest. This, of course,
may sometimes be too much wasted disk space, as the WAL typically will not use
a whole disk, so you might partition this mirror into a small ext2 filesystem
for the WAL, and use the rest for files seldom accessed, such as backups.

gnari
On Fri, 2006-02-10 at 16:39, Ragnar wrote:

> On Fri, 2006-02-10 at 11:24 +0100, Markus Schaber wrote:
>
> > For lots of non-read-only database workloads, RAID5 is a performance
> > killer. RAID 1/0 might be better, or having two mirrors of two disks
> > each: the first mirror holding system, swap, and the PostgreSQL WAL
> > files, the second one holding the data.
>
> I was under the impression that it is preferable to keep the WAL on
> its own spindles with no other activity there, to take full advantage
> of the sequential nature of the WAL writes.
>
> That would mean one mirror for the WAL, and one for the rest.
> This, of course, may sometimes be too much wasted disk space, as the WAL
> typically will not use a whole disk, so you might partition this mirror
> into a small ext2 filesystem for the WAL, and use the rest for files
> seldom accessed, such as backups.

Well, on most database servers, the actual access to the OS and swap drives
should drop to about zero over time, so this is a workable solution if you've
only got enough drives / drive slots for two mirrors.
I just wanted to thank everyone for your input on my question. You've given me
a lot of tools to solve my problem here.

Orion