Performance on inserts - Mailing list pgsql-hackers
From | Jules Bean |
---|---|
Subject | Performance on inserts |
Date | |
Msg-id | 20000824110425.O17510@grommit.office.vi.net |
List | pgsql-hackers |
[Meta: Please advise if this is the wrong list. I think, since this observation relates to Pg internals, this might be the right one, but feel free to move it back to -general if I'm wrong; I subscribe to both in any case.]

As some of you will have inferred if you've read my last couple of posts, I'm working on a database with a structure like this (I'm abstracting; unfortunately I'm not allowed to show you what my client is really doing here):

A table called 'things' with two columns, 'name' and 'category'. The pair ('name','category') is a primary key. There are ~10 000 000 rows in the table, and category takes values from a more-or-less fixed set of ~1000 possibilities. As previously described, the most popular category holds around half the rows, the next most popular holds nearly half of those left, and most categories occur very rarely indeed. The median is probably around 1000 (which is less than the 10 000 you'd expect).

Anyhow, this question isn't about speeding up queries --- we already have that in the histogram thread. This question is about speeding up inserts.

My standard configuration is to have a unique index on (name,category) and a non-unique index on (category). The main table is ~2G on disk, the index on (name,cat) is about the same size, and the index on (cat) is around 0.6G. In this set-up inserts have dropped to the terrifyingly slow rate of several hours per 10 000 insertions. This is not adequate to my needs; I occasionally have to process 1 000 000 insertions or more!

I have several ideas for speeding this up at the SQL level (including inserting into a temp table and then using INSERT ... SELECT to remove the overhead of separate inserts), but that's not what I want to talk about either...

What I did today, which made a staggering difference, was to drop the non-unique index on (category). Suddenly I can insert at approximately 40 000 insertions per minute, which is fine for my needs!

So why is updating the huge (2G) unique index on (name,cat) not too much of a problem, while updating the small (600M) non-unique index on (cat) is enough to slow inserts down by around two orders of magnitude?

A possible reason has occurred to me: the real slow-down is noticeable when I'm doing a bulk insert where all the new rows belong to the most popular category. I know that some btree implementations don't behave particularly sanely with several million rows under a single key... is the btree implementation used here too slow in this case?

I haven't collected all the performance statistics I'd like to have, due to external time pressures, but I can say that under the new, faster configuration the insertion process is CPU-bound, with disk access far below the levels the machine is capable of. If I have a chance I'll collect these stats for the old method too.

Any ideas as to what's going on here are appreciated (if not, perhaps this will point you towards an optimisation you ought to make for 7.1).

Jules
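
For readers following along, here is a rough SQL sketch of the set-up and the work-around described in the post. The table and column names come from the post; the column types, the index name, and the exact bulk-load mechanism are assumptions added for illustration.

    -- Schema roughly as described (column types and index name are assumed)
    CREATE TABLE things (
        name     text NOT NULL,
        category text NOT NULL,
        PRIMARY KEY (name, category)    -- implies the unique index on (name, category)
    );
    CREATE INDEX things_category_idx ON things (category);   -- the non-unique index

    -- Work-around observed to help: drop the non-unique index before a large
    -- bulk load and rebuild it once afterwards, instead of maintaining it
    -- row by row during the inserts.
    DROP INDEX things_category_idx;
    -- ... run the ~1 000 000 INSERTs (or a COPY) here ...
    CREATE INDEX things_category_idx ON things (category);

Rebuilding the index once at the end does the same total work in a single pass, which is why it can be so much cheaper than updating the index for every inserted row.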
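The temp-table idea mentioned in passing might look something like the following; the staging table name and the data file path are hypothetical, and plain INSERTs into the staging table would work just as well as COPY.

    -- Stage the new rows first, then move them into the real table in one
    -- statement rather than many separate INSERTs.
    CREATE TEMP TABLE things_staging (name text, category text);
    COPY things_staging FROM '/tmp/new_things.dat';   -- hypothetical path
    INSERT INTO things (name, category)
        SELECT name, category FROM things_staging;
    DROP TABLE things_staging;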