Thread: performance problems inserting random vals in index
Hi,

I have a simple table with indexes on 2 integer columns. Data is inserted
very often (no updates, no deletes, just inserts): at least 4000-5000 rows
per second. The input for the 2 indexed columns is very random.

Everything is "fine" for the first 10-20M rows; after that, performance
gets worse and worse, and by 50M rows I can't insert more than 1500 rows
per second. How can I improve it?

1) The table is already partitioned; at the moment it's based on 30
partitions. What kind of "select" performance impact would I hit going to
3000 partitions? I know in the 3000-partition case there will be fewer rows
per partition, but index seek times are O(log N), right? So a select that
has to probe every partition would cost 3000*log(N/3000), which is much
bigger than 30*log(N/30).

2) How much improvement can I expect going to RAID 1+0 on 15k rpm disks
instead of RAID 5 on 7200 rpm disks? I know, I know, I shouldn't even be
testing such a huge workload with such a setup, but I can't try anything
else at the moment. The disk array is pretty fast anyway: 190 MB/s
sequential output, 320 MB/s sequential input, 1600 random seeks/s
(bonnie++).

3) Is there anything else I can try to "help" postgres update those indexes
faster?
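For reference, a minimal sketch of the kind of setup being described. All
table, column, and partition names here are my assumptions (the thread never
shows the schema), and the partitioning uses the inheritance + CHECK
constraint scheme that PostgreSQL 8.x requires:

```sql
-- Hypothetical parent table with the two very random integer columns.
CREATE TABLE events (
    a integer NOT NULL,
    b integer NOT NULL
);

-- One of the 30 partitions, done the pre-9.x way: a child table that
-- inherits the parent, with a CHECK constraint for constraint exclusion.
CREATE TABLE events_p00 (
    CHECK (a % 30 = 0)       -- assumed partitioning rule
) INHERITS (events);

-- The two per-partition indexes whose random-key maintenance is what
-- dominates the insert cost as the table grows.
CREATE INDEX events_p00_a_idx ON events_p00 (a);
CREATE INDEX events_p00_b_idx ON events_p00 (b);
```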
Leonardo F <m_lists@yahoo.it> writes:
> I have a simple table that has indexes on 2 integer columns.
> Data is inserted very often (no updates, no deletes, just inserts):
> at least 4000/5000 rows per second.
> The input for the 2 indexed columns is very random.

> Everything is "fine" for the first 10-20M rows; after that, performance
> gets worse and worse, and by 50M rows I can't insert more than
> 1500 rows per second.

> How can I improve it?

Does it help to reindex the index at that point?

> 1) the table is already partitioned; at the moment it's based on
> 30 partitions. What kind of "select" performance impact would I
> hit going to 3000 partitions?

Bad. The partitioning code isn't designed to scale beyond a few dozen
partitions.

            regards, tom lane
On Mon, Apr 19, 2010 at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Leonardo F <m_lists@yahoo.it> writes:
>> I have a simple table that has indexes on 2 integer columns.
>> Data is inserted very often (no updates, no deletes, just inserts):
>> at least 4000/5000 rows per second.
>> The input for the 2 indexed columns is very random.
>
>> Everything is "fine" for the first 10-20M rows; after that, performance
>> gets worse and worse, and by 50M rows I can't insert more than
>> 1500 rows per second.
>
>> How can I improve it?
>
> Does it help to reindex the index at that point?
>
>> 1) the table is already partitioned; at the moment it's based on
>> 30 partitions. What kind of "select" performance impact would I
>> hit going to 3000 partitions?
>
> Bad. The partitioning code isn't designed to scale beyond a few dozen
> partitions.

On a few very narrow applications I've gotten good performance in the
low hundreds. After that things fall apart quickly.
Leonardo F wrote:
> Is there anything else I can try to "help" postgres update those
> index faster?

It sounds like your problem is that read/write activity on the indexes is
becoming increasingly large and random as more data is inserted. There are
two standard ways to improve on that:

1) Periodically rebuild the indexes and hope that the new version is laid
out better than what you've ended up with after all the random key
insertions.

2) Optimize your database server configuration to perform better under this
particular situation. The usual set of tricks is to increase
shared_buffers, checkpoint_segments, and checkpoint_timeout to cut down on
the amount of time the database has to spend writing to the index blocks,
and to improve the odds that the ones it needs to read are already in its
cache.

It's hard to say whether increasing raw disk speed will help you as much as
you'd like. Index-related performance is often driven by whether the
working set needed to operate on the indexes efficiently fits in RAM. Once
you've exceeded that, performance drops really fast, and a linear increase
in disk speed may not recover very much of it. You can look at the size of
all the active indexes using something like the first query at
http://wiki.postgresql.org/wiki/Disk_Usage to get an idea how big they are
relative to RAM. Sometimes having more memory is the only good way to scale
upwards in this situation while retaining something close to the original
performance.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
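As a rough illustration of the tuning direction described above, the knobs
in question live in postgresql.conf. The values below are my illustrative
assumptions for a machine with plenty of RAM, not recommendations from the
thread; the right numbers depend on available memory and how much crash
recovery time you can tolerate:

```
# postgresql.conf -- sketch of the settings mentioned above
# (values are assumptions, tune to your hardware)
shared_buffers = 8GB                 # keep hot index blocks in cache
checkpoint_segments = 64             # fewer, larger checkpoints (pre-9.5 setting)
checkpoint_timeout = 15min           # spread checkpoint I/O over longer intervals
checkpoint_completion_target = 0.9   # smooth checkpoint writes across the interval
```

Larger, less frequent checkpoints mean a dirty index block can absorb many
random insertions in shared_buffers before it has to be written out once,
instead of being flushed repeatedly.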
> Does it help to reindex the index at that point?

Didn't try; I'd guess a reindex of such a big table would block inserts for
a long time... but I'll try it.

> Bad. The partitioning code isn't designed to scale
> beyond a few dozen partitions.

What kind of problems would I experience? Is it just N separate index
accesses, or something worse? Is it that it takes a linear amount of time
to "pick" the right partition?
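One way to sidestep the blocking concern with a plain REINDEX (which takes
an exclusive lock on writes) is to build a replacement index with CREATE
INDEX CONCURRENTLY, available since PostgreSQL 8.2, and then swap it in.
A sketch, with index and table names assumed:

```sql
-- Build a fresh, compact copy of the bloated index without blocking
-- inserts (slower than a plain CREATE INDEX, but writes keep flowing).
CREATE INDEX CONCURRENTLY events_a_idx_new ON events (a);

-- Once the build finishes, swap it in for the old index.
DROP INDEX events_a_idx;
ALTER INDEX events_a_idx_new RENAME TO events_a_idx;
```

With a partitioned setup this has to be repeated per partition, which also
means the rebuild can be staggered rather than done in one long pass.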
> On a few very narrow applications I've gotten good
> performance in the low hundreds. After that things fall
> apart quickly.

Ehm... what exactly does "fall apart quickly" mean? I can trade some
"select" speed for "insert" speed... I don't have any experience with
partitioning, so if some of you already know what's going to happen, I
could avoid testing, or at least avoid testing the wrong setup...

Thank you
> The usual set of tricks is to increase shared_buffers,
> checkpoint_segments, and checkpoint_timeout to cut down

Uh, didn't know shared_buffers played a role in index insertion as well...
got to try that. Thank you.
> The usual set of tricks is to increase shared_buffers,
> checkpoint_segments, and checkpoint_timeout to cut down

That did it: setting a much higher shared_buffers helped quite a lot.
Thank you everybody for your suggestions.
I'm sorry I have to come back to this, but the solution the list gave
helped without fully solving my problem...

To sum up: I have a simple table with indexes on 2 integer columns. Data is
inserted very often (no updates, no deletes, just inserts): at least
4000-5000 rows per second. The input for the 2 indexed columns is very
random.

I was experiencing bad insertion times past 20M rows. Some of you suggested
it might be the indexes growing too big to fit in shared_buffers, so I
raised it and, in fact, that solved the issue... but only for another 50M
rows: at roughly 70M rows I'm back to poor insertion times (disk always
100% busy). The indexes at that point are 1.7GB each, and my shared_buffers
is 10GB (the machine has 32GB). So I'd expect the indexes to fit entirely
in RAM, yet the disk array is 100% busy... What's going on? What can I try
to get back to good insertion performance?
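When sizing against shared_buffers, it's worth summing the on-disk size of
every index rather than looking at one: with partitions, it's the combined
footprint of all per-partition indexes (plus the heap blocks being written)
that competes for cache. A sketch using the standard pg_relation_size and
pg_size_pretty functions (available in 8.x):

```sql
-- List every index in the database by on-disk size, biggest first,
-- then compare the total against shared_buffers.
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
ORDER BY pg_relation_size(c.oid) DESC;
```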