Thread: Partitions and joins lead to index lookups on all partitions
Hi there,
Currently, we are running into serious performance problems with our partitioning setup, because index lookups are mostly done on all partitions, instead of the one partition where it should know it can find the needed row.
A simple example, where we have a partitioned table named part_table. So here it goes:
select * from part_table where id = 12123231
Will do an index lookup only in the partition where it knows the id can be found. However:
select * from part_table where id = (select 12123231)
Will do an index lookup in ALL partitions, meaning it is significantly slower, even more so since the database will not fit into memory.
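For anyone reproducing this: a minimal sketch of the kind of setup involved, assuming 9.0-style inheritance partitioning with range CHECK constraints on id (all table names, column names, and ranges here are illustrative, not the poster's actual schema). Constraint exclusion compares the CHECK constraints against the WHERE clause at plan time, which is why only a plan-time constant prunes partitions:

```sql
-- Hypothetical inheritance-based partitioning, as used in 8.x/9.x:
CREATE TABLE part_table (id bigint, payload text);

CREATE TABLE part_table_p0 (CHECK (id >= 0        AND id < 10000000)) INHERITS (part_table);
CREATE TABLE part_table_p1 (CHECK (id >= 10000000 AND id < 20000000)) INHERITS (part_table);

-- One index per child:
CREATE INDEX ON part_table_p0 (id);
CREATE INDEX ON part_table_p1 (id);

-- With a literal, the planner can exclude children at plan time:
EXPLAIN SELECT * FROM part_table WHERE id = 12123231;
-- With a subquery, the value is only known at run time,
-- so every child is planned for scanning:
EXPLAIN SELECT * FROM part_table WHERE id = (SELECT 12123231);
```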
So okay, we could just not use parameterized queries... Well, not so fast. Consider a second table referencing the first:
ref_table:
group_id bigint
part_table_id bigint
Now when I join the two:
select part_table.* from part_table
join ref_table on (ref_table.part_table_id = part_table.id and group_id = 12321)
It will also do index lookups on ALL partitions.
How do we handle this? The above queries are simplified versions of what is actually going on, but the idea is clear. I also tried this in 9.1 (we are currently using 9.0), but it makes no difference. So what is the practical use of partitioning if you can't even use it effectively for simple joins?
constraint_exclusion is enabled correctly, and as far as I can see, this behaviour is by the book.
Is there any progress in maybe 9.2 to make this any better? If not, how should we handle this? We could also choose not to partition, but how will that perform on a 100 GB table?
Kind regards,
Christiaan Willemsen
Hi,

On 8 December 2011 02:15, Christiaan Willemsen <cwillemsen@technocon.com> wrote:
> Currently, we are running into serious performance problems with our
> partitioning setup, because index lookups are mostly done on all partitions,
> instead of the one partition where it should know it can find the needed row.

The planner is not very smart about partitions. If an expression can't be evaluated to a constant at planning time (or you use a stable/volatile function), then you get an index/seq scan across all partitions.

> Now when I join the two:
>
> select part_table.* from part_table
> join ref_table on (ref_table.part_table_id = part_table.id and group_id = 12321)

I had to add extra WHERE conditions which help the planner pick the right partitions, i.e. where part_col between X and Y. It would be quite hard to do this in your case. You can execute another query first, like:

select part_table_id from ref_table where group_id = 12321

or:

select min(part_table_id), max(part_table_id) from ref_table where group_id = 12321

and then use IN () or BETWEEN X AND Y in the second query (so you have to execute two queries).

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
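Spelled out, Ondrej's two-step workaround might look like this (a sketch using the hypothetical schema and made-up bound values; the point is that the bounds become plan-time literals so constraint exclusion can kick in):

```sql
-- Step 1: fetch the id bounds for the group from ref_table
-- (fast, assuming an index on group_id):
SELECT min(part_table_id), max(part_table_id)
FROM ref_table
WHERE group_id = 12321;
-- Suppose this returns (12000000, 12500000).

-- Step 2: issue the main query with those bounds as literals, so the
-- planner can exclude every partition whose CHECK constraint falls
-- outside the range:
SELECT part_table.*
FROM part_table
JOIN ref_table
  ON ref_table.part_table_id = part_table.id
 AND ref_table.group_id = 12321
WHERE part_table.id BETWEEN 12000000 AND 12500000;
```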
Hi Ondrej,

Your solution had occurred to me, and will even work in some cases. But in more advanced queries, where for example I would need the group ID again to do some window function magic, this will sadly not work without again doing a reverse lookup to ref_table to find it. This scheme might still be faster, though, even if it takes more queries.

I'm now testing some of the queries against a non-partitioned version of our dataset to see what the difference is. I'm also wondering how much insert performance will be impacted when not partitioning our data. We do have a few indexes and constraints on these tables, but not a whole lot. I'll do some measurements to see how this will work out.

The general dilemma is as follows: what if the suggested maximum of 100 partitions means that a partition table will also not fit into memory efficiently, and/or the access pattern is such that, because of the query planner, it needs to work its way through all the partitions for virtually all of the serious queries being done on the data set?

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Partitions-and-joins-lead-to-index-lookups-on-all-partitions-tp5055965p5058853.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
Back again,

I did some tests with our test machine, using a difficult query doing some fancy stuff ;) I made two versions, one using partitioned data and one using unpartitioned data, both with equivalent indexes. It uses two of those big tables: one with 28 GB of data and 17 GB of indexes, one with 25 GB of data and 41 GB of indexes (both for the unpartitioned versions).

Our test machine has 32 GB of memory; short config:

maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
effective_cache_size = 22GB
work_mem = 80MB
wal_buffers = 8MB
checkpoint_segments = 16
shared_buffers = 7680MB
max_connections = 400

First I tested query performance. It turned out that the unpartitioned version was about 36 times faster, of course for the obvious reason stated in my initial post. Both fully use the indexes they have, and the partitioned version even has its indexes on SSD.

Then I did some insert tests, using generate_series to insert 100000 rows into one of the tables. It turns out that the unpartitioned version is again faster, this time 30.9 vs 1.8 seconds. This is a huge difference. For the second table, with the huge 41 GB index, it's 30.5 vs 5.2 seconds, still a big difference.

Conclusion: partitioning does not benefit us, and probably others, especially when doing lots of joins and using parameterized queries.
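For reference, the insert test described above might have looked something like this (a sketch against the hypothetical part_table schema; the actual columns, id range, and row contents are assumptions, and timings would come from psql's \timing):

```sql
-- Bulk-insert 100000 rows with sequential ids.
-- On the partitioned version, each row passes through the
-- partition-routing trigger, which is where the overhead comes from.
INSERT INTO part_table (id, payload)
SELECT g, 'row ' || g
FROM generate_series(1, 100000) AS g;
```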