Re: huge disparities in =/IN/BETWEEN performance - Mailing list pgsql-sql
From | George Pavlov |
---|---|
Subject | Re: huge disparities in =/IN/BETWEEN performance |
Date | |
Msg-id | 8C5B026B51B6854CBE88121DBF097A868160D3@ehost010-33.exch010.intermedia.net Whole thread Raw |
In response to | Re: huge disparities in =/IN/BETWEEN performance (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: huge disparities in =/IN/BETWEEN performance
(Tom Lane <tgl@sss.pgh.pa.us>)
|
List | pgsql-sql |
Thanks all for the various useful thoughts. Let me backtrack a bit and state my real underlying issue a bit with actual examples. Hope not to bore you with the length of this. Looks to me like an optimizer issue unless I am missing something. So, suppose I have a query: select * from stuff inner join ( -- just getting the distinct -- user-stuff associations -- since there may be multiple; -- ultimatelyI need to query by user select stuff_id, user_id from stuff_user group by 1,2 -- GROUP BY outperforms DISTINCT) su using (stuff_id) left join ( -- this obtains summary statistics -- about each stuff item select stuff_id, count(*) from stuff_events group by 1 ) se using (stuff_id) where user_id = 41 This is a very pared down version of what I have. And yes this specific query can be rewritten as a single GROUP BY, but in the real world I am gathering the aggregate statistics from several tables, so I actually have several sub-recordsets similar to the one called "se" above. Rewriting ALL those as a single GROUP BY is not feasible. I know, all this cries for a single summarized rollup table, but let's not go there (for now). So running the above is inefficient. This particular user_id has only one associated stuff_id and does not even have much data for that in stuff_events. The query runs in ~4600ms. Were I to query by stuff_id instead, things look great (if I change the where clause to the stuff_id it runs in 25ms). When I query based on stuff_id the optimizer uses an index on stuff_events.stuff_id. However, when I query by user_id it does a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to first figure out which stuff_ids are related to the user_id that is being asked for and then look ONLY those up in the stuff_events table using the index on stuff_id. It would seem (and this is where we get back to my original question) that one should be able to just say: select * from stuff left join (select stuff_id, count(*) from stuff_events group by 1 ) se using (stuff_id) where stuff_id in (select distinct stuff_id from stuff_user where user_id = 41 ) You'd think that the subquery in the IN would be (very quickly) resolved to a list of stuff_ids and then stuff_events would be accessed via its stuff_id index. Instead, the Seq Scan on stuff_events still happens and the query actually is even slower than the original, running in ~5500ms. So one (very ugly) way to optimize the first query is to add an extra join to stuff_user INSIDE the "se" subquery: select * from stuff inner join (select stuff_id, user_id from stuff_user group by 1,2 ) su using (stuff_id) left join (select stuff_id, user_id, count(*) from stuff_events inner join ( -- same subquery as above select stuff_id,user_id from stuff_user group by 1,2 ) su2 using (stuff_id) group by 1,2 ) se using (stuff_id) where user_id = 41; This does improve things a lot, bringing the execution time for this particular user to 3ms (!), but it is quite ugly and not fast enough for me for a user_id with lots of associated stuff_ids. George