Home > mailing lists

Re: huge disparities in =/IN/BETWEEN performance - Mailing list pgsql-sql

From	George Pavlov
Subject	Re: huge disparities in =/IN/BETWEEN performance
Date	February 9, 2007 18:22:08
Msg-id	8C5B026B51B6854CBE88121DBF097A868160D3@ehost010-33.exch010.intermedia.net Whole thread Raw
In response to	Re: huge disparities in =/IN/BETWEEN performance (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: huge disparities in =/IN/BETWEEN performance (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-sql

Tree view

Thanks all for the various useful thoughts. Let me backtrack a bit and
state my real underlying issue a bit with actual examples. Hope not to
bore you with the length of this. Looks to me like an optimizer issue
unless I am missing something.

So, suppose I have a query:

select *
from stuff
inner join  ( -- just getting the distinct   -- user-stuff associations   -- since there may be multiple;   --
ultimatelyI need to query by user  select stuff_id, user_id   from stuff_user  group by 1,2 -- GROUP BY outperforms
DISTINCT) su using (stuff_id) 
left join ( -- this obtains summary statistics   -- about each stuff item  select stuff_id, count(*)  from stuff_events
group by 1 ) se 
using (stuff_id)
where user_id = 41

This is a very pared down version of what I have. And yes this specific
query can be rewritten as a single GROUP BY, but in the real world I am
gathering the aggregate statistics from several tables, so I actually
have several sub-recordsets similar to the one called "se" above.
Rewriting ALL those as a single GROUP BY is not feasible. I know, all
this cries for a single summarized rollup table, but let's not go there
(for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to the stuff_id
it runs in 25ms). When I query based on stuff_id the optimizer uses an
index on stuff_events.stuff_id. However, when I query by user_id it does
a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to
first figure out which stuff_ids are related to the user_id that is
being asked for and then look ONLY those up in the stuff_events table
using the index on stuff_id.

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join (select stuff_id, count(*)  from stuff_events  group by 1 ) se
using (stuff_id)
where stuff_id in  (select distinct stuff_id  from stuff_user  where user_id = 41 )

You'd think that the subquery in the IN would be (very quickly) resolved
to a list of stuff_ids and then stuff_events would be accessed via its
stuff_id index. Instead, the Seq Scan on stuff_events still happens and
the query actually is even slower than the original, running in ~5500ms.

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join  (select stuff_id, user_id   from stuff_user  group by 1,2 ) su using (stuff_id)
left join (select stuff_id, user_id, count(*)  from stuff_events  inner join     ( -- same subquery as above     select
stuff_id,user_id      from stuff_user     group by 1,2      ) su2    using (stuff_id)  group by 1,2 ) se 
using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user to 3ms (!), but it is quite ugly and not fast enough for
me for a user_id with lots of associated stuff_ids.

George

pgsql-sql by date:

From: "Ezequias Rodrigues da Rocha"
Date: 09 February 2007, 17:50:22
Subject: Have anyone this man e-mail ?

From: Tom Lane
Date: 09 February 2007, 21:38:51
Subject: Re: huge disparities in =/IN/BETWEEN performance

Re: huge disparities in =/IN/BETWEEN performance - Mailing list pgsql-sql

Previous

Next