On Tue, Mar 25, 2014 at 2:00 PM, Brian Crowell <brian@fluggo.com> wrote:
Hello, it's me, a Postgres n00b again. I'm dealing with a query that scans a rather large table (94,000,000 tuples or so) and just picks out certain rows and sums them:
select dci.snapshot_time as "time",
       round(sum(dci.todays_pl)::numeric, 0) as pl
from dbo._pl_data_cache_intraday dci
where dci.snapshot_time between '2014-03-25 11:32:40.004552-05'::timestamptz
                            and '2014-03-25 12:02:40.015177-05'::timestamptz
  and dci.symbol in (select sec.symbol
                     from dbo.security_underliers sec
                     where sec.ultimate_underlier = 'SPY')
  and dci.manager = 'BJC'
  and dci.account in (select account from pl2.visible_accounts where is_fund)
group by dci.snapshot_time
order by dci.snapshot_time;
For the most part, Postgres is doing the right thing: snapshot_time is the lead column in all of the table's indexes, so it's able to pick up the source rows fairly quickly in its index scan. It's also enforcing "dci.manager = 'BJC'" in the same scan, and does a Hash Semi Join for "dci.symbol in (...)".
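(Illustrative only, not my exact definitions, but the indexes are shaped roughly like this, with snapshot_time leading:

create index ix_pl_data_cache_intraday
    on dbo._pl_data_cache_intraday (snapshot_time, manager, symbol, account);
)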
The trouble comes when enforcing the "dci.account in (...)" search condition: pl2.visible_accounts is a view that determines which accounts the current user can see, and depending on who you are, it can return several hundred accounts or none at all. Postgres estimates the output of this subquery at two rows, but in my case it's actually 240.
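For what it's worth, comparing the planner's guess to reality on just that part of the query looks roughly like this (a quick sketch, running EXPLAIN ANALYZE on the bare view):

explain analyze
select account from pl2.visible_accounts where is_fund;

The top node of that plan should show the estimated rows next to the actual count, which is where the 2-versus-240 gap shows up.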
Unfortunately, that leads the query planner to think a nested loop is cheap enough to enforce this condition, when in fact it's quite expensive.
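One way to confirm the nested loop is the culprit (purely for diagnosis, not a fix) is to steer the planner away from it for a single session and re-run the query:

set enable_nestloop = off;
explain (analyze, buffers) ...;  -- the full query from above
reset enable_nestloop;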
Can you show the explain plan for that? I can't get it to use anything but a hash join for this type of thing, even when the estimated row count for the in-list is 2, unless I disable hash joins altogether. So I'm curious how your plan differs from the ones I've dummied up.
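Ideally the full output of something like

explain (analyze, buffers) <your query here>;

so the estimated versus actual row counts are visible at each node.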