Thread: [NOVICE] Queries with Joins before filtering taking too much time! Filter(where clause) *first* -- suggestions ?

Hi:

Postgres 9.6.1 on a large linux box.

I have 2 tables under consideration:

1. property (about 800,000 rows)
This has a attribute called current_owner which is FK pointing to uid in
users table. This attribute can be null.

2. users (about 400 rows)
This has users in the system, each user has a uid.

=====
Query:

SELECT
*
FROM
property p
LEFT OUTER JOIN users ON (p.current_owner = users.uid)
WHERE
p.pid in
(SELECT pid FROM reward WHERE reward_type = 'daily'
ORDER BY reward_date DESC LIMIT 30)
;


This query takes a looong time. I tried an EXPLAIN ANALYSE and this
is what I got:

QUERY PLAN
------------------------------------------------------------------------------

Merge Cond: (users.uid = p.current_owner)
-> Index Scan using pk_users on users (cost=0.27..393.66 rows=414
width=198) (actual time=0.007..0.007 rows=1 loops=1)
-> Index Scan using idx_property_current_owner on property p
(cost=0.43..9889803.81 rows=3201653 width=775) (actual
time=10785.283..10785.2..83 rows=0 loops=1)
Filter: (SubPlan 1)
Rows Removed by Filter: 760651
SubPlan 1
-> Limit (cost=1.31..1.32 rows=1 width=16) (actual time=0.011..0.011
rows=0 loops=760651)
-> Sort (cost=1.31..1.32 rows=1 width=16) (actual time=0.008..0.008
rows=0 loops=760651)
Sort Key: reward.reward_date DESC
Sort Method: quicksort Memory: 25kB
-> Index Scan using idx_reward_reward_type on reward (cost=0.29..1.30
rows=1 width=16) (actual time=0.005..0.005 rows=0.. loops=760651)
Index Cond: (reward_type = 'daily'::text)
Planning time: 5.575 ms
Execution time: 10785.510 ms
---------------------------------

You can see the index scan on property is done first, like so:

Index Scan using idx_property_current_owner on property p
(cost=0.43..9889803.81 rows=3201653 width=775)
(actual time=10785.283..10785.2..83 rows=0 loops=1)

This take 10+ seconds. But this is wasted, since the number of property
rows are then filtered to less than 20 later on (via the WHERE clause).

Any suggestions on how to do the filtering FIRST and then do the join ?
I need to left outer join with users since the user (current_owner)
might be null.

Best,
--J



As a followup:

I noticed that there isn't any "pid" column in the reward table. So that
was an error and maybe that's why a sequential scan was being forced. To
recap, the query was:

SELECT
  *
FROM
property p
LEFT OUTER JOIN users ON (p.current_owner = users.uid)
WHERE
p.pid in
(SELECT pid FROM reward WHERE reward_type = 'daily'
ORDER BY reward_date DESC LIMIT 30)

But weirdly, the query still runs without any errors. Why ?

===

I tried this (a simpler version of the above):

SELECT * from property
WHERE pid in (
   SELECT pid FROM reward WHERE reward_type = 'DAILYPROMO_WIN')
;

And this also runs (again taking a long time) but without any errors.
How can this be ? (there is no pid in the reward table, so why does the
subquery not throw an error)? In fact, because there was no error, I
didn't notice my typo (no pid in reward table) until right now...

Best,
--J


On Sunday, January 15, 2017, Hursh Jain <hurshj@gmail.com> wrote:

SELECT * from property
WHERE pid in (
  SELECT pid FROM reward WHERE reward_type = 'DAILYPROMO_WIN')
;

And this also runs (again taking a long time) but without any errors. How can this be ? (there is no pid in the reward table, so why does the
subquery not throw an error)?


The "pid" column is the one from the outer query property table - this is a correlated subquery that you've written. 

As long as one record with that reward_type exists your query devolves to "where true" thus making indexes useless and returning every row in property.

David J.

Thanks David.

I didn't realize that the column names in the subquery would be scoped
to the outer query table. I thought one *had* to use an alias for that
and by default the column names were always scoped to the inner query
(and would error out otherwise).

Definitely learnt something new today..thanks again for the reply!

Best,
--H

David G. Johnston wrote:
> On Sunday, January 15, 2017, Hursh Jain <hurshj@gmail.com
> <mailto:hurshj@gmail.com>> wrote:
>
>
>     SELECT * from property
>     WHERE pid in (
>        SELECT pid FROM reward WHERE reward_type = 'DAILYPROMO_WIN')
>     ;
>
>     And this also runs (again taking a long time) but without any
>     errors. How can this be ? (there is no pid in the reward table, so
>     why does the
>     subquery not throw an error)?
>
>
>
> The "pid" column is the one from the outer query property table - this
> is a correlated subquery that you've written.
>
> As long as one record with that reward_type exists your query devolves
> to "where true" thus making indexes useless and returning every row in
> property.
>
> David J.
>