Thread: Index is not used for "IN (non-correlated subquery)"

Index is not used for "IN (non-correlated subquery)"

From
George
Date:
My use case:

I have a table which I expect to reach a size of more than 10M rows.
This table will have a column "partner_uuid" which will have a maximum
envisioned cardinality of 10.

I want different users of my web application to see different subsets
of that table. I am using row-level security for this, with a USING
clause similar to this:

        partner_uuid in (
            select p.uuid
            from userpartner up
            join partner p on p.id = up.partner_id
            where up.user_id::varchar = get_parameter('WEB_LOGGED_IN_USER_ID')
        )

I want to make sure that when running SELECTs the index of the
partner_uuid column will be used. It appears though that it is not
being used. Is there some way to make the query planner use the index
for this case or will I always have to run a query to load the allowed
partner_uuids, add them to my query so that they are hardcoded, and
then finally run the query so that it uses the index?

For example, compare the following simplified and similar two cases,
one of which uses the index and one which does not:

explain  select * from wg3ppbm_transaction where partner_uuid in
('0f50ce66-6dcf-11e6-8b77-86f30ca893d3');

"Index Scan using wg3ppbm_transaction_f9b3d985 on wg3ppbm_transaction
(cost=0.28..227.67 rows=323 width=482)"
"  Index Cond: ((partner_uuid)::text =
'0f50ce66-6dcf-11e6-8b77-86f30ca893d3'::text)"


explain select * from wg3ppbm_transaction where partner_uuid in (
            select p.uuid
            from wg3ppbm_userpartner up
            join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
"  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
"  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
"  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
"        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
"              Join Filter: (up.partner_id = p.id)"
"              ->  Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4)"
"              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
rows=2 width=41)"


Re: Index is not used for "IN (non-correlated subquery)"

From
Tom Lane
Date:
George <pinkisntwell@gmail.com> writes:
> explain select * from wg3ppbm_transaction where partner_uuid in (
>             select p.uuid
>             from wg3ppbm_userpartner up
>             join wg3ppbm_partner p on p.id = up.partner_id
> );

> "Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
> "              Join Filter: (up.partner_id = p.id)"
> "              ->  Seq Scan on wg3ppbm_userpartner up
> (cost=0.00..1.01 rows=1 width=4)"
> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
> rows=2 width=41)"

This plan is expecting to have to return about half of the rows in
wg3ppbm_transaction, a situation for which an indexscan would NOT
be a better choice.  The usual rule of thumb is that you need to be
retrieving at most one or two percent of a table's rows for an indexscan
on it to be faster than a seqscan.

I think however that the "half" may be a default estimate occasioned
by the other tables being empty and therefore not having any statistics.
Another rule of thumb is that the plans you get for tiny tables have
little to do with what happens once there's lots of data.

            regards, tom lane


Re: Index is not used for "IN (non-correlated subquery)"

From
Merlin Moncure
Date:
On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> George <pinkisntwell@gmail.com> writes:
>> explain select * from wg3ppbm_transaction where partner_uuid in (
>>             select p.uuid
>>             from wg3ppbm_userpartner up
>>             join wg3ppbm_partner p on p.id = up.partner_id
>> );
>
>> "Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
>> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
>> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
>> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
>> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
>> "              Join Filter: (up.partner_id = p.id)"
>> "              ->  Seq Scan on wg3ppbm_userpartner up
>> (cost=0.00..1.01 rows=1 width=4)"
>> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
>> rows=2 width=41)"
>
> This plan is expecting to have to return about half of the rows in
> wg3ppbm_transaction, a situation for which an indexscan would NOT
> be a better choice.  The usual rule of thumb is that you need to be
> retrieving at most one or two percent of a table's rows for an indexscan
> on it to be faster than a seqscan.
>
> I think however that the "half" may be a default estimate occasioned
> by the other tables being empty and therefore not having any statistics.
> Another rule of thumb is that the plans you get for tiny tables have
> little to do with what happens once there's lots of data.

Yeah, don't make query plan assumptions against empty or nearly empty
tables.  As the data grows, the plans will suitably change.  Perhaps
OP just recently loaded a bunch of data and the tables haven't been
analyzed yet?

merlin


Re: Index is not used for "IN (non-correlated subquery)"

From
George
Date:
On Wed, Nov 30, 2016 at 6:45 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> George <pinkisntwell@gmail.com> writes:
>>> explain select * from wg3ppbm_transaction where partner_uuid in (
>>>             select p.uuid
>>>             from wg3ppbm_userpartner up
>>>             join wg3ppbm_partner p on p.id = up.partner_id
>>> );
>>
>>> "Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
>>> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
>>> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
>>> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
>>> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
>>> "              Join Filter: (up.partner_id = p.id)"
>>> "              ->  Seq Scan on wg3ppbm_userpartner up
>>> (cost=0.00..1.01 rows=1 width=4)"
>>> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
>>> rows=2 width=41)"
>>
>> This plan is expecting to have to return about half of the rows in
>> wg3ppbm_transaction, a situation for which an indexscan would NOT
>> be a better choice.  The usual rule of thumb is that you need to be
>> retrieving at most one or two percent of a table's rows for an indexscan
>> on it to be faster than a seqscan.
>>
>> I think however that the "half" may be a default estimate occasioned
>> by the other tables being empty and therefore not having any statistics.
>> Another rule of thumb is that the plans you get for tiny tables have
>> little to do with what happens once there's lots of data.
>
> Yeah, don't make query plan assumptions against empty or nearly empty
> tables.  As the data grows, the plans will suitably change.  Perhaps
> OP just recently loaded a bunch of data and the tables haven't been
> analyzed yet?

I just added a significant number of rows to the table. I now have
1.3M rows in total but only 8K rows that contain the value I am
seeking. I also ran ANALYZE after loading the data. The query plans
for the two queries did not change. Also, the simple query returns in
45 ms while the one with the subquery needs 1.5 s, i.e. it is about
30x slower.

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.


Re: Index is not used for "IN (non-correlated subquery)"

From
Merlin Moncure
Date:
On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:
> On Wed, Nov 30, 2016 at 6:45 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> George <pinkisntwell@gmail.com> writes:
>>>> explain select * from wg3ppbm_transaction where partner_uuid in (
>>>>             select p.uuid
>>>>             from wg3ppbm_userpartner up
>>>>             join wg3ppbm_partner p on p.id = up.partner_id
>>>> );
>>>
>>>> "Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
>>>> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
>>>> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
>>>> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
>>>> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
>>>> "              Join Filter: (up.partner_id = p.id)"
>>>> "              ->  Seq Scan on wg3ppbm_userpartner up
>>>> (cost=0.00..1.01 rows=1 width=4)"
>>>> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
>>>> rows=2 width=41)"
>>>
>>> This plan is expecting to have to return about half of the rows in
>>> wg3ppbm_transaction, a situation for which an indexscan would NOT
>>> be a better choice.  The usual rule of thumb is that you need to be
>>> retrieving at most one or two percent of a table's rows for an indexscan
>>> on it to be faster than a seqscan.
>>>
>>> I think however that the "half" may be a default estimate occasioned
>>> by the other tables being empty and therefore not having any statistics.
>>> Another rule of thumb is that the plans you get for tiny tables have
>>> little to do with what happens once there's lots of data.
>>
>> Yeah, don't make query plan assumptions against empty or nearly empty
>> tables.  As the data grows, the plans will suitably change.  Perhaps
>> OP just recently loaded a bunch of data and the tables haven't been
>> analyzed yet?
>
> I just added a significant number of rows to the table. I now have
> 1.3M rows in total but only 8K rows that contain the value I am
> seeking. I also ran ANALYZE after loading the data. The query plans
> for the two queries did not change. Also, the simple query returns in
> 45 ms while the one with the subquery needs 1.5 s, i.e. it is about
> 30x slower.
>
> So there is definitely something wrong here. This situation makes many
> row-level security use cases cumbersome since you need to have
> almost the same WHERE clause both in the row-level security policy and
> in every SELECT query in order for the index to be used.

can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

merlin


Re: Index is not used for "IN (non-correlated subquery)"

From
Tom Lane
Date:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:
>> So there is definitely something wrong here. This situation makes many
>> row-level security use cases cumbersome since you need to have
>> almost the same WHERE clause both in the row-level security policy and
>> in every SELECT query in order for the index to be used.

> can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?

Planning for queries affected by RLS is definitely an area where we need
to improve (I'm working on a patch for that).  Whether the OP's particular
query is being hit by that is impossible to tell, though, since there
isn't any actual RLS usage in the doubtless-oversimplified example.

            regards, tom lane


Re: Index is not used for "IN (non-correlated subquery)"

From
George
Date:
On Wed, Nov 30, 2016 at 8:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:
>>> So there is definitely something wrong here. This situation makes many
>>> row-level security use cases cumbersome since you need to have
>>> almost the same WHERE clause both in the row-level security policy and
>>> in every SELECT query in order for the index to be used.
>
>> can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?
>
> Planning for queries affected by RLS is definitely an area where we need
> to improve (I'm working on a patch for that).  Whether the OP's particular
> query is being hit by that is impossible to tell, though, since there
> isn't any actual RLS usage in the doubtless-oversimplified example.

The example is not over-simplified, I basically just took the clause
that the RLS would have to add and stuck it in the WHERE. Thus I
verified that even the normal, non-RLS planner is affected.

When I get to work tomorrow morning (Europe) I will post the EXPLAIN
ANALYZE output.


Re: Index is not used for "IN (non-correlated subquery)"

From
George
Date:
On Wed, Nov 30, 2016 at 10:08 PM, George <pinkisntwell@gmail.com> wrote:
> On Wed, Nov 30, 2016 at 8:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>>> On Wed, Nov 30, 2016 at 11:05 AM, George <pinkisntwell@gmail.com> wrote:
>>>> So there is definitely something wrong here. This situation makes many
>>>> row-level security use cases cumbersome since you need to have
>>>> almost the same WHERE clause both in the row-level security policy and
>>>> in every SELECT query in order for the index to be used.
>>
>>> can you give EXPLAIN ANALYZE for the 'good' query and the 'bad' query?
>>
>> Planning for queries affected by RLS is definitely an area where we need
>> to improve (I'm working on a patch for that).  Whether the OP's particular
>> query is being hit by that is impossible to tell, though, since there
>> isn't any actual RLS usage in the doubtless-oversimplified example.
>
> The example is not over-simplified, I basically just took the clause
> that the RLS would have to add and stuck it in the WHERE. Thus I
> verified that even the normal, non-RLS planner is affected.
>
> When I get to work tomorrow morning (Europe) I will post the EXPLAIN
> ANALYZE output.

Here are the EXPLAIN ANALYZE results:

explain analyze
select *
from wg3ppbm_transaction
where partner_uuid in ('80228212-2247-4bdd-a130-80239cb33c5c');

"Index Scan using wg3ppbm_transaction_f9b3d985 on wg3ppbm_transaction
(cost=0.43..2838.57 rows=8186 width=380) (actual time=0.458..5.265
rows=7827 loops=1)"
"  Index Cond: ((partner_uuid)::text =
'80228212-2247-4bdd-a130-80239cb33c5c'::text)"
"Planning time: 0.155 ms"
"Execution time: 6.992 ms"


explain analyze select *
from wg3ppbm_transaction where partner_uuid in (
            select p.uuid
            from wg3ppbm_userpartner up
            join wg3ppbm_partner p on p.id = up.partner_id
);

"Hash Semi Join  (cost=2.07..65628.14 rows=663727 width=380) (actual
time=0.346..1542.730 rows=1 loops=1)"
"  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
"  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..54757.54
rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
loops=1)"
"  ->  Hash  (cost=2.06..2.06 rows=1 width=37) (actual
time=0.017..0.017 rows=1 loops=1)"
"        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
"        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37) (actual
time=0.011..0.012 rows=1 loops=1)"
"              Join Filter: (up.partner_id = p.id)"
"              Rows Removed by Join Filter: 1"
"              ->  Seq Scan on wg3ppbm_userpartner up
(cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
loops=1)"
"              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"
"Planning time: 1.484 ms"
"Execution time: 1542.799 ms"


Re: Index is not used for "IN (non-correlated subquery)"

From
Tom Lane
Date:
George <pinkisntwell@gmail.com> writes:
> explain analyze select *
> from wg3ppbm_transaction where partner_uuid in (
>             select p.uuid
>             from wg3ppbm_userpartner up
>             join wg3ppbm_partner p on p.id = up.partner_id
> );

> "Hash Semi Join  (cost=2.07..65628.14 rows=663727 width=380) (actual
> time=0.346..1542.730 rows=1 loops=1)"
> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..54757.54
> rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
> loops=1)"

So you're still getting the 50% default estimate, which is why it doesn't
want to use the index ...

> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37) (actual
> time=0.017..0.017 rows=1 loops=1)"
> "        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37) (actual
> time=0.011..0.012 rows=1 loops=1)"
> "              Join Filter: (up.partner_id = p.id)"
> "              Rows Removed by Join Filter: 1"
> "              ->  Seq Scan on wg3ppbm_userpartner up
> (cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
> loops=1)"
> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
> rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"

... and you still don't have any meaningful number of rows in
wg3ppbm_userpartner or wg3ppbm_partner.  However, I don't understand how
it knows that there's only one or two rows in those tables and yet is
producing the stupid default estimate for the semijoin.  I spent some time
trying to duplicate that behavior, without success.  What PG version is
that, exactly?  Have you vacuumed and/or analyzed those two tables?  What
do you get for

select * from pg_stats where tablename = 'wg3ppbm_userpartner';

and likewise for wg3ppbm_partner?

            regards, tom lane


Re: Index is not used for "IN (non-correlated subquery)"

From
George
Date:
On Thu, Dec 1, 2016 at 6:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> George <pinkisntwell@gmail.com> writes:
>> explain analyze select *
>> from wg3ppbm_transaction where partner_uuid in (
>>             select p.uuid
>>             from wg3ppbm_userpartner up
>>             join wg3ppbm_partner p on p.id = up.partner_id
>> );
>
>> "Hash Semi Join  (cost=2.07..65628.14 rows=663727 width=380) (actual
>> time=0.346..1542.730 rows=1 loops=1)"
>> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
>> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..54757.54
>> rows=1327454 width=380) (actual time=0.004..878.568 rows=1327587
>> loops=1)"
>
> So you're still getting the 50% default estimate, which is why it doesn't
> want to use the index ...
>
>> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37) (actual
>> time=0.017..0.017 rows=1 loops=1)"
>> "        Buckets: 1024  Batches: 1  Memory Usage: 5kB"
>> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37) (actual
>> time=0.011..0.012 rows=1 loops=1)"
>> "              Join Filter: (up.partner_id = p.id)"
>> "              Rows Removed by Join Filter: 1"
>> "              ->  Seq Scan on wg3ppbm_userpartner up
>> (cost=0.00..1.01 rows=1 width=4) (actual time=0.004..0.004 rows=1
>> loops=1)"
>> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
>> rows=2 width=41) (actual time=0.001..0.001 rows=2 loops=1)"
>
> ... and you still don't have any meaningful number of rows in
> wg3ppbm_userpartner or wg3ppbm_partner.  However, I don't understand how
> it knows that there's only one or two rows in those tables and yet is
> producing the stupid default estimate for the semijoin.  I spent some time
> trying to duplicate that behavior, without success.  What PG version is
> that, exactly?

"PostgreSQL 9.5.5 on i686-pc-linux-gnu, compiled by gcc (GCC) 4.4.7
20120313 (Red Hat 4.4.7-17), 32-bit"

> Have you vacuumed and/or analyzed those two tables?

Yes.

> What
> do you get for
>
> select * from pg_stats where tablename = 'wg3ppbm_userpartner';
> and likewise for wg3ppbm_partner?

It is a wide table. Do you want me to dump csv here?

In the meantime, with the help of the folks at #postgresql I was able
to wisen up the query planner by using either one of the following two
settings:

 SET enable_seqscan = false

 SET cpu_tuple_cost = 0.1

I think this should be helpful.


Re: Index is not used for "IN (non-correlated subquery)"

From
Tom Lane
Date:
George <pinkisntwell@gmail.com> writes:
> On Thu, Dec 1, 2016 at 6:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What
>> do you get for
>> select * from pg_stats where tablename = 'wg3ppbm_userpartner';
>> and likewise for wg3ppbm_partner?

> It is a wide table. Do you want me to dump csv here?

Shouldn't be *that* wide, with only one row in the underlying table ;-)
Maybe psql \x format would be suitable.

> In the meantime, with the help of the folks at #postgresql I was able
> to wisen up the query planner by using either one of the following two
> settings:
>  SET enable_seqscan = false
>  SET cpu_tuple_cost = 0.1

Well, that proves it is considering the indexscan option.  But there is
something funny going on, if you have stats for these tables and yet
you're getting a default rowcount estimate.

            regards, tom lane