Thread: DISTINCT on jsonb fields and Indexes

DISTINCT on jsonb fields and Indexes

From

Sankar P

Date:

22 June 2020, 04:43:37

I have a table with the schema:

CREATE TABLE fluent (id BIGSERIAL, record JSONB);

Then I created a couple of indexes:
1) CREATE INDEX idx_records ON fluent USING GIN (record);
2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->>
'destinationServiceName'));

Now, if I run a query to look up the distinct values of the field
`destinationServiceName`, via:

select distinct record ->> 'destinationServiceName' from fluent;

This query takes a lot of time, and does full table scan. The query planning is:

# explain analyze select distinct record ->> 'destinationServiceName'
from fluent;
                                                            QUERY PLAN
                                              Unique
(cost=1103803.97..1118803.97 rows=3000000 width=32) (actual
time=77282.528..78549.877 rows=10 loops=1)
   ->  Sort  (cost=1103803.97..1111303.97 rows=3000000 width=32)
(actual time=77282.525..78046.992 rows=3000000 loops=1)
         Sort Key: ((record ->> 'destinationServiceName'::text))
         Sort Method: external merge  Disk: 117456kB
         ->  Seq Scan on fluent  (cost=0.00..637500.00 rows=3000000
width=32) (actual time=14.440..69545.867 rows=3000000 loops=1)
 Planning Time: 0.187 ms
 Execution Time: 78574.221 ms

I see that none of the indexes are used. I want to do a few
aggregations, like "what are the distinct pairs of
`destinationServiceName` and `sourceServiceName` etc. " in these
records. Now, is such a querying possible at all without doing full
table scans ? I get such kind of aggregation support in elasticsearch
+ kibana, without doing full-table scan (I believe so, but I do not
have data to back this claim) and I am trying to see if this is
possible with any other extra index creation in postgres.

Any suggestions ? Thanks.

-- 
Sankar P
http://psankar.blogspot.com

Re: DISTINCT on jsonb fields and Indexes

From

David Rowley

Date:

22 June 2020, 07:13:33

On Mon, 22 Jun 2020 at 16:44, Sankar P <sankar.curiosity@gmail.com> wrote:
> select distinct record ->> 'destinationServiceName' from fluent;

> This query takes a lot of time, and does full table scan. The query planning is:

> I see that none of the indexes are used. I want to do a few
> aggregations, like "what are the distinct pairs of
> `destinationServiceName` and `sourceServiceName` etc. " in these
> records. Now, is such a querying possible at all without doing full
> table scans ? I get such kind of aggregation support in elasticsearch
> + kibana, without doing full-table scan (I believe so, but I do not
> have data to back this claim) and I am trying to see if this is
> possible with any other extra index creation in postgres.

There is some work in progress to improve this type of query, but
it'll be at least PG14 before we see that.

For your version, you might want to look at
https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the
proposed solutions from there.

David

Re: DISTINCT on jsonb fields and Indexes

From

Sankar P

Date:

22 June 2020, 07:21:32

>
> There is some work in progress to improve this type of query, but
> it'll be at least PG14 before we see that.

oh okay.

>
> For your version, you might want to look at
> https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the
> proposed solutions from there.

Thanks a lot :)

-- 
Sankar P
http://psankar.blogspot.com

Re: DISTINCT on jsonb fields and Indexes

From

Michael Lewis

Date:

22 June 2020, 16:45:44

On Sun, Jun 21, 2020 at 10:43 PM Sankar P <sankar.curiosity@gmail.com> wrote:

I have a table with the schema:

CREATE TABLE fluent (id BIGSERIAL, record JSONB);

Then I created a couple of indexes:
1) CREATE INDEX idx_records ON fluent USING GIN (record);

What about using non-default jsonb_path_ops?

2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->>
'destinationServiceName'));

Is this key always set? If so, make it a proper column so you get statistics on common values and number of distinct values as such.

If it is rarely used, create a partial index perhaps. I am a little surprised that the plain btree index wasn't used from my naive point of view. Did you check execution time with sequential scan disabled to try to strongly encourage the use of index scan?

Re: DISTINCT on jsonb fields and Indexes

From

Tom Lane

Date:

22 June 2020, 17:05:14

Michael Lewis <mlewis@entrata.com> writes:
> On Sun, Jun 21, 2020 at 10:43 PM Sankar P <sankar.curiosity@gmail.com>
> wrote:
>> 2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->>
>> 'destinationServiceName'));

> If it is rarely used, create a partial index perhaps. I am a little
> surprised that the plain btree index wasn't used from my naive point of
> view.

It's not terribly surprising unfortunately.  The planner will seldom
recognize that an expression index is good for anything except
searches, that is "WHERE indexed_expression indexable_operator constant".
There's some mention of this at

https://www.postgresql.org/docs/current/indexes-index-only-scans.html

The core of the problem is not wanting to expend cycles on trying to
match every subexpression in the query to every index expression;
so at least in early planning stages, only potentially-indexable
subexpressions of WHERE clauses get matched to indexes.  In the
example of "select distinct expression", the planner will never notice
that that expression has anything to do with an index.

            regards, tom lane

Re: DISTINCT on jsonb fields and Indexes

From

Michael Lewis

Date:

22 June 2020, 20:10:19

In the example of "select distinct expression", the planner will never notice
that that expression has anything to do with an index.

Thanks for that explanation. I assume re-writing as a 'group by' would have no bearing on that planner decision.

Re: DISTINCT on jsonb fields and Indexes

From

Tom Lane

Date:

22 June 2020, 20:41:17

Michael Lewis <mlewis@entrata.com> writes:
>> In the example of "select distinct expression", the planner will never
>> notice that that expression has anything to do with an index.

> Thanks for that explanation. I assume re-writing as a 'group by' would have
> no bearing on that planner decision.

Hmm ... actually, now that you mention it, it might.  Using GROUP BY
moves the expression into the category of "things we might want to
sort on", and we do pay attention to indexes for that.  Trying it
just now, I find that I can get a index scan but not an index-only
scan for something like "select expr from tab group by expr".
Not sure if that will move the needle enough to help for the
original problem.

            regards, tom lane

Re: DISTINCT on jsonb fields and Indexes

From

Tom Lane

Date:

22 June 2020, 22:00:54

Michael Lewis <mlewis@entrata.com> writes:
>> In the example of "select distinct expression", the planner will never
>> notice that that expression has anything to do with an index.

> Thanks for that explanation. I assume re-writing as a 'group by' would have
> no bearing on that planner decision.

Hmm ... actually, now that you mention it, it might.  Using GROUP BY
moves the expression into the category of "things we might want to
sort on", and we do pay attention to indexes for that.  Trying it
just now, I find that I can get a index scan but not an index-only
scan for something like "select expr from tab group by expr".
Not sure if that will move the needle enough to help for the
original problem.

            regards, tom lane