Thread: DISTINCT on jsonb fields and Indexes
I have a table with the schema: CREATE TABLE fluent (id BIGSERIAL, record JSONB); Then I created a couple of indexes: 1) CREATE INDEX idx_records ON fluent USING GIN (record); 2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->> 'destinationServiceName')); Now, if I run a query to look up the distinct values of the field `destinationServiceName`, via: select distinct record ->> 'destinationServiceName' from fluent; This query takes a lot of time, and does full table scan. The query planning is: # explain analyze select distinct record ->> 'destinationServiceName' from fluent; QUERY PLAN Unique (cost=1103803.97..1118803.97 rows=3000000 width=32) (actual time=77282.528..78549.877 rows=10 loops=1) -> Sort (cost=1103803.97..1111303.97 rows=3000000 width=32) (actual time=77282.525..78046.992 rows=3000000 loops=1) Sort Key: ((record ->> 'destinationServiceName'::text)) Sort Method: external merge Disk: 117456kB -> Seq Scan on fluent (cost=0.00..637500.00 rows=3000000 width=32) (actual time=14.440..69545.867 rows=3000000 loops=1) Planning Time: 0.187 ms Execution Time: 78574.221 ms I see that none of the indexes are used. I want to do a few aggregations, like "what are the distinct pairs of `destinationServiceName` and `sourceServiceName` etc. " in these records. Now, is such a querying possible at all without doing full table scans ? I get such kind of aggregation support in elasticsearch + kibana, without doing full-table scan (I believe so, but I do not have data to back this claim) and I am trying to see if this is possible with any other extra index creation in postgres. Any suggestions ? Thanks. -- Sankar P http://psankar.blogspot.com
On Mon, 22 Jun 2020 at 16:44, Sankar P <sankar.curiosity@gmail.com> wrote: > select distinct record ->> 'destinationServiceName' from fluent; > This query takes a lot of time, and does full table scan. The query planning is: > I see that none of the indexes are used. I want to do a few > aggregations, like "what are the distinct pairs of > `destinationServiceName` and `sourceServiceName` etc. " in these > records. Now, is such a querying possible at all without doing full > table scans ? I get such kind of aggregation support in elasticsearch > + kibana, without doing full-table scan (I believe so, but I do not > have data to back this claim) and I am trying to see if this is > possible with any other extra index creation in postgres. There is some work in progress to improve this type of query, but it'll be at least PG14 before we see that. For your version, you might want to look at https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the proposed solutions from there. David
> > There is some work in progress to improve this type of query, but > it'll be at least PG14 before we see that. oh okay. > > For your version, you might want to look at > https://wiki.postgresql.org/wiki/Loose_indexscan and try some of the > proposed solutions from there. Thanks a lot :) -- Sankar P http://psankar.blogspot.com
On Sun, Jun 21, 2020 at 10:43 PM Sankar P <sankar.curiosity@gmail.com> wrote:
I have a table with the schema:
CREATE TABLE fluent (id BIGSERIAL, record JSONB);
Then I created a couple of indexes:
1) CREATE INDEX idx_records ON fluent USING GIN (record);
What about using non-default jsonb_path_ops?
2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->>
'destinationServiceName'));
Is this key always set? If so, make it a proper column so you get statistics on common values and number of distinct values as such.
If it is rarely used, create a partial index perhaps. I am a little surprised that the plain btree index wasn't used from my naive point of view. Did you check execution time with sequential scan disabled to try to strongly encourage the use of index scan?
Michael Lewis <mlewis@entrata.com> writes: > On Sun, Jun 21, 2020 at 10:43 PM Sankar P <sankar.curiosity@gmail.com> > wrote: >> 2) CREATE INDEX idx_btree_records ON fluent USING BTREE ((record ->> >> 'destinationServiceName')); > If it is rarely used, create a partial index perhaps. I am a little > surprised that the plain btree index wasn't used from my naive point of > view. It's not terribly surprising unfortunately. The planner will seldom recognize that an expression index is good for anything except searches, that is "WHERE indexed_expression indexable_operator constant". There's some mention of this at https://www.postgresql.org/docs/current/indexes-index-only-scans.html The core of the problem is not wanting to expend cycles on trying to match every subexpression in the query to every index expression; so at least in early planning stages, only potentially-indexable subexpressions of WHERE clauses get matched to indexes. In the example of "select distinct expression", the planner will never notice that that expression has anything to do with an index. regards, tom lane
In the example of "select distinct expression", the planner will never notice
that that expression has anything to do with an index.
Thanks for that explanation. I assume re-writing as a 'group by' would have no bearing on that planner decision.
Michael Lewis <mlewis@entrata.com> writes: >> In the example of "select distinct expression", the planner will never >> notice that that expression has anything to do with an index. > Thanks for that explanation. I assume re-writing as a 'group by' would have > no bearing on that planner decision. Hmm ... actually, now that you mention it, it might. Using GROUP BY moves the expression into the category of "things we might want to sort on", and we do pay attention to indexes for that. Trying it just now, I find that I can get a index scan but not an index-only scan for something like "select expr from tab group by expr". Not sure if that will move the needle enough to help for the original problem. regards, tom lane
Michael Lewis <mlewis@entrata.com> writes: >> In the example of "select distinct expression", the planner will never >> notice that that expression has anything to do with an index. > Thanks for that explanation. I assume re-writing as a 'group by' would have > no bearing on that planner decision. Hmm ... actually, now that you mention it, it might. Using GROUP BY moves the expression into the category of "things we might want to sort on", and we do pay attention to indexes for that. Trying it just now, I find that I can get a index scan but not an index-only scan for something like "select expr from tab group by expr". Not sure if that will move the needle enough to help for the original problem. regards, tom lane