Re: Feature request for adoptive indexes - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Feature request for adoptive indexes
Date
Msg-id f1f33e93-9fe6-b27c-9dc2-367595d24f49@enterprisedb.com
Whole thread Raw
In response to Re: Feature request for adoptive indexes  (Hayk Manukyan <manukyantt@gmail.com>)
List pgsql-hackers
On 11/1/21 1:24 PM, Hayk Manukyan wrote:
> I agree with the above mentioned.  
> The only concern I have is that we compare little wrong things.
> For read we should compare  
>  (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp,
> year,  issue_flag  ) VS  (job, nlp, year, sequence, Scan_ID, issue_flag)
> OR  (job, nlp, year INCLUDE(sequence, Scan_ID, issue_flag) )
> Because our proposed index for reading should be closer to a combination
> of those 3 and we have current solutions like index on all or with
> Include statement. 

I don't follow.

The whole point of the experiment was to show the gap between a "best
case" and "worst case" alternatives, with the assumption the gap would
be substantial and the new index type might get close to the best case.

Are you suggesting those are not the actual best/worst cases and we
should use some other indexes? If yes, which ones?


IMHO those best/worst cases are fine because:

1) best case (job, nlp, year, sequence)

I don't see how we could get anything better for queries on "sequence"
than this index, because that's literally one of the indexes that would
be included in the whole index.

Yes, if you need to support queries on additional columns, you might
need more indexes, but that's irrelevant - why would anyone define those
indexes, when the "worst case" btree index with all the columns is so
close to the best case?


2) worst case (job, nlp, year, scan_id, issue_flag, sequence)

I think an index with INCLUDE is entirely irrelevant here. The reason to
use INCLUDE is to define UNIQUE index on a subset of columns, but that's
not what we need here. I repeated the benchmark with such index, and the
timing is ~150ms, so about 50% slower than the simple index. Sorting on
all columns is clearly beneficial even for the last column.


So I still think those best/worst cases are sensible, and the proposed
index would need to beat the worst case. Which seems challenging,
considering how close it is to the best case. Or it might break the best
case, if there's some sort of revolutionary way to store the small
indexes or something like that.

The fact that there's no size difference between the two cases is mostly
a coincidence, due to the columns being just 2B each, and with wider
values the difference might be substantial, making the gap larger. But
then the new index would have to improve on this, but there's no
proposal on how to do that.


> We should try to find a gap between these three cases.
> For DML queries 
>  (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year
> INCLUDE(sequence, Scan_ID, issue_flag) ) VS  (job, nlp, year, sequence)
> AND (job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag  )
> Because again the proposed index should be just one and cover all 3
> separate ones. 
> 
> If you agree with these cases I will try to find a bigger time frame to
> compare these two cases deeper. 
>
> The issue is not high prio but I strongly believe it can help and can be
> nice feature for even more complicated cases.
> 

You don't need my approval to run benchmarks etc. If you believe this is
beneficial then just do the tests and you'll see if it makes sense ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Antonin Houska
Date:
Subject: Re: storing an explicit nonce
Next
From: Fujii Masao
Date:
Subject: Re: [PATCH] Added TRANSFORM FOR for COMMENT tab completion