Re: WIP: Covering + unique indexes. - Mailing list pgsql-hackers

From Anastasia Lubennikova
Subject Re: WIP: Covering + unique indexes.
Date
Msg-id 569530FC.2010900@postgrespro.ru
Whole thread Raw
In response to Re: WIP: Covering + unique indexes.  (David Rowley <david.rowley@2ndquadrant.com>)
Responses Re: WIP: Covering + unique indexes.  (Jeff Janes <jeff.janes@gmail.com>)
Re: WIP: Covering + unique indexes.  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
08.01.2016 00:12, David Rowley:
On 7 January 2016 at 06:36, Jeff Janes <jeff.janes@gmail.com> wrote:
On Tue, Jan 5, 2016 at 11:55 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> create table ab (a int,b int);
> insert into ab select x,y from generate_series(1,20) x(x),
> generate_series(10,1,-1) y(y);
> create index on ab (a) including (b);
> explain select * from ab order by a,b;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Sort  (cost=10.64..11.14 rows=200 width=8)
>    Sort Key: a, b
>    ->  Seq Scan on ab  (cost=0.00..3.00 rows=200 width=8)
> (3 rows)

If you set enable_sort=off, then you get the index-only scan with no
sort.  So it believes the index can be used for ordering (correctly, I
think), just sometimes it thinks it is not faster to do it that way.

I'm not sure why this would be a correctness problem.  The covered
column does not participate in uniqueness checks, but it still usually
participates in index ordering.  (That is why dummy op-classes are
needed if you want to include non-sortable-type columns as being
covered.)

If that's the case, then it appears that I've misunderstood INCLUDING. From reading _bt_doinsert() it appeared that it'll ignore the INCLUDING columns and just find the insert position based on the key columns. Yet that's not the way that it appears to work. I was also a bit confused, as from working with another database which has very similar syntax to this, that one only includes the columns to allow index only scans, and the included columns are not indexed, therefore can't be part of index quals and the index only provides a sorted path for the indexed columns, and not the included columns.

Thank you for properly testing. Order by clause in this case definitely doesn't work as expected.
The problem is fixed by patching a planner function "build_index_pathkeys()'. It disables using of index if sorting of included columns is required.
Test example works correctly now - it always performs seq scan and sort.

Saying that, I'm now a bit confused to why the following does not produce 2 indexes which are the same size:

create table t1 (a int, b text);
insert into t1 select x,md5(random()::text) from generate_series(1,1000000) x(x);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index t1_a_b_idx on t1 (a,b);
select pg_relation_Size('t1_a_b_idx'),pg_relation_size('t1_a_inc_b_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
         59064320 |         58744832
(1 row)

I suppose you've already found that in discussion above. Included columns are stored only in leaf index pages. The difference is the size of attributes 'b' which are situated in inner pages of index "t1_a_b_idx".

Also, if we want INCLUDING() to mean "uniqueness is not enforced on these columns, but they're still in the index", then I don't really think allowing types without a btree opclass is a good idea. It's likely too surprised filled and might not be what the user actually wants. I'd suggest that these non-indexed columns would be better defined by further expanding the syntax, the first (perhaps not very good) thing that comes to mind is:

create unique index idx_name on table (unique_col) also index (other,idx,cols) including (leaf,onlycols);

Looking up thread, I don't think I was the first to be confused by this.

Included columns are still in the index physically - they are stored in the index relation. But they are not indexed in the true sense of the word. It's impossible to use them for index scan or ordering. At the beginning, I've got an idea that included columns are supposed to be used for combination of unique index on one columns and covering on others. In a very rare instances one could prefer a non-unique index with included columns "t1_a_inc_b_idx" to a regular multicolumn index "t1_a_b_idx". Frankly, I didn't see such use cases at all. Index size reduction is not considerable, while we lose some useful index functionality on included column. I think that it should be mentioned as a note in documentation, but I need help to phrase it clear.

But now I see the reason to create non-unique index with included columns - lack of suitable opclass on column "b".
It's
impossible to add it into the index as a key column, but that's not a problem with INCLUDING clause.
Look at example.

create table t1 (a int, b box);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index on tbl (a,b);
ERROR:  data type box has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
create index on tbl (a) including (b);
CREATE INDEX

This functionality is provided by the attached patch "omit_opclass_4.0", which must be applied over covering_unique_4.0.patch.


I see what you were confused about, I'd had the same question at the very beginning of the discussion of this patch.
Now it seems a bit more clear to me. INCLUDING columns are not used for the searching or ordering of records, so there is no need to check whether they have an opclass.  INCLUDING columns perform as expected and it agrees with other database experience. And this patch is completed.

But it isn't perfect definitely... I found test case to explain that. See below.
That's why we need optional_opclass functionality, which will use opclass where possible and omit it in other cases.
This idea have been already described in a message Re: [PROPOSAL] Covering + unique indexes as "partially unique index".
I suggest to separate optional_opclass task to ease syntax discussion and following review. And I'll implement it in the next patch a bit later.

Test case:
1) patch covering_unique_4.0 + test_covering_unique_4.0
If included columns' opclasses are used, new query plan is the same with the old one.
and have nearly the same execution time:

                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using oldcoveringidx on oldt  (cost=0.43..301.72 rows=1 width=8) (actual time=0.021..0.676 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.101 ms
 Execution time: 0.697 ms
(5 rows)

                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan using newidx on newt  (cost=0.43..276.51 rows=1 width=8) (actual time=0.020..0.665 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.082 ms
 Execution time: 0.687 ms
(5 rows)

2) patch covering_unique_4.0 + patch omit_opclass_4.0 + test_covering_unique_4.0
Otherwise, new query can not use included column in Index Cond and uses filter instead. It slows down the query significantly.
                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using oldcoveringidx on oldt  (cost=0.43..230.39 rows=1 width=8) (actual time=0.021..0.722 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.091 ms
 Execution time: 0.744 ms
(5 rows)

                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan using newidx on newt  (cost=0.43..374.68 rows=1 width=8) (actual time=0.018..2.595 rows=6 loops=1)
   Index Cond: (c1 < 10000)
   Filter: (c3 < 20)
   Rows Removed by Filter: 9993
   Heap Fetches: 0
 Planning time: 0.078 ms
 Execution time: 2.612 ms
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: proposal: PL/Pythonu - function ereport
Next
From: Catalin Iacob
Date:
Subject: Re: proposal: PL/Pythonu - function ereport