Thread: WIP: Covering + unique indexes.

WIP: Covering + unique indexes.

From: Anastasia Lubennikova

Hi hackers,

I'm working on a patch that allows combining covering and unique functionality for btree indexes.

Previous discussion was here:
1) Proposal thread
2) Message with proposal clarification

In a nutshell, the feature allows creating an index with "key" columns and "included" columns.
"Key" columns can be used as scan keys. The unique constraint relates only to "key" columns.
"Included" columns may be used as scan keys if they have a suitable opclass.
Both "key" and "included" columns can be returned from the index by an IndexOnlyScan.

Btree is the default index type and is used everywhere, so it requires proper testing. Volunteers are welcome :)

Use case:
- We have a table (c1, c2, c3, c4);
- We need to have a unique index on (c1, c2).
- We would like to have a covering index on all columns to avoid reading heap pages.

Old way:
CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2);
CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4);

What's wrong?
Two indexes contain repeated data, which adds overhead to data manipulation operations and increases database size.

New way:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);

The patch is attached.
In 'test.sql' you can find a test with detailed comments on each step, and a comparison of the old and new indexes.

The new feature has the following syntax:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);
The keyword INCLUDING defines the "included" columns of the index. These columns are not part of the unique constraint.
They are also not stored in inner index pages, which reduces the index size.

Results:
1) An additional covering index is not required anymore.
2) The new index can use an IndexOnlyScan for queries where the old index can't.

For example,
explain analyze select c1, c2 from newt where c1<10000 and c3<20;

*more examples in 'test.sql'
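
A minimal sketch to reproduce the comparison (the table definition and test data here are illustrative; 'test.sql' contains the full version):

CREATE TABLE newt (c1 int, c2 int, c3 int, c4 int);
INSERT INTO newt SELECT g, g, g % 100, g % 10 FROM generate_series(1, 100000) g;
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);
VACUUM ANALYZE newt;
explain analyze select c1, c2 from newt where c1<10000 and c3<20;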

Future work:
Make opclasses for "included" columns optional.

CREATE TABLE tbl (c1 int, c4 box);
CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4);

If we don't need c4 as an index scankey, we don't need any btree opclass on it.
But we still want to have it in the covering index for queries like

SELECT c4 FROM tbl WHERE c1=1000;
SELECT * FROM tbl WHERE c1=1000;
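
For context, today an attempt to index the box column directly fails for lack of a btree opclass:

CREATE INDEX ON tbl USING btree (c1, c4);
ERROR:  data type box has no default operator class for access method "btree"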
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: Covering + unique indexes.

From: Thom Brown
On 8 October 2015 at 16:18, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> [...]

The definition output needs a space after "INCLUDING":

# SELECT pg_get_indexdef('people_first_name_last_name_email_idx'::regclass::oid);
                           pg_get_indexdef
----------------------------------------------------------------------
 CREATE UNIQUE INDEX people_first_name_last_name_email_idx ON people
 USING btree (first_name, last_name) INCLUDING(email)
(1 row)


There is also no collation output:

# CREATE UNIQUE INDEX test_idx ON people (first_name COLLATE "en_GB",
last_name) INCLUDING (email COLLATE "pl_PL");
CREATE INDEX

# SELECT pg_get_indexdef('test_idx'::regclass::oid);
                          pg_get_indexdef
------------------------------------------------------------------
 CREATE UNIQUE INDEX test_idx ON people USING btree (first_name
 COLLATE "en_GB", last_name) INCLUDING(email)
(1 row)


As for functioning, it works as described:

# EXPLAIN SELECT email FROM people WHERE (first_name,last_name) = ('Paul','Freeman');
                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Only Scan using people_first_name_last_name_email_idx on people  (cost=0.28..1.40 rows=1 width=21)
   Index Cond: ((first_name = 'Paul'::text) AND (last_name = 'Freeman'::text))
(2 rows)


Typo:

"included columns must not intersects with key columns"

should be:

"included columns must not intersect with key columns"


One thing I've noticed you can do with your patch, which you haven't
mentioned, is have a non-unique covering index:

# CREATE INDEX covering_idx ON people (first_name) INCLUDING (last_name);
CREATE INDEX

# EXPLAIN SELECT first_name, last_name FROM people WHERE first_name = 'Paul';
                                    QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using covering_idx on people  (cost=0.28..1.44 rows=4 width=13)
   Index Cond: (first_name = 'Paul'::text)
(2 rows)

But this appears to behave as if it were a regular multi-column index,
in that it will use the index for ordering rather than sort after
fetching from the index.  So is this really stored the same as a
multi-column index?  The index sizes aren't identical, so something is
different.

Thom



Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova

08.10.2015 19:31, Thom Brown wrote:
> On 8 October 2015 at 16:18, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> [...]
> The definition output needs a space after "INCLUDING":
>
> # SELECT pg_get_indexdef('people_first_name_last_name_email_idx'::regclass::oid);
>                                                       pg_get_indexdef
>
--------------------------------------------------------------------------------------------------------------------------
>   CREATE UNIQUE INDEX people_first_name_last_name_email_idx ON people
> USING btree (first_name, last_name) INCLUDING(email)
> (1 row)
>
>
> There is also no collation output:
>
> # CREATE UNIQUE INDEX test_idx ON people (first_name COLLATE "en_GB",
> last_name) INCLUDING (email COLLATE "pl_PL");
> CREATE INDEX
>
> # SELECT pg_get_indexdef('test_idx'::regclass::oid);
>                                                 pg_get_indexdef
> -------------------------------------------------------------------------------------------------------------
>   CREATE UNIQUE INDEX test_idx ON people USING btree (first_name
> COLLATE "en_GB", last_name) INCLUDING(email)
> (1 row)
>
>
> As for functioning, it works as described:
>
> # EXPLAIN SELECT email FROM people WHERE (first_name,last_name) =
> ('Paul','Freeman');
>                                                  QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>   Index Only Scan using people_first_name_last_name_email_idx on people
>   (cost=0.28..1.40 rows=1 width=21)
>     Index Cond: ((first_name = 'Paul'::text) AND (last_name = 'Freeman'::text))
> (2 rows)
>
>
> Typo:
>
> "included columns must not intersects with key columns"
>
> should be:
>
> "included columns must not intersect with key columns"

Thank you for testing. The mentioned issues are fixed.

> One thing I've noticed you can do with your patch, which you haven't
> mentioned, is have a non-unique covering index:
>
> # CREATE INDEX covering_idx ON people (first_name) INCLUDING (last_name);
> CREATE INDEX
>
> # EXPLAIN SELECT first_name, last_name FROM people WHERE first_name = 'Paul';
>                                     QUERY PLAN
> ---------------------------------------------------------------------------------
>   Index Only Scan using covering_idx on people  (cost=0.28..1.44 rows=4 width=13)
>     Index Cond: (first_name = 'Paul'::text)
> (2 rows)
>
> But this appears to behave as if it were a regular multi-column index,
> in that it will use the index for ordering rather than sort after
> fetching from the index.  So is this really stored the same as a
> multi-column index?  The index sizes aren't identical, so something is
> different.

Yes, it behaves as a regular multi-column index.
Index sizes are different because included attributes are not stored in
inner index pages, which reduces the index size. I'm not sure whether it
hurts search speed.
But I assumed that we never execute a search on included columns
without a clause on the key columns,
so it shouldn't be too costly to recheck included attributes on leaf pages.

Furthermore, it's the first step of the work on "optional opclasses for
included columns".
If an attribute has no opclass, we certainly don't need to store it in
inner index pages.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova
Finally, the completed patch "covering_unique_3.0.patch" is here.
It includes the functionality discussed above in the thread, regression tests, and a documentation update.
I think it's quite ready for review.

Future work:
Besides that, I'd like to get feedback on the attached patch "optional_opclass_3.0.patch".
It should be applied on top of "covering_unique_3.0.patch".

Actually, this patch is the first step toward making opclasses for "included" columns optional
and implementing real covering indexing.

Example:
CREATE TABLE tbl (c1 int, c4 box);
CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4);

If we don't need c4 as an index scankey, we don't need any btree opclass on it.
But we still want to have it in the covering index for queries like

SELECT c4 FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan
SELECT * FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan

The patch "optional_opclass" completely ignores opclasses of included attributes.
To see the difference, look at the explain analyze output:

explain analyze select * from tbl where c1=2 and c4 && box '(0,0,1,1)';
                                                  QUERY PLAN                                                  
---------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx on tbl  (cost=0.13..4.15 rows=1 width=36) (actual time=0.010..0.013 rows=1 loops=1)
   Index Cond: (c1 = 2)
   Filter: (c4 && '(1,1),(0,0)'::box)

"Index Cond" shows the index ScanKey conditions and "Filter" is for conditions which are used after index scan. Anyway it is faster than SeqScan that we had before, because IndexOnlyScan avoids extra heap fetches.

As I already said, this patch is just a WIP, so the included opclass is not "optional" but actually "ignored".
The following example therefore works worse than without the patch; please don't worry about it.

CREATE TABLE tbl2 (c1 int, c2 int);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2);
explain analyze select * from tbl2 where c1<20 and c2<5;
                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx2 on tbl2  (cost=0.28..4.68 rows=10 width=8) (actual time=0.055..0.066 rows=9 loops=1)
   Index Cond: (c1 < 20)
   Filter: (c2 < 5)

The question is more about suitable syntax.
We have two different optimizations here:
1. INCLUDED columns
2. Optional opclasses
It's logical to provide optional opclasses only for included columns.
Is it OK to handle both with the same syntax and resolve all opclass conflicts at CREATE INDEX time?

CREATE TABLE tbl2 (c1 int, c2 int, c4 box);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2, c4);
CREATE UNIQUE INDEX idx3 ON tbl2 USING btree (c1) INCLUDING (c4, c2);

Of course, the order of attributes is important.
Attributes which have an opclass and should use it in a ScanKey must be placed before the others.
idx2 will use c2 in the Index Cond, while idx3 will not.
But I think that's a job for the DBA.
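
To sketch the difference (hypothetical plan fragments, costs omitted):

explain select * from tbl2 where c1<20 and c2<5;
-- with idx2:  Index Cond: ((c1 < 20) AND (c2 < 5))
-- with idx3:  Index Cond: (c1 < 20)
--             Filter: (c2 < 5)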

If you see any related changes in the planner, please mention them. I haven't explored that part of the code yet and could have missed something.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: WIP: Covering + unique indexes.

From: Robert Haas
On Tue, Dec 1, 2015 at 7:53 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> If we don't need c4 as an index scankey, we don't need any btree opclass on
> it.
> But we still want to have it in the covering index for queries like
>
> SELECT c4 FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan
> SELECT * FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan
>
> The patch "optional_opclass" completely ignores opclasses of included
> attributes.

OK, I don't get it.  Why have an opclass here at all, even optionally?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova

03.12.2015 04:03, Robert Haas wrote:
> On Tue, Dec 1, 2015 at 7:53 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> If we don't need c4 as an index scankey, we don't need any btree opclass on
>> it.
>> But we still want to have it in the covering index for queries like
>>
>> SELECT c4 FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan
>> SELECT * FROM tbl WHERE c1=1000; -- uses the IndexOnlyScan
>>
>> The patch "optional_opclass" completely ignores opclasses of included
>> attributes.
> OK, I don't get it.  Why have an opclass here at all, even optionally?

We don't have an opclass on c4, and there's no need to have one.
But currently (without the patch) it's impossible to create a covering index
that contains columns with no btree opclass.

test=# create index on tbl using btree (c1, c4);
ERROR:  data type box has no default operator class for access method "btree"

ComputeIndexAttrs() processes the list of index attributes, trying to
get an opclass for each of them via GetIndexOpClass().
The patch drops this check for included attributes, which makes it possible
to store any datatype in a btree index and use the advantages of IndexOnlyScan.
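
For example, with the patch applied, the same table accepts the box column as an included column:

test=# create index on tbl using btree (c1) including (c4);
CREATE INDEX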

I hope this helps to clarify things.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From: Jeff Janes
On Tue, Dec 1, 2015 at 4:53 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Finally, the completed patch "covering_unique_3.0.patch" is here.
> It includes the functionality discussed above in the thread, regression
> tests, and a documentation update.
> I think it's quite ready for review.

Thanks for the patch.

I get a compiler warning when building it on gcc (SUSE Linux) 4.8.1
20130909 [gcc-4_8-branch revision 202388]:

nbtinsert.c: In function '_bt_check_unique':
nbtinsert.c:256:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  SnapshotData SnapshotDirty;
  ^


And the dblink contrib module fails its make check.

I'm trying to find a good test case for it.  Unfortunately in most of
my natural use cases, the inclusion of the extra column causes the
updates to become non-HOT, which causes more problems than it solves.

Cheers,

Jeff



Re: WIP: Covering + unique indexes.

From: Jeff Janes
On Sat, Dec 26, 2015 at 5:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> And the dblink contrib module fails its make check.

Ignore the dblink complaint.  It seems to have been some wonky build
issue that is not reproducible.



Re: WIP: Covering + unique indexes.

From: David Rowley
On 2 December 2015 at 01:53, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
Finally, the completed patch "covering_unique_3.0.patch" is here.
It includes the functionality discussed above in the thread, regression tests, and a documentation update.
I think it's quite ready for review.

Hi Anastasia,

I've maybe mentioned before that I think this is a great feature and I think it will be very useful to have, so I've signed up to review the patch, and below are the results of my first pass from reading the code. Apologies if some of these seem like nitpicks; I've basically just listed everything I've noticed along the way, no matter how small.

- int natts = rel->rd_rel->relnatts;
+ int nkeyatts = rel->rd_index->indnkeyatts;
+
+ Assert (rel->rd_index != NULL);
+ Assert(rel->rd_index->indnatts != 0);
+ Assert(rel->rd_index->indnkeyatts != 0);
+
  SnapshotData SnapshotDirty;

There's a couple of problems here. According to [1] the C code must follow the C89 standard, but this appears not to. You have some statements before the final variable declaration, and also there's a problem as you're Asserting that rel->rd_index != NULL after already trying to dereference it in the assignment to nkeyatts, which makes the Assert() useless.

+   An access method that supports this feature sets <structname>pg_am</>.<structfield>amcanincluding</> true.

I don't think this belongs under the "Index Uniqueness Checks" title. I think the "Columns included with clause INCLUDING  aren't used to enforce uniqueness." that you've added before it is a good idea, but perhaps the details of amcanincluding are best explained elsewhere.

-   indexed columns are equal in multiple rows.
+   indexed columns are equal in multiple rows. Columns included with clause
+   INCLUDING  aren't used to enforce constraints (UNIQUE, PRIMARY KEY, etc).

<literal> is missing around "INCLUDING" here. Perhaps this part needs more explanation in a new paragraph. Likely it's a good idea to also inform the reader that the columns which are part of the INCLUDING clause exist only to allow the query planner to skip having to perform a lookup to the heap when all of the columns required for the relation are present in the indexed columns, or in the INCLUDING columns. I think you should explain that the index can also only be used as pre-sorted input for columns which are in the "indexed columns" part of the index, and that the INCLUDING columns are not searchable as index quals.

--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -137,7 +137,6 @@ CheckIndexCompatible(Oid oldId,
  Relation irel;
  int i;
  Datum d;
-
  /* Caller should already have the relation locked in some way. */

You've accidentally removed an empty line here.

+ /*
+ * All information about key and included cols is in numberOfKeyAttributes number.
+ * So we can concat all index params into one list.
+ */
+ stmt->indexParams = list_concat(stmt->indexParams, stmt->indexIncludingParams);

I think this should be explained with a better comment, perhaps:

/*
 * We append any INCLUDING columns onto the indexParams list so that
 * we have one list with all columns. Later we can determine which of these
 * are indexed, and which are just part of the INCLUDING list by checking the
 * list position. A list item in a position less than ii_NumIndexKeyAttrs is
 * part of the indexed columns, and anything equal to or over is part of the
 * INCLUDING columns.
 */

+ stack = _bt_search(rel, IndexRelationGetNumberOfKeyAttributes(rel), itup_scankey,

This line is longer than 80 chars.

+ /* Truncate nonkey attributes when inserting on nonleaf pages */
+ if (wstate->index->rd_index->indnatts != wstate->index->rd_index->indnkeyatts)
+ {
+ BTPageOpaque pageop = (BTPageOpaque) PageGetSpecialPointer(npage);
+
+ if (!P_ISLEAF(pageop))
+ {
+ itup = index_reform_tuple(wstate->index, itup, wstate->index->rd_index->indnatts, wstate->index->rd_index->indnkeyatts);
+ itupsz = IndexTupleDSize(*itup);
+ itupsz = MAXALIGN(itupsz);
+ }
+ }

A few of the lines here are over 80 chars.


+        This clause specifies additional columns to be appended to the set of index columns.
+        Included columns don't support any constraints <literal>(UNIQUE, PRMARY KEY, EXCLUSION CONSTRAINT)</>.
+        These columns can improve the performance of some queries  through using advantages of index-only scan
+        (Or so called <firstterm>covering</firstterm> indexes. Covering index is the index that
+        covers all columns required in the query and prevents a table access).
+        Besides that, included attributes are not stored in index inner pages.
+        It allows to decrease index size and furthermore it provides a way to extend included
+        columns to store atttributes without suitable opclass (not implemented yet).
+        This clause could be applied to both unique and nonunique indexes.
+        It's possible to have non-unique covering index, which behaves as a regular
+        multi-column index with a bit smaller index-size.
+        Currently, only the B-tree access method supports this feature.

"PRMARY KEY" should be "PRIMARY KEY". I ended up rewriting this paragraph as follows.

"An optional <literal>INCLUDING</> clause allows a list of columns to be specified which will be included in the index, in the non-key portion of the index. Columns which are part of this clause cannot also exist in the indexed columns portion of the index, and vice versa. The <literal>INCLUDING</> columns exist solely to allow more queries to benefit from <firstterm>index only scans</> by including certain columns in the index, the value of which would otherwise have to be obtained by reading the table's heap. Having these columns in the <literal>INCLUDING</> clause in some cases allows <productname>PostgreSQL</> to skip the heap read completely. This also allows <literal>UNIQUE</> indexes to be defined on one set of columns, which can include another set of column in the <literal>INCLUDING</> clause, on which the uniqueness is not enforced upon. This can also be useful for non-unique indexes as any columns which are not required for the searching or ordering of records can defined in the <literal>INCLUDING</> clause, which can often reduce the size of the index."

Maybe not perfect, but maybe it's an improvement?


+   To create an unique B-tree index on the column <literal>title</literal> in

and

+   To create an unique B-tree index on the column <literal>title</literal>

Although "unique" starts with a vowel, "an" is not correct here: This is best explained in someone else's words:

"The choice between a and an is governed not by whether the next written letter is a consonant or vowel but by whether the next word begins with the sound of a vowel or consonant. Unique begins with a "y" sound, hence a unique is correct."

- int natts = rel->rd_rel->relnatts;
+ int nkeyatts = rel->rd_rel->relnatts;
  ScanKey itup_scankey;
  BTStack stack;
  Buffer buf;
  OffsetNumber offset;
 
+ Assert (rel->rd_index != NULL);
+ Assert(rel->rd_index->indnatts != 0);
+ Assert(rel->rd_index->indnkeyatts != 0);
+ nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+

nkeyatts is assigned twice.

+ /* Truncate nonkey attributes when inserting on nonleaf pages. */
+ if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
+ if (!P_ISLEAF(lpageop))
+ itup = index_reform_tuple(rel, itup, rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

I don't recall having seen any places in the code which skip the outer {} braces in this way before, although I can't see anything in the coding standards which states that this is wrong. In either case, perhaps it's better to just use an && instead of the extra if (). The assignment line also exceeds 80 chars.

+/*
+ * Reform index tuple. Truncate nonkey (INCLUDED) attributes.
+ */

I guess "INCLUDED" should be "INCLUDING"? The capitalisation makes me think you're talking about the syntax.

+ if (!colno || colno == keyno + 1) {
  appendStringInfoString(&buf, quote_identifier(attname));
+ if ((attrsOnly)&&(keyno >= idxrec->indnkeyatts))
+ appendStringInfoString(&buf, " (included)");
+ }

The { brace here should be on the next line. I'm also a bit unsure what the "(included)" is for. There's also surplus parenthesis in the 2nd "if" statement, and also missing whitespace.

+ bool amcanincluding; /* does AM support INCLUDING columns? */

Perhaps this should be called "amcaninclude". I don't think we really need to use the same word as is used in the SQL syntax here, do we?
Same for the new column in pg_am.

Perhaps this needs the comment updated from the standard one.

int16 indnatts; /* number of columns in index */

maybe just say /* total number of columns in index */ ?

+ int ii_NumIndexKeyAttrs;

The struct comment needs an entry for ii_NumIndexKeyAttrs.

+ List   *indexIncludingParams; /* additional columns to index: a list of IndexElem */

This should wrap at 80 chars. struct RestrictInfo has some examples of how this is normally done.

 /*
+ * RelationGetNumberOfAttributes
+ * Returns the number of attributes in a relation.
+ */
+#define IndexRelationGetNumberOfKeyAttributes(relation) ((relation)->rd_index->indnkeyatts)
+

Copy paste problem. You missed editing the comment.

I've not tested the patch yet. I will send another email soon with the results of that.

Thanks for working on this.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 4 January 2016 at 21:49, David Rowley <david.rowley@2ndquadrant.com> wrote:
I've not tested the patch yet. I will send another email soon with the results of that.

Hi,

As promised I've done some testing on this, and I've found something which is not quite right:

create table ab (a int,b int);
insert into ab select x,y from generate_series(1,20) x(x), generate_series(10,1,-1) y(y);
create index on ab (a) including (b);
explain select * from ab order by a,b;
                        QUERY PLAN                        
----------------------------------------------------------
 Sort  (cost=10.64..11.14 rows=200 width=8)
   Sort Key: a, b
   ->  Seq Scan on ab  (cost=0.00..3.00 rows=200 width=8)
(3 rows)

This is what I'd expect

truncate table ab;
insert into ab select x,y from generate_series(1,20) x(x), generate_series(10,1,-1) y(y);
explain select * from ab order by a,b;
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Index Only Scan using ab_a_b_idx on ab  (cost=0.15..66.87 rows=2260 width=8)
(1 row)

This index, as we've defined it, should not be able to satisfy the query's ORDER BY. Although it does give correct results, that seems to be because the index is built wrongly in cases where the rows are added after the index exists.

If we then do:

reindex table ab;
explain select * from ab order by a,b;
                        QUERY PLAN                        
----------------------------------------------------------
 Sort  (cost=10.64..11.14 rows=200 width=8)
   Sort Key: a, b
   ->  Seq Scan on ab  (cost=0.00..3.00 rows=200 width=8)
(3 rows)

It looks normal again.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From: Jeff Janes
On Tue, Jan 5, 2016 at 11:55 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 4 January 2016 at 21:49, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>>
>> I've not tested the patch yet. I will send another email soon with the
>> results of that.
>
>
> Hi,
>
> As promised I've done some testing on this, and I've found something which
> is not quite right:
>
> create table ab (a int,b int);
> insert into ab select x,y from generate_series(1,20) x(x),
> generate_series(10,1,-1) y(y);
> create index on ab (a) including (b);
> explain select * from ab order by a,b;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Sort  (cost=10.64..11.14 rows=200 width=8)
>    Sort Key: a, b
>    ->  Seq Scan on ab  (cost=0.00..3.00 rows=200 width=8)
> (3 rows)

If you set enable_sort=off, then you get the index-only scan with no
sort.  So it believes the index can be used for ordering (correctly, I
think), just sometimes it thinks it is not faster to do it that way.

I'm not sure why this would be a correctness problem.  The covered
column does not participate in uniqueness checks, but it still usually
participates in index ordering.  (That is why dummy op-classes are
needed if you want to include non-sortable-type columns as being
covered.)

>
> This is what I'd expect
>
> truncate table ab;
> insert into ab select x,y from generate_series(1,20) x(x),
> generate_series(10,1,-1) y(y);
> explain select * from ab order by a,b;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Index Only Scan using ab_a_b_idx on ab  (cost=0.15..66.87 rows=2260
> width=8)
> (1 row)
>
> This index, as we've defined it, should not be able to satisfy the query's
> ORDER BY. Although it does give correct results, that seems to be because
> the index is built wrongly in cases where the rows are added after the
> index exists.

I think this just causes differences in planner statistics leading to
different plans.  ANALYZE the table and it goes back to doing the
sort, because it thinks the sort is faster.
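
For example (a sketch - exact costs will vary):

analyze ab;
explain select * from ab order by a,b;
 Sort  (cost=...)
   Sort Key: a, b
   ->  Seq Scan on ab  (cost=...)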

Cheers,

Jeff



Re: WIP: Covering + unique indexes.

From: David Rowley
On 7 January 2016 at 06:36, Jeff Janes <jeff.janes@gmail.com> wrote:
[...]

If you set enable_sort=off, then you get the index-only scan with no
sort.  So it believes the index can be used for ordering (correctly, I
think), just sometimes it thinks it is not faster to do it that way.

I'm not sure why this would be a correctness problem.  The covered
column does not participate in uniqueness checks, but it still usually
participates in index ordering.  (That is why dummy op-classes are
needed if you want to include non-sortable-type columns as being
covered.)

If that's the case, then it appears that I've misunderstood INCLUDING. From reading _bt_doinsert() it appeared that it would ignore the INCLUDING columns and just find the insert position based on the key columns, yet that's not the way it appears to work. I was also a bit confused because, from working with another database which has very similar syntax to this, that one only includes the columns to allow index-only scans: the included columns are not indexed, therefore they can't be part of index quals, and the index only provides a sorted path for the indexed columns, not the included columns.

Saying that, I'm now a bit confused as to why the following does not produce 2 indexes which are the same size:

create table t1 (a int, b text);
insert into t1 select x,md5(random()::text) from generate_series(1,1000000) x(x);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index t1_a_b_idx on t1 (a,b);
select pg_relation_Size('t1_a_b_idx'),pg_relation_size('t1_a_inc_b_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
         59064320 |         58744832
(1 row)

Also, if we want INCLUDING() to mean "uniqueness is not enforced on these columns, but they're still in the index", then I don't really think allowing types without a btree opclass is a good idea. It's likely too surprise-filled and might not be what the user actually wants. I'd suggest that these non-indexed columns would be better defined by further expanding the syntax; the first (perhaps not very good) thing that comes to mind is:

create unique index idx_name on table (unique_col) also index (other,idx,cols) including (leaf,onlycols);

Looking up thread, I don't think I was the first to be confused by this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova

04.01.2016 11:49, David Rowley:
[...]

First of all, I would like to thank you for writing such a detailed review.
All mentioned style problems, comments and typos are fixed in the patch v4.0.
+   An access method that supports this feature sets <structname>pg_am</>.<structfield>amcanincluding</> true.

I don't think this belongs under the "Index Uniqueness Checks" title. I think the "Columns included with clause INCLUDING  aren't used to enforce uniqueness." that you've added before it is a good idea, but perhaps the details of amcanincluding are best explained elsewhere.
agree

+        This clause specifies additional columns to be appended to the set of index columns.
+        Included columns don't support any constraints <literal>(UNIQUE, PRMARY KEY, EXCLUSION CONSTRAINT)</>.
+        These columns can improve the performance of some queries  through using advantages of index-only scan
+        (Or so called <firstterm>covering</firstterm> indexes. Covering index is the index that
+        covers all columns required in the query and prevents a table access).
+        Besides that, included attributes are not stored in index inner pages.
+        It allows to decrease index size and furthermore it provides a way to extend included
+        columns to store atttributes without suitable opclass (not implemented yet).
+        This clause could be applied to both unique and nonunique indexes.
+        It's possible to have non-unique covering index, which behaves as a regular
+        multi-column index with a bit smaller index-size.
+        Currently, only the B-tree access method supports this feature.

"PRMARY KEY" should be "PRIMARY KEY". I ended up rewriting this paragraph as follows.

"An optional <literal>INCLUDING</> clause allows a list of columns to be specified which will be included in the index, in the non-key portion of the index. Columns which are part of this clause cannot also exist in the indexed columns portion of the index, and vice versa. The <literal>INCLUDING</> columns exist solely to allow more queries to benefit from <firstterm>index only scans</> by including certain columns in the index, the value of which would otherwise have to be obtained by reading the table's heap. Having these columns in the <literal>INCLUDING</> clause in some cases allows <productname>PostgreSQL</> to skip the heap read completely. This also allows <literal>UNIQUE</> indexes to be defined on one set of columns, which can include another set of column in the <literal>INCLUDING</> clause, on which the uniqueness is not enforced upon. This can also be useful for non-unique indexes as any columns which are not required for the searching or ordering of records can defined in the <literal>INCLUDING</> clause, which can often reduce the size of the index."

Maybe not perfect, but maybe it's an improvement?


Yes, this explanation is much better. I've just added a couple of notes.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: WIP: Covering + unique indexes.

From: Anastasia Lubennikova
08.01.2016 00:12, David Rowley:
[...]

If that's the case, then it appears that I've misunderstood INCLUDING. From reading _bt_doinsert() it appeared that it would ignore the INCLUDING columns and just find the insert position based on the key columns, yet that's not the way it appears to work. I was also a bit confused because, from working with another database which has very similar syntax to this, that one only includes the columns to allow index-only scans: the included columns are not indexed, therefore they can't be part of index quals, and the index only provides a sorted path for the indexed columns, not the included columns.

Thank you for properly testing. The ORDER BY clause in this case definitely didn't work as expected.
The problem is fixed by patching the planner function build_index_pathkeys(): it now disables use of the index if sorting by included columns is required.
The test example works correctly now - it always performs a seq scan and sort.
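
Even with enable_sort=off the index is no longer reported as providing ordering on the included column (a sketch of the expected plan shape, costs omitted):

set enable_sort = off;
explain select * from ab order by a,b;
 Sort
   Sort Key: a, b
   ->  Seq Scan on ab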

Saying that, I'm now a bit confused as to why the following does not produce 2 indexes which are the same size:

create table t1 (a int, b text);
insert into t1 select x,md5(random()::text) from generate_series(1,1000000) x(x);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index t1_a_b_idx on t1 (a,b);
select pg_relation_Size('t1_a_b_idx'),pg_relation_size('t1_a_inc_b_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
         59064320 |         58744832
(1 row)

I suppose you've already found that in the discussion above. Included columns are stored only in leaf index pages. The difference is the size of the 'b' attributes stored in the inner pages of index "t1_a_b_idx".

Also, if we want INCLUDING() to mean "uniqueness is not enforced on these columns, but they're still in the index", then I don't really think allowing types without a btree opclass is a good idea. It's likely too surprise-filled and might not be what the user actually wants. I'd suggest that these non-indexed columns would be better defined by further expanding the syntax; the first (perhaps not very good) thing that comes to mind is:

create unique index idx_name on table (unique_col) also index (other,idx,cols) including (leaf,onlycols);

Looking up thread, I don't think I was the first to be confused by this.

Included columns are still in the index physically - they are stored in the index relation. But they are not indexed in the true sense of the word: it's impossible to use them for an index scan or for ordering. At the beginning, my idea was that included columns would serve to combine a unique index on one set of columns with a covering index on others. In very rare instances one could prefer a non-unique index with included columns, "t1_a_inc_b_idx", to a regular multicolumn index "t1_a_b_idx". Frankly, I didn't see such use cases at all: the index size reduction is not considerable, while we lose some useful index functionality on the included columns. I think this should be mentioned as a note in the documentation, but I need help phrasing it clearly.

But now I see a reason to create a non-unique index with included columns - the lack of a suitable opclass on column "b".
It's impossible to add it to the index as a key column, but that's not a problem with the INCLUDING clause.
Look at this example.

create table t1 (a int, b box);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index on t1 (a,b);
ERROR:  data type box has no default operator class for access method "btree"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.
create index on t1 (a) including (b);
CREATE INDEX

This functionality is provided by the attached patch "omit_opclass_4.0", which must be applied over covering_unique_4.0.patch.


I see what you were confused about; I had the same question at the very beginning of the discussion of this patch.
Now it seems a bit clearer to me. INCLUDING columns are not used for searching or ordering of records, so there is no need to check whether they have an opclass. INCLUDING columns perform as expected, and this agrees with the experience of other databases. With that, this patch is complete.

But it definitely isn't perfect... I found a test case to explain that; see below.
That's why we need the optional_opclass functionality, which will use the opclass where possible and omit it in other cases.
This idea has already been described in the message "Re: [PROPOSAL] Covering + unique indexes" as a "partially unique index".
I suggest separating out the optional_opclass task to ease the syntax discussion and the following review. I'll implement it in the next patch a bit later.

Test case:
1) patch covering_unique_4.0 + test_covering_unique_4.0
If the included columns' opclasses are used, the new query plan is the same as the old one
and has nearly the same execution time:

                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using oldcoveringidx on oldt  (cost=0.43..301.72 rows=1 width=8) (actual time=0.021..0.676 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.101 ms
 Execution time: 0.697 ms
(5 rows)

                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan using newidx on newt  (cost=0.43..276.51 rows=1 width=8) (actual time=0.020..0.665 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.082 ms
 Execution time: 0.687 ms
(5 rows)

2) patch covering_unique_4.0 + patch omit_opclass_4.0 + test_covering_unique_4.0
Otherwise, the new query cannot use the included column in the Index Cond and uses a filter instead. This slows down the query significantly.
                                                         QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using oldcoveringidx on oldt  (cost=0.43..230.39 rows=1 width=8) (actual time=0.021..0.722 rows=6 loops=1)
   Index Cond: ((c1 < 10000) AND (c3 < 20))
   Heap Fetches: 0
 Planning time: 0.091 ms
 Execution time: 0.744 ms
(5 rows)

                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
 Index Only Scan using newidx on newt  (cost=0.43..374.68 rows=1 width=8) (actual time=0.018..2.595 rows=6 loops=1)
   Index Cond: (c1 < 10000)
   Filter: (c3 < 20)
   Rows Removed by Filter: 9993
   Heap Fetches: 0
 Planning time: 0.078 ms
 Execution time: 2.612 ms
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: WIP: Covering + unique indexes.

From: Jeff Janes
On Tue, Jan 12, 2016 at 8:59 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 08.01.2016 00:12, David Rowley:
>
> On 7 January 2016 at 06:36, Jeff Janes <jeff.janes@gmail.com> wrote:
>>

> But now I see a reason to create a non-unique index with included columns -
> the lack of a suitable opclass on column "b".
> It's impossible to add it to the index as a key column, but that's not a
> problem with the INCLUDING clause.
> Look at this example.
>
> create table t1 (a int, b box);
> create index t1_a_inc_b_idx on t1 (a) including (b);
> create index on t1 (a,b);
> ERROR:  data type box has no default operator class for access method
> "btree"
> HINT:  You must specify an operator class for the index or define a default
> operator class for the data type.
> create index on t1 (a) including (b);
> CREATE INDEX
>
> This functionality is provided by the attached patch "omit_opclass_4.0",
> which must be applied over covering_unique_4.0.patch.

Thanks for the updates.

Why is omit_opclass a separate patch?  If the included columns now
never participate in the index ordering, shouldn't it be an inherent
property of the main patch that you can "cover" things without btree
opclasses?

Are you keeping them separate just to make review easier?  Or do you
think there might be a reason to commit one but not the other?  I
think that if we decide not to use the omit_opclass patch, then we
should also not allow covering columns to be specified on non-unique
indexes.

It looks like the "covering" patch, with or without the "omit_opclass"
patch, does not support expressions as included columns:

create table foobar (x text, y xml);
create index on foobar (x) including  (md5(x));
ERROR:  unrecognized node type: 904
create index on foobar (x) including  ((y::text));
ERROR:  unrecognized node type: 911

I think we would probably want it to work with those (or at least to
throw a better error message).

Thanks,

Jeff



Re: WIP: Covering + unique indexes.

From: David Rowley
On 13 January 2016 at 05:59, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
08.01.2016 00:12, David Rowley:
[...]
Thank you for properly testing. The ORDER BY clause in this case definitely didn't work as expected.
The problem is fixed by patching the planner function build_index_pathkeys(): it now disables use of the index if sorting by included columns is required.
The test example works correctly now - it always performs a seq scan and sort.


Thank you for updating the patch.
That's cleared up my confusion. All the code I read seemed to indicate that INCLUDING columns were leaf only; it just confused me as to why the indexes appeared to search and order on all columns, including the INCLUDING columns. Thanks for clearing up my confusion and fixing the patch.
 
Saying that, I'm now a bit confused as to why the following does not produce 2 indexes which are the same size:

create table t1 (a int, b text);
insert into t1 select x,md5(random()::text) from generate_series(1,1000000) x(x);
create index t1_a_inc_b_idx on t1 (a) including (b);
create index t1_a_b_idx on t1 (a,b);
select pg_relation_Size('t1_a_b_idx'),pg_relation_size('t1_a_inc_b_idx');
 pg_relation_size | pg_relation_size 
------------------+------------------
         59064320 |         58744832
(1 row)

I suppose you've already found that in the discussion above. Included columns are stored only in leaf index pages. The difference is the size of the 'b' attributes stored in the inner pages of index "t1_a_b_idx".

Yeah, I saw that from the code too. I was just confused, as they appeared to work like normal indexes.

I've made another pass of the covering_unique_4.0.patch. Again, some things are nitpicky (sorry), but it made sense to write them down as I noticed them.

-   multiple entries with identical keys.  An access method that supports this
+   multiple entries with identical keys. An access method that supports this

Space removed by mistake.

    feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
-   (At present, only b-tree supports it.)
+   Columns included with clause INCLUDING  aren't used to enforce uniqueness.
+   (At present, only b-tree supports them.)

Maybe 

+   (At present <structfield>amcanunique</> is only supported by b-tree
+   indexes.)

The extra line you've added confuses what "it" or "them" means, so it's maybe best to clarify that.


+   <literal>INCLUDING</literal> aren't used to enforce constraints (UNIQUE, PRIMARY KEY, etc).

Goes beyond 80 chars.


  right_item = CopyIndexTuple(item);
+ right_item = index_reform_tuple(rel, right_item, rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

Duplicate assignment. Should this perhaps be:

+ if (rel->rd_index->indnatts == rel->rd_index->indnkeyatts)
+   right_item = CopyIndexTuple(item);
+ else
+ right_item = index_reform_tuple(rel, right_item, rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

?

- natts = RelationGetNumberOfAttributes(rel);
- indoption = rel->rd_indoption;
 
- skey = (ScanKey) palloc(natts * sizeof(ScanKeyData));
+ Assert(rel->rd_index->indnkeyatts != 0);
+ Assert(rel->rd_index->indnkeyatts <= rel->rd_index->indnatts);
 
- for (i = 0; i < natts; i++)
+ nkeyatts = rel->rd_index->indnkeyatts;

Since RelationGetNumberOfAttributes() was previously used, maybe you should do:

+ nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); 


Yet I'm not really sure if there is some rule about when RelationGetNumberOfAttributes(rel) is used and when rel->rd_rel->relnatts is used. It seems so mixed up.

  accessMethodName = stmt->accessMethod;
+
  tuple = SearchSysCache1(AMNAME, PointerGetDatum(accessMethodName));

Unrelated change.

+#define Anum_pg_am_amcaninclude 15

Needs 1 more tab so that "15" lines up with the other numbers.

 typedef struct IndexInfo
 {
  NodeTag type;
- int ii_NumIndexAttrs;
+ int ii_NumIndexAttrs; /* total number of columns in index */
+ int ii_NumIndexKeyAttrs; /* number of key columns in index */

The comment above this struct still needs a comment for "NumIndexKeyAttrs". I'm not sure exactly why there are comments in both places for that struct, but it makes sense to follow what's been done already.


+ * Returns the number of key attributes in a relation.

I think "relation" should be "index".

Here's a few things that I'm not too sure on, which maybe Jeff or others could give their opinion on:

ERROR:  duplicate key value violates unique constraint "covering_index_index"
DETAIL:  Key (f1, f2, f3 (included))=(1, 2, BBB) already exists.

Should we only display the key columns here? f3 feels like it does not belong in any reports about unique violations.

+ if(list_intersection(stmt->indexParams, stmt->indexIncludingParams) != NIL)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("included columns must not intersect with key columns")));
+

I wonder if a bit more effort should be spent here to generate a better message. We do a bit more in cases like:

# create table a (a int, b int, c int, a int, b int);
ERROR:  column "a" specified more than once

Perhaps it would be a good idea to also report the first matching intersect item found. Any thoughts?
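
For example, here's a hypothetical session hitting that check (the error text is taken from the patch hunk above; exact behavior is a sketch):

create table t (a int, b int, c int);
create unique index on t (a, b) including (b, c);
ERROR:  included columns must not intersect with key columns

Reporting something like 'included column "b" also appears in the key columns' would point straight at the offending column.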

# create index on ab using hash (a) including (b);
WARNING:  hash indexes are not WAL-logged and their use is discouraged
ERROR:  access method "hash" does not support multicolumn indexes

I wonder if it's better to report: errmsg("access method \"%s\" does not support included columns") before the multicolumn check? It probably does not matter that much, but if a user thought (a) including (b) was a single-column index on "a", then it's a bit confusing.

I've also done some testing:

create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,10) a(a), generate_series(1,10000) b(b);
set enable_bitmapscan=off;
set enable_indexscan=off;

select * from ab where a = 1 and b=1;
 a | b
---+---
 1 | 1
(1 row)

set enable_indexscan = on;
select * from ab where a = 1 and b=1;
 a | b
---+---
(0 rows)

This is broken. I've not looked into why yet, but from looking at the EXPLAIN output I was a bit surprised to see b=1 as an index condition. I'd have expected a Filter maybe, but I've not looked at the EXPLAIN code to see how those are determined yet.
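
For reference, the plan shape I'd have expected to see for an INCLUDING column is roughly (a sketch, not actual output):

   Index Cond: (a = 1)
   Filter: (b = 1)

rather than b = 1 appearing in the Index Cond.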

I've not looked at the other patch yet.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 13 January 2016 at 06:47, Jeff Janes <jeff.janes@gmail.com> wrote:

Why is omit_opclass a separate patch?  If the included columns now
never participate in the index ordering, shouldn't it be an inherent
property of the main patch that you can "cover" things without btree
opclasses?


I also wondered this. We can't have covering indexes without fixing the problem with the following arrays:

  info->indexkeys = (int *) palloc(sizeof(int) * ncolumns);
  info->indexcollations = (Oid *) palloc(sizeof(Oid) * ncolumns);
  info->opfamily = (Oid *) palloc(sizeof(Oid) * ncolumns);

These need to be sized according to the number of key columns, not the total number of columns. Of course, the TODO item in the patch states this too.

I don't personally think the covering_unique_4.0.patch is that close to being too big to review; I think things would make more sense if the omit_opclass_4.0.patch was included together with this.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:


13.01.2016 04:47, David Rowley:
On 13 January 2016 at 06:47, Jeff Janes <jeff.janes@gmail.com> wrote:

Why is omit_opclass a separate patch?  If the included columns now
never participate in the index ordering, shouldn't it be an inherent
property of the main patch that you can "cover" things without btree
opclasses?


I don't personally think the covering_unique_4.0.patch is that close to being too big to review; I think things would make more sense if the omit_opclass_4.0.patch was included together with this.


I agree that these patches should be merged. It'll be fixed in the next update.
I kept them separate only for historical reasons; it was more convenient for me to debug them. Furthermore, I wanted to show the performance degradation caused by "omit_opclass" and give a way to reproduce it by running the test with and without the patch.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
13.01.2016 04:27, David Rowley:
> I've also done some testing:
>
> create table ab (a int, b int);
> insert into ab select a,b from generate_Series(1,10) a(a), 
> generate_series(1,10000) b(b);
> set enable_bitmapscan=off;
> set enable_indexscan=off;
>
> select * from ab where a = 1 and b=1;
>  a | b
> ---+---
>  1 | 1
> (1 row)
>
> set enable_indexscan = on;
> select * from ab where a = 1 and b=1;
>  a | b
> ---+---
> (0 rows)
>
> This is broken. I've not looked into why yet, but from looking at the 
> EXPLAIN output I was a bit surprised to see b=1 as an index condition. 
> I'd have expected a Filter maybe, but I've not looked at the EXPLAIN 
> code to see how those are determined yet.

Hmm... Do you use both patches?
And could you provide the index definition? I can't reproduce the problem,
assuming the index is created by the statement
CREATE INDEX idx ON ab (a) INCLUDING (b);

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 14 January 2016 at 02:58, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
13.01.2016 04:27, David Rowley:
I've also done some testing:

create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,10) a(a), generate_series(1,10000) b(b);
set enable_bitmapscan=off;
set enable_indexscan=off;

select * from ab where a = 1 and b=1;
 a | b
---+---
 1 | 1
(1 row)

set enable_indexscan = on;
select * from ab where a = 1 and b=1;
 a | b
---+---
(0 rows)

This is broken. I've not looked into why yet, but from looking at the EXPLAIN output I was a bit surprised to see b=1 as an index condition. I'd have expected a Filter maybe, but I've not looked at the EXPLAIN code to see how those are determined yet.

Hmm... Do you use both patches?
And could you provide index definition, I can't reproduce the problem assuming that index is created by the statement
CREATE INDEX idx ON ab (a) INCLUDING (b);

Sorry, I forgot the index, and yes you guessed correctly about that.

The problem only exists without the omit_opclass_4.0.patch and with the covering_unique_4.0.patch, so please ignore.

I will try to review the omit_opclass_4.0.patch soon.

David

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
I will try to review the omit_opclass_4.0.patch soon.

Hi, as promised, here's my review of omit_opclass_4.0.patch.

The following comment needs to be updated:

 * indexkeys[], indexcollations[], opfamily[], and opcintype[]
 * each have ncolumns entries.

I think you'll need to do quite a bit of refactoring in this comment to explain how it all works now, and which arrays we expect to be which length.

The omit_opclass_4.0 patch should remove the following comment, which you added in the other patch:
 
/* TODO
* All these arrays below still have length = ncolumns.
* Fix, when optional opclass functionality will be added.
*
* Generally, any column could be returned by IndexOnlyScan.
* Even if it doesn't have opclass for that type of index.
*
* For example,
* we have an index "create index on tbl(c1) including c2".
* If there's no suitable oplass on c2
* query "select c2 from tbl where c2 < 10" can't use index-only scan
* and query "select c2 from tbl where c1 < 10" can.
* But now it doesn't because of requirement that
* each indexed column must have an opclass.
*/

The following comment should be updated to mention that this is only the case for
key attributes, and we just take the type from the index for including attributes.
Perhaps the comment is better outside of the if (i < nkeyatts) block too, and just
explain both at once.

/*
* The provided data is not necessarily of the type stored in the
* index; rather it is of the index opclass's input type. So look
* at rd_opcintype not the index tupdesc.
*
* Note: this is a bit shaky for opclasses that have pseudotype
* input types such as ANYARRAY or RECORD.  Currently, the
* typoutput functions associated with the pseudotypes will work
* okay, but we might have to try harder in future.
*/

In BuildIndexInfo() numKeys is a bit confusing. Perhaps that needs to be renamed to numAtts?
Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, as it contains
the including columns too by the looks of it. Maybe it just needs to drop the "Key" and become
"ii_AttrNumbers". It would be interesting to hear what others think of that.

IndexInfo *
BuildIndexInfo(Relation index)
{
IndexInfo  *ii = makeNode(IndexInfo);
Form_pg_index indexStruct = index->rd_index;
int i;
int numKeys;

/* check the number of keys, and copy attr numbers into the IndexInfo */
numKeys = indexStruct->indnatts;
if (numKeys < 1 || numKeys > INDEX_MAX_KEYS)
elog(ERROR, "invalid indnatts %d for index %u",
numKeys, RelationGetRelid(index));
ii->ii_NumIndexAttrs = numKeys;
ii->ii_NumIndexKeyAttrs = indexStruct->indnkeyatts;
Assert(ii->ii_NumIndexKeyAttrs != 0);
Assert(ii->ii_NumIndexKeyAttrs <= ii->ii_NumIndexAttrs);


Here you've pushed a chunk of code over one tab, but you don't have to do that. Just add:

+ if (i >= indexInfo->ii_NumIndexKeyAttrs)
+ continue;

This'll make the patch a bit smaller. Also, maybe it's time to get rid of your debug stuff that you've commented out?

for (i = 0; i < numKeys; i++)
ii->ii_KeyAttrNumbers[i] = indexStruct->indkey.values[i];
- if (OidIsValid(keyType) && keyType != to->atttypid)
+ if (i < indexInfo->ii_NumIndexKeyAttrs)
  {
- /* index value and heap value have different types */
- tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(keyType));
+ /*
+ * Check the opclass and index AM to see if either provides a keytype


Same for this part:

- /*
- * Identify the exclusion operator, if any.
- */
- if (nextExclOp)
+ if (attn < nkeycols)

Could become:

+ if (attn >= nkeycols)
+ continue;


I'm also wondering if indexkeys is still a good name for the IndexOptInfo struct member. 
Including columns are not really keys, but I feel renaming that might cause a fair bit of code churn, so I'd be interested to hear what others have to say.

  info->indexkeys = (int *) palloc(sizeof(int) * ncolumns);
- info->indexcollations = (Oid *) palloc(sizeof(Oid) * ncolumns);
- info->opfamily = (Oid *) palloc(sizeof(Oid) * ncolumns);
- info->opcintype = (Oid *) palloc(sizeof(Oid) * ncolumns);
+ info->indexcollations = (Oid *) palloc(sizeof(Oid) * nkeycolumns);
+ info->opfamily = (Oid *) palloc(sizeof(Oid) * nkeycolumns);
+ info->opcintype = (Oid *) palloc(sizeof(Oid) * nkeycolumns);


In quite a few places you do: int natts, nkeyatts;
but the areas where you've done this don't seem to ever declare multiple variables per type. Maybe it's best to follow what's there and just write "int" again on the next line.

If you submit an updated patch I can start looking over the change fairly soon.

Many thanks

David

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:


18.01.2016 01:02, David Rowley wrote:
On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
I will try to review the omit_opclass_4.0.patch soon.

Hi, as promised, here's my review of omit_opclass_4.0.patch.

Thank you again. All mentioned points are fixed and patches are merged.
I hope it's all right now. Please check the comments one more time; I rather doubt that I worded everything correctly.
Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, as it contains
the including columns too by the looks of it. Maybe it just needs to drop the "Key" and become
"ii_AttrNumbers". It would be interesting to hear what others think of that.

I'm also wondering if indexkeys is still a good name for the IndexOptInfo struct member. 
Including columns are not really keys, but I feel renaming that might cause a fair bit of code churn, so I'd be interested to hear what other's have to say.

I agree that KeyAttrNumbers and indexkeys are a bit confusing names, but I'd like to keep them at least in this patch.
It may be worth doing "index structures refactoring" as a separate patch.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:

12.01.2016 20:47, Jeff Janes:
> It looks like the "covering" patch, with or without the "omit_opclass"
> patch, does not support expressions as included columns:
>
> create table foobar (x text, y xml);
> create index on foobar (x) including  (md5(x));
> ERROR:  unrecognized node type: 904
> create index on foobar (x) including  ((y::text));
> ERROR:  unrecognized node type: 911
>
> I think we would probably want it to work with those (or at least to
> throw a better error message).
Thank you for the notice. I couldn't fix it quickly and added a stub in 
the latest patch.
But I'll try to fix it and add expression support a bit later.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Jeff Janes
Date:
On Tue, Jan 19, 2016 at 9:08 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>
>
> 18.01.2016 01:02, David Rowley wrote:
>
> On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com>
> wrote:
>>
>> I will try to review the omit_opclass_4.0.patch soon.
>
>
> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.
>
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time. I rather
> doubt that I wrote everything correctly.

Unfortunately there are several merge conflicts between your patch and
this commit:

commit 65c5fcd353a859da9e61bfb2b92a99f12937de3b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sun Jan 17 19:36:59 2016 -0500
   Restructure index access method API to hide most of it at the C level.


Can you rebase past that commit?

Thanks,

Jeff



Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 20 January 2016 at 06:08, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>
>
>
> 18.01.2016 01:02, David Rowley wrote:
>
> On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>
>> I will try to review the omit_opclass_4.0.patch soon.
>
>
> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.
>
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time. I rather doubt that I wrote everything correctly.


Thanks for updating.

+        for the searching or ordering of records can defined in the

should be:

+        for the searching or ordering of records can be defined in the

but perhaps "defined" should be "included".

The following is still quite wasteful. CopyIndexTuple() does a
palloc() and memcpy(), and then you throw that away if
rel->rd_index->indnatts != rel->rd_index->indnkeyatts. I think you
just need to add an "else" and move the CopyIndexTuple() below the if.

item = (IndexTuple) PageGetItem(lpage, itemid);
right_item = CopyIndexTuple(item);
+ if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
+ right_item = index_reform_tuple(rel, right_item,
rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

Tom also committed
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=65c5fcd353a859da9e61bfb2b92a99f12937de3b
So it looks like you'll need to update your pg_am.h changes. Looks
like you'll need a new struct member in IndexAmRoutine and just
populate that new member in each of the *handler functions listed in
pg_am.h

-#define Natts_pg_am 30
+#define Natts_pg_am 31

Can the following be changed to

-   (At present, only b-tree supports it.)
+   (At present, only b-tree supports it.) Columns included with clause
+   INCLUDING  aren't used to enforce uniqueness.

-   (At present, only b-tree supports it.)
+   (At present, only b-tree supports it.) Columns which are present in the
+   <literal>INCLUDING</> clause are not used to enforce uniqueness.

> Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, as it contains
> the including columns too by the looks of it. Maybe it just needs to drop the "Key" and become
> "ii_AttrNumbers". It would be interesting to hear what others think of that.
>
> I'm also wondering if indexkeys is still a good name for the IndexOptInfo struct member.
> Including columns are not really keys, but I feel renaming that might cause a fair bit of code churn, so I'd be
> interested to hear what others have to say.
>
>
> I agree that KeyAttrNumbers and indexkeys are a bit confusing names, but I'd like to keep them at least in this
> patch.
> It may be worth doing "index structures refactoring" as a separate patch.


I agree. A separate patch sounds like the best course of action, but
authoring that can wait until after this is committed (I think).

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
22.01.2016 01:47, David Rowley:
> On 20 January 2016 at 06:08, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>>
>>
>> 18.01.2016 01:02, David Rowley wrote:
>>
>> On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
>>> I will try to review the omit_opclass_4.0.patch soon.
>>
>> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.
>>
>> Thank you again. All mentioned points are fixed and patches are merged.
>> I hope it's all right now. Please check comments one more time. I rather doubt that I wrote everything correctly.
>
> Thanks for updating.
>
> +        for the searching or ordering of records can defined in the
>
> should be:
>
> +        for the searching or ordering of records can be defined in the
>
> but perhaps "defined" should be "included".
>
> The following is still quite wasteful. CopyIndexTuple() does a
> palloc() and memcpy(), and then you throw that away if
> rel->rd_index->indnatts != rel->rd_index->indnkeyatts. I think you
> just need to add an "else" and move the CopyIndexTuple() below the if.
>
> item = (IndexTuple) PageGetItem(lpage, itemid);
>    right_item = CopyIndexTuple(item);
> + if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
> + right_item = index_reform_tuple(rel, right_item,
> rel->rd_index->indnatts, rel->rd_index->indnkeyatts);
Fixed. Thank you for reminding me.
> Tom also commited
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=65c5fcd353a859da9e61bfb2b92a99f12937de3b
> So it looks like you'll need to update your pg_am.h changes. Looks
> like you'll need a new struct member in IndexAmRoutine and just
> populate that new member in each of the *handler functions listed in
> pg_am.h
>
> -#define Natts_pg_am 30
> +#define Natts_pg_am 31
Done. I hope that my patch is close to being committable too.

Thank you again for review.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Jeff Janes
Date:
On Fri, Jan 22, 2016 at 7:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>
> Done. I hope that my patch is close to the commit too.
>

Thanks for the update.

I've run into this problem:

create table foobar (x text, w text);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;

ERROR:  index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
                              ^
DETAIL:  Cannot create a primary key or unique constraint using such an index.
Time: 1.577 ms


If I instead define the table as
create table foobar (x int, w xml);

Then I can create the index and then the primary key the first time I
do this in a session.  But then if I drop the table and repeat the
process, I get "does not have default sorting behavior" error even for
this index that previously succeeded, so I think there is some kind of
problem with the backend syscache or catcache.

create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;
drop table foobar ;
create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;
ERROR:  index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
                              ^
DETAIL:  Cannot create a primary key or unique constraint using such an index.

Cheers,

Jeff



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
25.01.2016 03:32, Jeff Janes:
On Fri, Jan 22, 2016 at 7:19 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
Done. I hope that my patch is close to the commit too.

Thanks for the update.

I've run into this problem:

create table foobar (x text, w text);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;

ERROR:  index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
                              ^
DETAIL:  Cannot create a primary key or unique constraint using such an index.
Time: 1.577 ms


If I instead define the table as
create table foobar (x int, w xml);

Then I can create the index and then the primary key the first time I
do this in a session.  But then if I drop the table and repeat the
process, I get "does not have default sorting behavior" error even for
this index that previously succeeded, so I think there is some kind of
problem with the backend syscache or catcache.

create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;
drop table foobar ;
create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index
foobar_pkey;
ERROR:  index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
                              ^
DETAIL:  Cannot create a primary key or unique constraint using such an index.

Great, I've fixed that. Thank you for the tip about the cache.

I've also found and fixed a related bug in copying tables with indexes:
create table tbl2 (like tbl including all);
And there's one more tiny fix in get_pkey_attnames in the dblink module.

including_columns_3.0 is the latest version of the patch.
Changes relative to the previous version are attached as a separate patch, just to ease review and debugging.

I've changed the size of the pg_index.indclass array; it now contains indnkeyatts elements, while pg_index.indkey still contains all attributes. As a result, the wiki query Retrieve primary key columns produces a pretty non-obvious result. Is that normal behavior here, or are some changes required? Do you know of any similar queries?
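
For reference, the wiki query in question is roughly the following (quoted from memory, so the exact text may differ):

SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS data_type
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
                   AND a.attnum = ANY(i.indkey)
WHERE i.indrelid = 'tablename'::regclass
AND i.indisprimary;

Since i.indkey now lists the included attributes too, this query reports them as if they were part of the primary key.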
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
David Rowley
Date:
On 27 January 2016 at 03:35, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> including_columns_3.0 is the latest version of patch.
> And changes regarding the previous version are attached in a separate patch.
> Just to ease the review and debug.

Hi,

I've made another pass over the patch. There are still a couple of
things that I think need to be looked at.

Do we need the "b (included)" here? The key is (a) = (1). Having
irrelevant details might be confusing.

postgres=# create table a (a int not null, b int not null);
CREATE TABLE
postgres=# create unique index on a (a) including(b);
CREATE INDEX
postgres=# insert into a values(1,1);
INSERT 0 1
postgres=# insert into a values(1,1);
ERROR:  duplicate key value violates unique constraint "a_a_b_idx"
DETAIL:  Key (a, b (included))=(1, 1) already exists.

Extra tabs:
/* Truncate nonkey attributes when inserting on nonleaf pages. */
if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts
&& !P_ISLEAF(lpageop))
{
itup = index_reform_tuple(rel, itup,
rel->rd_index->indnatts, rel->rd_index->indnkeyatts);
}

In index_reform_tuple() I find it a bit scary that you change the
TupleDesc's number of attributes then set it back again once you're
finished reforming the shortened tuple.
Maybe it would be better to modify index_form_tuple() to accept a new
argument with a number of attributes, then you can just Assert that
this number is never higher than the number of attributes in the
TupleDesc.

I'm also not that keen on index_reform_tuple() in general. I wonder if
there's a way we can just keep the Datum/isnull arrays a bit longer,
and only form the tuple when needed. I've not looked into this in
detail, but it does look like reforming the tuple is not going to be
cheap.

If we do need to keep this function, I think a better name might be
index_trim_tuple() and I don't think you need to pass the original
length. It might make sense to Assert() that the trim length is
smaller than the tuple size.

What statement will cause this:

numberOfKeyAttributes = list_length(stmt->indexParams);
if (numberOfKeyAttributes <= 0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
errmsg("must specify at least one key column")));

I seem to just get errors from the parser when trying.
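
For example, the obvious candidates are rejected by the grammar before we ever reach that check (illustrative; exact error text may differ):

create table t (a int, b int);
create index on t () including (b);
ERROR:  syntax error at or near ")"
create index on t including (b);
ERROR:  syntax error at or near "including"

So perhaps it's only reachable by C callers constructing an IndexStmt directly.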


Much of this goes over 80 chars:

/*
 * We append any INCLUDING columns onto the indexParams list so that
 * we have one list with all columns. Later we can determine which of these
 * are key columns, and which are just part of the INCLUDING list by checking
 * the list position. A list item in a position less than ii_NumIndexKeyAttrs
 * is part of the key columns, and anything equal to and over is part of the
 * INCLUDING columns.
 */
stmt->indexParams = list_concat(stmt->indexParams, stmt->indexIncludingParams);

in gistrescan() there is some code:

for (attno = 1; attno <= natts; attno++)
{
TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL, scan->indexRelation->rd_opcintype[attno - 1], -1, 0);
}

Going by RelationInitIndexAccessInfo() rd_opcintype[] is allocated to
be sized by the number of key columns, but this loop goes over the
number of attribute columns.
Perhaps this is not a big problem since GIST does not support
INCLUDING columns, but it does seem wrong still.

Which brings me to the fact that I've spent a bit of time trying to
look for places where you've forgotten to change natts to nkeyatts. I
did find this one, but I don't have much confidence that there's not
lots more places that have been forgotten. Apart from this one, how
confident are you that you've found all the places? I'm getting
towards being happy with the code that I see that's been changed, but
I'm hesitant to mark as "Ready for committer" due to not being all
that comfortable that all the code that needs to be updated has been
updated. I'm not quite sure of a good way to find all these places. I'm
wondering if hacking the code so that each btree index which is
created with > 1 column puts all but the first column into the
INCLUDING columns, then run the regression tests to see if there are
any crashes. I'm really not that sure of how else to increase the
confidence levels on this. Do you have ideas?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
31.01.2016 11:04, David Rowley:
> On 27 January 2016 at 03:35, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> including_columns_3.0 is the latest version of patch.
>> And changes regarding the previous version are attached in a separate patch.
>> Just to ease the review and debug.
> Hi,
>
> I've made another pass over the patch. There's still a couple of
> things that I think need to be looked at.
Thank you again.
I'm just writing here to say that I haven't disappeared and I do remember
about the issue.
But I'm very busy this week. I'll send an updated patch next week
as soon as possible.

> Do we need the "b (included)" here? The key is (a) = (1). Having
> irrelevant details might be confusing.
>
> postgres=# create table a (a int not null, b int not null);
> CREATE TABLE
> postgres=# create unique index on a (a) including(b);
> CREATE INDEX
> postgres=# insert into a values(1,1);
> INSERT 0 1
> postgres=# insert into a values(1,1);
> ERROR:  duplicate key value violates unique constraint "a_a_b_idx"
> DETAIL:  Key (a, b (included))=(1, 1) already exists.
I thought that it could be strange if the user inserts two values and then
sees only one of them in the error message.
But now I see that you're right. I'll also look at the same functionality
in other DBs and fix it.

> In index_reform_tuple() I find it a bit scary that you change the
> TupleDesc's number of attributes then set it back again once you're
> finished reforming the shortened tuple.
> Maybe it would be better to modify index_form_tuple() to accept a new
> argument with a number of attributes, then you can just Assert that
> this number is never higher than the number of attributes in the
> TupleDesc.
Good point.
I agree that this function is a bit strange. I have to set
tupdesc->natts to maintain compatibility with index_form_tuple().
I didn't want to add either a new field to TupleDesc or a new
parameter to index_form_tuple(), because they are used widely.
> I'm also not that keen on index_reform_tuple() in general. I wonder if
> there's a way we can just keep the Datum/isnull arrays a bit longer,
> and only form the tuple when needed. I've not looked into this in
> detail, but it does look like reforming the tuple is not going to be
> cheap.
It is used in splits, for example. There is no Datum array; we just move
the tuple key from a child page to a parent page, or something like that.
And according to the INCLUDING algorithm, we need to truncate non-key attributes.
> If we do need to keep this function, I think a better name might be
> index_trim_tuple() and I don't think you need to pass the original
> length. It might make sense to Assert() that the trim length is
> smaller than the tuple size

As regards the performance, I don't think that it's a big problem here.
Do you suggest doing it in the following way: memcpy(oldtup, newtup,
newtuplength)?
I will test it.
> in gistrescan() there is some code:
>
> for (attno = 1; attno <= natts; attno++)
> {
> TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL,
>    scan->indexRelation->rd_opcintype[attno - 1],
>    -1, 0);
> }
>
> Going by RelationInitIndexAccessInfo() rd_opcintype[] is allocated to
> be sized by the number of key columns, but this loop goes over the
> number of attribute columns.
> Perhaps this is not a big problem since GIST does not support
> INCLUDING columns, but it does seem wrong still.

GiST doesn't support the INCLUDING clause, so natts and nkeyatts are always
equal; I don't see any problem here.
And I think that it's extra work beyond this patch. Maybe I or someone
else will add this feature to other access methods later.
> Which brings me to the fact that I've spent a bit of time trying to
> look for places where you've forgotten to change natts to nkeyatts. I
> did find this one, but I don't have much confidence that there's not
> lots more places that have been forgotten. Apart from this one, how
> confident are you that you've found all the places? I'm getting
> towards being happy with the code that I see that's been changed, but
> I'm hesitant to mark as "Ready for committer" due to not being all
> that comfortable that all the code that needs to be updated has been
> updated. I'm not quite sure of a good way to find all these places.
I found all mentions of natts and other related variables with grep, and
replaced (or expanded) them with nkeyatts where it was necessary.
As mentioned before, I didn't change other AMs.
I strongly agree that any changes related to btree require thorough 
inspection, so I'll recheck it again. But I'm almost sure that it's okay.

> I wondering if hacking the code so that each btree index which is
> created with > 1 column puts all but the first column into the
> INCLUDING columns, then run the regression tests to see if there are
> any crashes. I'm really not that sure of how else to increase the
> confidence levels on this. Do you have ideas?

Do I understand correctly that you suggest replacing all multicolumn
indexes with (1 key column) + included columns?
I don't think it's a good idea. The INCLUDING clause brings some
disadvantages. For example, included columns must be filtered after the
search, while key columns can be used in the scan key directly. I already
mentioned this in the test example:

explain analyze select c1, c2 from tbl where c1<10000 and c3<20;
If the columns' opclasses are used, the new query plan uses them in the
Index Cond: ((c1 < 10000) AND (c3 < 20))
Otherwise, the new query cannot use the included column in the Index Cond
and uses a filter instead:
Index Cond: (c1 < 10000)
Filter: (c3 < 20)
Rows Removed by Filter: 9993
That slows down the query significantly.
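
To make the comparison concrete, here is a minimal sketch (table and index names are illustrative):

create table tbl (c1 int, c2 int, c3 int);
-- c3 as a key column: the planner can put c3 < 20 into the Index Cond
create index tbl_key_idx on tbl (c1, c2, c3);
-- c3 as an included column: c3 < 20 can only be applied as a Filter
create index tbl_inc_idx on tbl (c1, c2) including (c3);
explain analyze select c1, c2 from tbl where c1 < 10000 and c3 < 20;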

And besides that, we still want to have multicolumn unique indexes. 
CREATE UNIQUE INDEX on tbl (a, b, c) INCLUDING (d);

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Alvaro Herrera
Date:
Anastasia Lubennikova wrote:

> I just write here to say that I do not disappear and I do remember about the
> issue.
> But I'm very very busy this week. I'll send an updated patch next week as
> soon as possible.

That's great to know, thanks.  I moved your patch to the next
commitfest.  Please do submit a new version before it starts!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
02.02.2016 15:50, Anastasia Lubennikova:
> 31.01.2016 11:04, David Rowley:
>> On 27 January 2016 at 03:35, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>> including_columns_3.0 is the latest version of patch.
>>> And changes regarding the previous version are attached in a
>>> separate patch.
>>> Just to ease the review and debug.
>> Hi,
>>
>> I've made another pass over the patch. There's still a couple of
>> things that I think need to be looked at.
> Thank you again.
> I just write here to say that I do not disappear and I do remember
> about the issue.
> But I'm very very busy this week. I'll send an updated patch next week
> as soon as possible.
>

As promised, here's the new version of the patch "including_columns_4.0".
I fixed all issues except some points mentioned below.
Besides, I did some refactoring:
- use the macros IndexRelationGetNumberOfAttributes,
IndexRelationGetNumberOfKeyAttributes, and RelationGetNumberOfAttributes
where possible. Maybe these are somewhat unrelated changes, but
they'll make development much easier in the future.
- rename related variables to indnatts, indnkeyatts.
>> I'm also not that keen on index_reform_tuple() in general. I wonder if
>> there's a way we can just keep the Datum/isnull arrays a bit longer,
>> and only form the tuple when needed. I've not looked into this in
>> detail, but it does look like reforming the tuple is not going to be
>> cheap.
> It is used in splits, for example. There is no datum array, we just
> move tuple key from a child page to a parent page or something like that.
> And according to INCLUDING algorithm we need to truncate nonkey
> attributes.
>> If we do need to keep this function, I think a better name might be
>> index_trim_tuple() and I don't think you need to pass the original
>> length. It might make sense to Assert() that the trim length is
>> smaller than the tuple size
>
> As regards the performance, I don't think that it's a big problem here.
> Do you suggest to do it in a following way memcpy(oldtup, newtup,
> newtuplength)?

I've tested it some more, and still didn't find any performance issues.

>> in gistrescan() there is some code:
>>
>> for (attno = 1; attno <= natts; attno++)
>> {
>> TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL,
>>    scan->indexRelation->rd_opcintype[attno - 1],
>>    -1, 0);
>> }
>>
>> Going by RelationInitIndexAccessInfo() rd_opcintype[] is allocated to
>> be sized by the number of key columns, but this loop goes over the
>> number of attribute columns.
>> Perhaps this is not a big problem since GIST does not support
>> INCLUDING columns, but it does seem wrong still.
>
> GiST doesn't support INCLUDING clause, so natts and nkeyatts are
> always equal. I don't see any problem here.
> And I think that it's an extra work to this patch. Maybe I or someone
> else would add this feature to other access methods later.

Still the same.
>> Which brings me to the fact that I've spent a bit of time trying to
>> look for places where you've forgotten to change natts to nkeyatts. I
>> did find this one, but I don't have much confidence that there's not
>> lots more places that have been forgotten. Apart from this one, how
>> confident are you that you've found all the places? I'm getting
>> towards being happy with the code that I see that's been changed, but
>> I'm hesitant to mark as "Ready for committer" due to not being all
>> that comfortable that all the code that needs to be updated has been
>> updated. I'm not quite sure of a good way to find all these places.
> I found all mentions of natts and other related variables with grep,
> and replaced (or expand) them with nkeyatts where it was necessary.
> As mentioned before, I didn't change other AMs.
> I strongly agree that any changes related to btree require thorough
> inspection, so I'll recheck it again. But I'm almost sure that it's okay.
>
I rechecked everything again and fixed a couple of omissions. Thank you
for being an exacting reviewer)
I don't know how to guarantee that everything is OK, and I have no idea
what else I can do.

>> I wondering if hacking the code so that each btree index which is
>> created with > 1 column puts all but the first column into the
>> INCLUDING columns, then run the regression tests to see if there are
>> any crashes. I'm really not that sure of how else to increase the
>> confidence levels on this. Do you have ideas?
>
> Do I understand correctly that you suggest to replace all multicolumn
> indexes with (1key column) + included?
> I don't think it's a good idea. INCLUDING clause brings some
> disadvantages. For example, included columns must be filtered after
> the search, while key columns could be used in scan key directly. I
> already mentioned this in test example:
>
> explain analyze select c1, c2 from tbl where c1<10000 and c3<20;
> If columns' opclasses are used, new query plan uses them in  Index
> Cond: ((c1 < 10000) AND (c3 < 20))
> Otherwise, new query can not use included column in Index Cond and
> uses filter instead:
> Index Cond: (c1 < 10000)
> Filter: (c3 < 20)
> Rows Removed by Filter: 9993
> It slows down the query significantly.
>
> And besides that, we still want to have multicolumn unique indexes.
> CREATE UNIQUE INDEX on tbl (a, b, c) INCLUDING (d);
>

I started a new thread about related refactoring, because I think that
it should be a separate patch.
http://www.postgresql.org/message-id/56BB7788.30808@postgrespro.ru

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Jeff Janes
Date:
On Thu, Feb 11, 2016 at 8:46 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 02.02.2016 15:50, Anastasia Lubennikova:

>
> As promised, here's the new version of the patch "including_columns_4.0".
> I fixed all issues except some points mentioned below.

Thanks for the updated patch.  I get a compiler warning:

genam.c: In function 'BuildIndexValueDescription':
genam.c:259: warning: unused variable 'tupdesc'

Also, I can't create a primary key INCLUDING columns directly:

jjanes=# create table foobar (a int, b int, c int);
jjanes=# alter table foobar add constraint foobar_pkey primary key
(a,b) including (c);
ERROR:  syntax error at or near "including"

But I can get there using a circuitous route:

jjanes=# create unique index on foobar (a,b) including (c);
jjanes=# alter table foobar add constraint foobar_pkey primary key
using index foobar_a_b_c_idx;

The description of the table's index knows to include the including column:

jjanes=# \d foobar
      Table "public.foobar"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
 c      | integer |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)


Since the machinery appears to all be in place to have primary keys
with INCLUDING columns, it would be nice if the syntax for adding
primary keys allowed one to implement them directly.

Is this something for future expansion, or could it be added at the
same time as the main patch?

I think it would be pretty frustrating for the user to be unable to do
this right from the start.

Cheers,

Jeff



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
25.02.2016 21:39, Jeff Janes:
>> As promised, here's the new version of the patch "including_columns_4.0".
>> I fixed all issues except some points mentioned below.
> Thanks for the update patch.  I get a compiler warning:
>
> genam.c: In function 'BuildIndexValueDescription':
> genam.c:259: warning: unused variable 'tupdesc'

Thank you for the notice, I'll fix it in the next update.
> Also, I can't create a primary key INCLUDING columns directly:
>
> jjanes=# create table foobar (a int, b int, c int);
> jjanes=# alter table foobar add constraint foobar_pkey primary key
> (a,b) including (c);
> ERROR:  syntax error at or near "including"
>
> But I can get there using a circuitous route:
>
> jjanes=# create unique index on foobar (a,b) including (c);
> jjanes=# alter table foobar add constraint foobar_pkey primary key
> using index foobar_a_b_c_idx;
>
> The description of the table's index knows to include the including column:
>
> jjanes=# \d foobar
>      Table "public.foobar"
>   Column |  Type   | Modifiers
> --------+---------+-----------
>   a      | integer | not null
>   b      | integer | not null
>   c      | integer |
> Indexes:
>      "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>
>
> Since the machinery appears to all be in place to have primary keys
> with INCLUDING columns, it would be nice if the syntax for adding
> primary keys allowed one to implement them directly.
>
> Is this something or future expansion, or could it be added at the
> same time as the main patch?

Good point.
At a quick glance, this looks easy to implement. The only problem is
that there are many places in the code which must be updated.
I'll try to do it, and if there are difficulties, it's fine with me
to postpone this feature as future work.

I found one more thing to do: pg_dump does not handle included columns
now. I will fix it in the next version of the patch.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
29.02.2016 18:17, Anastasia Lubennikova:
> 25.02.2016 21:39, Jeff Janes:
>>> As promised, here's the new version of the patch 
>>> "including_columns_4.0".
>>> I fixed all issues except some points mentioned below.
>> Thanks for the update patch.  I get a compiler warning:
>>
>> genam.c: In function 'BuildIndexValueDescription':
>> genam.c:259: warning: unused variable 'tupdesc'
>
> Thank you for the notice, I'll fix it in the next update.
>> Also, I can't create a primary key INCLUDING columns directly:
>>
>> jjanes=# create table foobar (a int, b int, c int);
>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>> (a,b) including (c);
>> ERROR:  syntax error at or near "including"
>>
>> But I can get there using a circuitous route:
>>
>> jjanes=# create unique index on foobar (a,b) including (c);
>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>> using index foobar_a_b_c_idx;
>>
>> The description of the table's index knows to include the including 
>> column:
>>
>> jjanes=# \d foobar
>>      Table "public.foobar"
>>   Column |  Type   | Modifiers
>> --------+---------+-----------
>>   a      | integer | not null
>>   b      | integer | not null
>>   c      | integer |
>> Indexes:
>>      "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>>
>>
>> Since the machinery appears to all be in place to have primary keys
>> with INCLUDING columns, it would be nice if the syntax for adding
>> primary keys allowed one to implement them directly.
>>
>> Is this something or future expansion, or could it be added at the
>> same time as the main patch?
>
> Good point.
> At quick glance, this looks easy to implement it. The only problem is 
> that there are too many places in code which must be updated.
> I'll try to do it, and if there would be difficulties, it's fine with 
> me to delay this feature for the future work.
>
> I found one more thing to do. Pgdump does not handle included columns 
> now. I will fix it in the next version of the patch.
>

As promised, the fixed patch is attached. It allows the
following statements:

create table utbl (a int, b box);
alter table utbl add unique (a) including(b);
create table ptbl (a int, b box);
alter table ptbl add primary key (a) including(b);

And now they can be dumped/restored successfully.
I used the following commands
pg_dump --verbose -Fc postgres -f pg.dump
pg_restore -d newdb pg.dump

It is not the final version, because it breaks pg_dump for previous
versions. I need some help from hackers here.
pg_dump.c, line 5466:
if (fout->remoteVersion >= 90400)

What does 'remoteVersion' mean? And what is the right way to change it?
Does it change between releases?
I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really so?
That is totally new to me.
BTW, while we are on the subject, maybe it's worth replacing these
magic numbers with some set of macros?

P.S. I'll update documentation for ALTER TABLE in the next patch.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
01.03.2016 19:55, Anastasia Lubennikova:
>
> 29.02.2016 18:17, Anastasia Lubennikova:
>> 25.02.2016 21:39, Jeff Janes:
>>>> As promised, here's the new version of the patch
>>>> "including_columns_4.0".
>>>> I fixed all issues except some points mentioned below.
>>> Thanks for the update patch.  I get a compiler warning:
>>>
>>> genam.c: In function 'BuildIndexValueDescription':
>>> genam.c:259: warning: unused variable 'tupdesc'
>>
>> Thank you for the notice, I'll fix it in the next update.
>>> Also, I can't create a primary key INCLUDING columns directly:
>>>
>>> jjanes=# create table foobar (a int, b int, c int);
>>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>>> (a,b) including (c);
>>> ERROR:  syntax error at or near "including"
>>>
>>> But I can get there using a circuitous route:
>>>
>>> jjanes=# create unique index on foobar (a,b) including (c);
>>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>>> using index foobar_a_b_c_idx;
>>>
>>> The description of the table's index knows to include the including
>>> column:
>>>
>>> jjanes=# \d foobar
>>>      Table "public.foobar"
>>>   Column |  Type   | Modifiers
>>> --------+---------+-----------
>>>   a      | integer | not null
>>>   b      | integer | not null
>>>   c      | integer |
>>> Indexes:
>>>      "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>>>
>>>
>>> Since the machinery appears to all be in place to have primary keys
>>> with INCLUDING columns, it would be nice if the syntax for adding
>>> primary keys allowed one to implement them directly.
>>>
>>> Is this something or future expansion, or could it be added at the
>>> same time as the main patch?
>>
>> Good point.
>> At quick glance, this looks easy to implement it. The only problem is
>> that there are too many places in code which must be updated.
>> I'll try to do it, and if there would be difficulties, it's fine with
>> me to delay this feature for the future work.
>>
>> I found one more thing to do. Pgdump does not handle included columns
>> now. I will fix it in the next version of the patch.
>>
>
> As promised, fixed patch is in attachments. It allows to perform
> following statements:
>
> create table utbl (a int, b box);
> alter table utbl add unique (a) including(b);
> create table ptbl (a int, b box);
> alter table ptbl add primary key (a) including(b);
>
> And now they can be dumped/restored successfully.
> I used following settings
> pg_dump --verbose -Fc postgres -f pg.dump
> pg_restore -d newdb pg.dump
>
> It is not the final version, because it breaks pg_dump for previous
> versions. I need some help from hackers here.
> pgdump. line 5466
> if (fout->remoteVersion >= 90400)
>
> What does 'remoteVersion' mean? And what is the right way to change
> it? Or it changes between releases?
> I guess that 90400 is for 9.4 and 80200 is for 8.2 but is it really
> so? That is totally new to me.
> BTW, While we are on the subject, maybe it's worth to replace these
> magic numbers with some set of macro?
>
> P.S. I'll update documentation for ALTER TABLE in the next patch.

Sorry for the missing attachment. Now it's here.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Michael Paquier
Date:
On Wed, Mar 2, 2016 at 2:10 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 01.03.2016 19:55, Anastasia Lubennikova:
>> It is not the final version, because it breaks pg_dump for previous
>> versions. I need some help from hackers here.
>> pgdump. line 5466
>> if (fout->remoteVersion >= 90400)
>>
>> What does 'remoteVersion' mean? And what is the right way to change it? Or
>> it changes between releases?
>> I guess that 90400 is for 9.4 and 80200 is for 8.2 but is it really so?
>> That is totally new to me.

Yes, you got it. That's basically PG_VERSION_NUM as compiled on the
server that has been queried, in this case the server from which a
dump is taken. If you are changing the system catalog layer, you would
need to provide a query at least equivalent to what has been done
until now for your patch, then modify pg_dump as follows:

if (fout->remoteVersion >= 90600)
{
    query = my_new_query;
}
else if (fout->remoteVersion >= 90400)
{
    query = the existing 9.4 query;
}
etc.

In short you just need to add a new block so that remote servers running
9.6 or newer will be able to dump objects correctly. pg_upgrade is a good
way to check the validity of pg_dump, actually; this explains why some
objects are not dropped in the regression tests. Perhaps you'd want to
do the same with your patch if the current test coverage of pg_dump is
not enough. I have not looked at your patch so I cannot say for sure.
-- 
Michael



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
02.03.2016 08:50, Michael Paquier:
> On Wed, Mar 2, 2016 at 2:10 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 01.03.2016 19:55, Anastasia Lubennikova:
>>> It is not the final version, because it breaks pg_dump for previous
>>> versions. I need some help from hackers here.
>>> pgdump. line 5466
>>> if (fout->remoteVersion >= 90400)
>>>
>>> What does 'remoteVersion' mean? And what is the right way to change it? Or
>>> it changes between releases?
>>> I guess that 90400 is for 9.4 and 80200 is for 8.2 but is it really so?
>>> That is totally new to me.
> Yes, you got it. That's basically PG_VERSION_NUM as compiled on the
> server that has been queried, in this case the server from which a
> dump is taken. If you are changing the system catalog layer, you would
> need to provide a query at least equivalent to what has been done
> until now for your patch, the modify pg_dump as follows:
> if (fout->remoteVersion >= 90600)
> {
>      query = my_new_query;
> }
> else if (fout->remoteVersion >= 90400)
> {
>      query = the existing 9.4 query
> }
> etc.
>
> In short you just need to add a new block so as remote servers newer
> than 9.6 will be able to dump objects correctly. pg_upgrade is a good
> way to check the validity of pg_dump actually, this explains why some
> objects are not dropped in the regression tests. Perhaps you'd want to
> do the same with your patch if the current test coverage of pg_dump is
> not enough. I have not looked at your patch so I cannot say for sure.

Thank you for the explanation.
The new version of the patch implements pg_dump support properly.
The documentation related to constraints is updated.

I hope the patch is in good shape now. A brief overview for reviewers:

This patch allows unique indexes to be defined on one set of columns
and to include another set of columns in the INCLUDING clause, on which
uniqueness is not enforced. It allows more queries to benefit
from using index-only scans. Currently, only the B-tree access method
supports this feature.

Syntax example:
CREATE TABLE tbl (c1 int, c2 int, c3 box);
CREATE INDEX idx ON tbl (c1) INCLUDING (c2, c3);

In contrast to key columns (c1), included columns (c2, c3) are not used
in index scan keys, neither in "search" scan keys nor in "insertion" scan keys.
Included columns are stored only in leaf pages, which can slightly
reduce index size. Hence, included columns do not require any opclass
for the btree access method. As you can see from the example above, it's
possible to add columns of type "box" to the index.
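
A small illustration of that (a sketch; the plan output is what one would expect, not verbatim):

CREATE TABLE tbl (c1 int, c2 int, c3 box);
CREATE INDEX idx ON tbl (c1) INCLUDING (c2, c3);
SET enable_seqscan = off;
EXPLAIN SELECT c3 FROM tbl WHERE c1 = 42;
-- expected: Index Only Scan using idx on tbl
--           Index Cond: (c1 = 42)

There is no default btree opclass for "box", so this index would be impossible to create without the INCLUDING clause.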

The most common use case for this feature is the combination of a UNIQUE or
PRIMARY KEY constraint on columns (a,b) and a covering index on columns
(a,b,c).
So there is new syntax for constraints.

CREATE TABLE tblu (c1 int, c2 int, c3 box, UNIQUE (c1,c2) INCLUDING (c3));
The index created for this constraint contains three columns:
"tblu_c1_c2_c3_key" UNIQUE CONSTRAINT, btree (c1, c2) INCLUDING (c3)

CREATE TABLE tblpk (c1 int, c2 int, c3 box, PRIMARY KEY (c1) INCLUDING (c3));
The index created for this constraint contains two columns. Note that the
NOT NULL constraint, like the unique constraint, is applied only to the key
column(s).

postgres=# \d tblpk
      Table "public.tblpk"
  Column |  Type   | Modifiers
--------+---------+-----------
  c1     | integer | not null
  c2     | integer |
  c3     | box     |
Indexes:
     "tblpk_pkey" PRIMARY KEY, btree (c1) INCLUDING (c3)

Same for ALTER TABLE statements:
CREATE TABLE tblpka (c1 int, c2 int, c3 box);
ALTER TABLE tblpka ADD PRIMARY KEY (c1) INCLUDING (c3);

pg_dump is updated and seems to work fine with this kind of index.

I see only one problem left (maybe I've mentioned it before).
Queries like this [1] must be rewritten, because after the catalog changes,
i.indkey contains both key and included attributes.
One more thing to do is some refactoring of names, since "indkey"
looks really confusing to me. But that could be done as a separate patch [2].


[1] https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns
[2] http://www.postgresql.org/message-id/56BB7788.30808@postgrespro.ru

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
David Steele
Date:
On 3/14/16 9:57 AM, Anastasia Lubennikova wrote:

> New version of the patch implements pg_dump well.
> Documentation related to constraints is updated.
> 
> I hope, that patch is in a good shape now.

It looks like this patch should be marked "needs review" and I have done so.

-- 
-David
david@pgmasters.net



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Mar 18, 2016 at 5:15 AM, David Steele <david@pgmasters.net> wrote:
> It looks like this patch should be marked "needs review" and I have done so.

Uh, no it shouldn't. I've posted an extensive review on the original
design thread. See CF entry:

https://commitfest.postgresql.org/9/433/

Marked "Waiting on Author".

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
19.03.2016 08:00, Peter Geoghegan:
> On Fri, Mar 18, 2016 at 5:15 AM, David Steele <david@pgmasters.net> wrote:
>> It looks like this patch should be marked "needs review" and I have done so.
> Uh, no it shouldn't. I've posted an extensive review on the original
> design thread. See CF entry:
>
> https://commitfest.postgresql.org/9/433/
>
> Marked "Waiting on Author".
Thanks to David,
I missed these letters at first.
I'll answer here.

> * You truncate (remove suffix attributes -- the "included" attributes)
> within _bt_insertonpg():
>
> -   right_item = CopyIndexTuple(item);
> +   indnatts = IndexRelationGetNumberOfAttributes(rel);
> +   indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> +
> +   if (indnatts != indnkeyatts)
> +   {
> +       right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts);
> +       right_item_sz = IndexTupleDSize(*right_item);
> +       right_item_sz = MAXALIGN(right_item_sz);
> +   }
> +   else
> +       right_item = CopyIndexTuple(item);
>      ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY);
>
> I suggest that you do this within _bt_insert_parent(), instead, iff
> the original target page is known to be a leaf page. That's where it
> needs to happen for conventional suffix truncation, which has special
> considerations when determining which attributes are safe to truncate
> (or even which byte in the first distinguishing attribute it is okay
> to truncate past)

I agree that _bt_insertonpg() is not right for truncation.
Furthermore, I've noticed that all internal keys are solely copies
of "high keys" from the leaf pages, which is pretty logical.
Therefore, if we have already truncated the tuple when it became a high
key, we do not need the same truncation within _bt_insert_parent() or
any other function.
So the only thing to worry about is the high key truncation. I rewrote
the code; now only _bt_split() cares about truncation.

It's a bit more complicated to add it to the index creation algorithm.
There's a trick with the "high key".
        /*
         * We copy the last item on the page into the new page, and then
         * rearrange the old page so that the 'last item' becomes its high key
         * rather than a true data item.  There had better be at least two
         * items on the page already, else the page would be empty of useful
         * data.
         */
        /*
         * Move 'last' into the high key position on opage
         */

To be consistent with other steps of the algorithm (all high keys must be
truncated tuples), I had to update this high key in place:
delete the old one, and insert the truncated high key.
The very same logic I use to truncate the posting list of a compressed tuple
in the "btree_compression" patch. [1]
I hope both patches will be accepted, and then I'll thoroughly merge them.

> * I think the comparison logic may have a bug.
>
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal.

It is a very important issue. But I don't think it's a bug there.
I've read the amcheck sources thoroughly and found that the problem appears
in invariant_key_less_than_equal_nontarget_offset().


static bool
invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
                                                Page nontarget, ScanKey key,
                                                OffsetNumber upperbound)
{
     int16        natts = state->rel->rd_rel->relnatts;
     int32        cmp;

     cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);

     return cmp <= 0;
}

It uses a scankey made with _bt_mkscankey(), which uses only key
attributes, but calls _bt_compare() with the wrong keysz.
If we use nkeyatts = state->rel->rd_index->indnkeyatts; instead of
natts, all the checks pass successfully.

Same for invariant_key_greater_than_equal_offset() and
invariant_key_less_than_equal_offset().

In my view, it's the correct way to fix this problem, because the caller
is responsible for passing proper arguments to the function.
Of course I will add a check into _bt_compare(), but I'd rather make it an
assertion (see the patch attached).

I'll add a flag to distinguish regular and truncated tuples, but it will
not be used in this patch. Please comment if I've missed something.
As you've already mentioned, neither high keys nor tuples on internal
pages use "itup->t_tid.ip_posid", so I'll take one bit of it.

It will definitely require changes in future work on suffix
truncation or something like that, but IMHO for now it's enough.

Do you have any objections or comments?

[1] https://commitfest.postgresql.org/9/494/

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
21.03.2016 19:53, Anastasia Lubennikova:
> [...]

One more version of the patch is attached. I did more testing and fixed
a couple of bugs.
Now, if any indexed column is deleted from the table, we perform cascade
deletion of the constraint and index.
/*
  * 3.1 Test ALTER TABLE tbl DROP COLUMN c.
  * Included column deletion leads to the index deletion,
  * as well as key columns deletion. It's explained in documentation.
  */
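
To illustrate the cascade described above (a sketch, not taken verbatim from
the regression test):

CREATE TABLE t (c1 int, c2 int, c3 int);
CREATE UNIQUE INDEX t_idx ON t (c1) INCLUDING (c3);
ALTER TABLE t DROP COLUMN c3;  -- drops t_idx too, since c3 is part of the index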
The constraint definition is fixed too.

Also, I added a separate regression test for the INCLUDING clause that
covers both indexes and constraints.
I've tested pg_dump and didn't find any problems. A test script is attached.

It seems to me that the patch is complete,
except, maybe, for a grammar check of the comments and documentation.

Looking forward to your review.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> It seems to me that the patch is complete,
> except, maybe, for a grammar check of the comments and documentation.
>
> Looking forward to your review.
Are there any objections to it? I'm planning to look closely today or tomorrow 
and commit it.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Mon, Apr 4, 2016 at 7:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Are there any objections to it? I'm planning to look closely today or
> tomorrow and commit it.

I object to committing the patch in that time frame. I'm looking at it again.


-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Apr 4, 2016 at 7:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Are there any objections to it? I'm planning to look closely today or
>> tomorrow and commit it.

> I object to committing the patch in that time frame. I'm looking at it again.

Since it's a rather complex patch, pushing it in advance of the reviewers
signing off on it doesn't seem like a great idea ...
        regards, tom lane



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Thanks to David,
> I missed these letters at first.
> I'll answer here.

Sorry about using the wrong thread.

> I agree that _bt_insertonpg() is not right for truncation.

Cool.

> It's a bit more complicated to add it to the index creation algorithm.
> There's a trick with the "high key".
>         /*
>          * We copy the last item on the page into the new page, and then
>          * rearrange the old page so that the 'last item' becomes its high key
>          * rather than a true data item.  There had better be at least two
>          * items on the page already, else the page would be empty of useful
>          * data.
>          */
>         /*
>          * Move 'last' into the high key position on opage
>          */
>
> To be consistent with other steps of the algorithm (all high keys must be
> truncated tuples), I had to update this high key in place:
> delete the old one, and insert the truncated high key.

Hmm. But the high key comparing equal to the Scankey gives insertion
the choice of where to put its IndexTuple (it can go on the page with
the high key, or its right-sibling, according only to considerations
about fillfactor, etc). Is this changed? Does it not matter? Why not?
Is it just worth it?

The right-most page on every level has no high-key. But you say those
pages have an "imaginary" *positive* infinity high key, just as
internal pages have (non-imaginary) minus infinity downlinks as their
first item/downlink. So tuples in a (say) leaf page are always bound
by the downlink lower bound in parent, while their own high key is an
upper bound. Either (and, rarely, both) could be (positive or
negative) infinity.

Maybe you now see why I talked about special _bt_compare() logic for
this. I proposed special logic that is similar to the existing minus
infinity thing _bt_compare() does (although _bt_binsrch(), an
important caller of _bt_compare() also does special things for
internal .vs leaf case, so I'm not sure any new special logic must go
in _bt_compare()).

> It is a very important issue. But I don't think it's a bug there.
> I've read the amcheck sources thoroughly and found that the problem appears
> in invariant_key_less_than_equal_nontarget_offset().

> It uses a scankey made with _bt_mkscankey(), which uses only key attributes,
> but calls _bt_compare() with the wrong keysz.
> If we use nkeyatts = state->rel->rd_index->indnkeyatts; instead of natts,
> all the checks pass successfully.

I probably shouldn't have brought amcheck into that particular
discussion. I thought amcheck might be a useful way to frame the
discussion, because amcheck always cares about specific invariants,
and notes a few special cases.

> In my view, it's the correct way to fix this problem, because the caller is
> responsible for passing proper arguments to the function.
> Of course I will add a check into _bt_compare(), but I'd rather make it an
> assertion (see the patch attached).

I see what you mean, but I think we need to decide what to do about
the key space when leaf high keys are truncated. I do think that
truncating the high key was the right idea, though, and it nicely
illustrates that nothing special should happen in upper levels. Suffix
truncation should only happen when leaf pages are split, generally
speaking.

As I said, the high key is very similar to the downlinks, in that both
bound the things that go on each page. Together with downlinks they
represent *discrete* ranges *unambiguously*, so INCLUDING truncation
needs to make it clear which page new items go on. As I said,
_bt_binsrch() already takes special actions for internal pages, making
sure to return the item that is < the scankey, not <= the scankey
which is only allowed for leaf pages. (See README, from "Lehman and
Yao assume that the key range for a subtree S is described by Ki < v
<= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
page...").

To give a specific example, I worry about the case where two sibling
downlinks in a parent page are distinct, but per specific-to-Postgres
"Ki <= v <= Ki+1" thing (which differs from the classic L&Y
invariant), some tuples with all right downlink's attributes matching
end up in left child page, not right child page. I worry that since
_bt_findsplitloc() doesn't consider this (for example), the split
point doesn't *reliably* and unambiguously divide the key space
between the new halves of a page being split. I think the "Ki <= v <=
Ki+1"/_bt_binsrch() thing might save you in common cases where all
downlink attributes are distinct, so maybe that simpler case is okay.
But to be even more specific, what about the more complicated case
where the downlinks *are* fully _bt_compare()-wise equal? This could
happen even though they're constrained to be unique in leaf pages, due
to bloat. Unique indexes aren't special here; they just make it far
less likely that this would happen in practice, because it takes a lot
of bloat. Less importantly, when that bloat happens, you don't want to
have to do a linear scan through many leaf pages (that should only
happen when there are many fully matching IndexTuples at the leaf
level -- not just matching on constrained attributes).

The more I think about it, the more I doubt that it's okay to not
ensure downlinks are always distinct with their siblings, by sometimes
including non-constrained (truncatable) attributes within internal
pages, as needed to *distinguish* downlinks (also, we must
occasionally have *all* attributes including truncatable attributes in
internal pages -- we must truncate nothing to keep the key space sane
in the parent). Unfortunately, these requirements are very close to
the actual full requirements for a full, complete suffix truncation
patch, including storing how many attributes are stored in each and
every internal IndexTuple (no general thing for the index), page split
code to determine where to truncate to make adjacent downlinks
distinct, etc.

You may think: But that fully-matching-downlink case is okay, because
it only makes us do more linear scanning due to the lack of
non-truncatable attributes, which is still correct, if a little more
slow when there is bloat -- at the leaf level, we'll start at the
correct place (the first place the item could be on), per the "Ki <= v
<= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
need to be able to reliably detect a concurrent page-split. Otherwise,
we'll move right within _bt_search() before even considering if
anything of interest for our index scan *might* be on the initial page
found from downlink (before even calling _bt_binsrch()). Even this bug
wouldn't happen in the common case where nextkey = true, but what
about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
we are not supposed to by spuriously moving right, I think. I have a
bad feeling that even then we'd "accidentally fail to fail", because
of how backwards scans work at a higher level, but it's just too hard
to prove that that is correct. It's just too complicated to rely on so
much from a great distance.

This might not be the simplest example of where we could run into
trouble, but it's one example that I could see. The assumption that
downlinks and highkeys discretely separate ranges in the key space is
probably made many times. There could be more problematic spots, and
it's really hard to know where they might be. :-(

In general, it's common for any modification to the B-Tree code to
only break in a very subtle way, like this. I would be more
comfortable if I knew the patch received extensive stress-testing,
probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
generally, I believe we should not allow the key space to fail to be
separated fully by downlinks and high keys, even if our original "Ki
<= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
happens to mask the problems in simple testing. It's too different to
what we have today.

> I'll add a flag to distinguish regular and truncated tuples, but it will not
> be used in this patch. Please comment if I've missed something.
> As you've already mentioned, neither high keys nor tuples on internal pages
> use "itup->t_tid.ip_posid", so I'll take one bit of it.
>
> It will definitely require changes in future work on suffix truncation
> or something like that, but IMHO for now it's enough.

I think that we need to discuss whether or not it's okay that we can
have that fully-matching-downlink case before we can be sure either
way.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
05.04.2016 01:48, Peter Geoghegan:
> On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> It's a bit more complicated to add it to the index creation algorithm.
>> There's a trick with the "high key".
>>         /*
>>          * We copy the last item on the page into the new page, and then
>>          * rearrange the old page so that the 'last item' becomes its high key
>>          * rather than a true data item.  There had better be at least two
>>          * items on the page already, else the page would be empty of useful
>>          * data.
>>          */
>>         /*
>>          * Move 'last' into the high key position on opage
>>          */
>>
>> To be consistent with other steps of the algorithm (all high keys must be
>> truncated tuples), I had to update this high key in place:
>> delete the old one, and insert the truncated high key.
> Hmm. But the high key comparing equal to the Scankey gives insertion
> the choice of where to put its IndexTuple (it can go on the page with
> the high key, or its right-sibling, according only to considerations
> about fillfactor, etc). Is this changed? Does it not matter? Why not?
> Is it just worth it?

I would say, this is changed, but it doesn't matter.
Performing any search in the btree (including choosing a suitable page for
insertion), we use only key attributes.
We assume that included columns are stored in the index unordered.
A simple example:
create table tbl(id int, data int);
create index idx on tbl (id) including (data);

A select query does not consider included columns in the scan key.
It selects all tuples satisfying the condition on the key column, and only
after that applies a filter to remove wrong rows from the result.
If the key attribute doesn't satisfy the query condition, there are no more
tuples to return and we can interrupt the scan.
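
Something like this is the expected plan shape (an illustrative sketch for
the example above):

EXPLAIN (COSTS OFF) SELECT id, data FROM tbl WHERE id < 100 AND data < 20;
--  Index Only Scan using idx on tbl
--    Index Cond: (id < 100)
--    Filter: (data < 20)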

You can find more explanations in the attached sql script,
which contains queries to retrieve detailed information about the index
structure using pageinspect.
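
For example (a rough sketch; block numbers depend on the index, block 1 is
typically the first leaf page of a small btree):

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT itemoffset, itemlen FROM bt_page_items('idx', 1) LIMIT 3;
-- On a non-rightmost leaf page, itemoffset 1 is the high key; with this
-- patch its itemlen should be smaller than that of the regular tuples,
-- because the included attribute has been truncated away.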

> The right-most page on every level has no high-key. But you say those
> pages have an "imaginary" *positive* infinity high key, just as
> internal pages have (non-imaginary) minus infinity downlinks as their
> first item/downlink. So tuples in a (say) leaf page are always bound
> by the downlink lower bound in parent, while their own high key is an
> upper bound. Either (and, rarely, both) could be (positive or
> negative) infinity.
>
> Maybe you now see why I talked about special _bt_compare() logic for
> this. I proposed special logic that is similar to the existing minus
> infinity thing _bt_compare() does (although _bt_binsrch(), an
> important caller of _bt_compare() also does special things for
> internal .vs leaf case, so I'm not sure any new special logic must go
> in _bt_compare()).
>
>> In my view, it's the correct way to fix this problem, because the caller is
>> responsible for passing proper arguments to the function.
>> Of course I will add a check into _bt_compare(), but I'd rather make it an
>> assertion (see the patch attached).
> I see what you mean, but I think we need to decide what to do about
> the key space when leaf high keys are truncated. I do think that
> truncating the high key was the right idea, though, and it nicely
> illustrates that nothing special should happen in upper levels. Suffix
> truncation should only happen when leaf pages are split, generally
> speaking.
> As I said, the high key is very similar to the downlinks, in that both
> bound the things that go on each page. Together with downlinks they
> represent *discrete* ranges *unambiguously*, so INCLUDING truncation
> needs to make it clear which page new items go on. As I said,
> _bt_binsrch() already takes special actions for internal pages, making
> sure to return the item that is < the scankey, not <= the scankey
> which is only allowed for leaf pages. (See README, from "Lehman and
> Yao assume that the key range for a subtree S is described by Ki < v
> <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
> page...").
>
> To give a specific example, I worry about the case where two sibling
> downlinks in a parent page are distinct, but per specific-to-Postgres
> "Ki <= v <= Ki+1" thing (which differs from the classic L&Y
> invariant), some tuples with all right downlink's attributes matching
> end up in left child page, not right child page. I worry that since
> _bt_findsplitloc() doesn't consider this (for example), the split
> point doesn't *reliably* and unambiguously divide the key space
> between the new halves of a page being split. I think the "Ki <= v <=
> Ki+1"/_bt_binsrch() thing might save you in common cases where all
> downlink attributes are distinct, so maybe that simpler case is okay.
> But to be even more specific, what about the more complicated case
> where the downlinks *are* fully _bt_compare()-wise equal? This could
> happen even though they're constrained to be unique in leaf pages, due
> to bloat. Unique indexes aren't special here; they just make it far
> less likely that this would happen in practice, because it takes a lot
> of bloat. Less importantly, when that bloat happens, you don't want to
> have to do a linear scan through many leaf pages (that should only
> happen when there are many fully matching IndexTuples at the leaf
> level -- not just matching on constrained attributes).

"just matching on constrained attributes" is the core idea of the whole
patch. Included columns just provide us possibility to use index-only
scan. Nothing more. We assume use case where index-only-scan is faster
than index-scan + heap fetch. For example, in queries like "select data
from tbl where id = 1;" we have no scan condition on data. Maybe you
afraid of long linear scan when we have enormous index bloat even on
unique index. It will happen anyway, whether we have index-only scan on
covering index or index-scan on unique index + heap fetch. The only
difference is that the covering index is faster.
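
To illustrate the comparison (a sketch, assuming the visibility map is set
after VACUUM; t2 and its indexes are hypothetical names):

CREATE TABLE t2 (id int, data int);
CREATE UNIQUE INDEX t2_plain ON t2 (id);                 -- index scan + heap fetch
CREATE UNIQUE INDEX t2_cov ON t2 (id) INCLUDING (data);  -- index-only scan
VACUUM t2;
EXPLAIN SELECT data FROM t2 WHERE id = 1;  -- can use Index Only Scan using t2_cov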

At the very beginning of the proposal discussion, I suggested implementing
a third kind of column: not constrained, but usable in a scankey.
Such columns must have an opclass, and they are not truncated. But it was
decided to abandon this feature.

> The more I think about it, the more I doubt that it's okay to not
> ensure downlinks are always distinct with their siblings, by sometimes
> including non-constrained (truncatable) attributes within internal
> pages, as needed to *distinguish* downlinks (also, we must
> occasionally have *all* attributes including truncatable attributes in
> internal pages -- we must truncate nothing to keep the key space sane
> in the parent). Unfortunately, these requirements are very close to
> the actual full requirements for a full, complete suffix truncation
> patch, including storing how many attributes are stored in each and
> every internal IndexTuple (no general thing for the index), page split
> code to determine where to truncate to make adjacent downlinks
> distinct, etc.
>
> You may think: But that fully-matching-downlink case is okay, because
> it only makes us do more linear scanning due to the lack of
> non-truncatable attributes, which is still correct, if a little more
> slow when there is bloat -- at the leaf level, we'll start at the
> correct place (the first place the item could be on), per the "Ki <= v
> <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
> need to be able to reliably detect a concurrent page-split. Otherwise,
> we'll move right within _bt_search() before even considering if
> anything of interest for our index scan *might* be on the initial page
> found from downlink (before even calling _bt_binsrch()). Even this bug
> wouldn't happen in the common case where nextkey = true, but what
> about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
> we are not supposed to by spuriously moving right, I think. I have a
> bad feeling that even then we'd "accidentally fail to fail", because
> of how backwards scans work at a higher level, but it's just too hard
> to prove that that is correct. It's just too complicated to rely on so
> much from a great distance.
>
> This might not be the simplest example of where we could run into
> trouble, but it's one example that I could see. The assumption that
> downlinks and highkeys discretely separate ranges in the key space is
> probably made many times. There could be more problematic spots, and
> it's really hard to know where they might be. :-(
>
> In general, it's common for any modification to the B-Tree code to
> only break in a very subtle way, like this. I would be more
> comfortable if I knew the patch received extensive stress-testing,
> probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
> generally, I believe we should not allow the key space to fail to be
> separated fully by downlinks and high keys, even if our original "Ki
> <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
> happens to mask the problems in simple testing. It's too different to
> what we have today.

Frankly, I still do not understand what you're worried about.
If the high key is greater than the scan key, we definitely cannot find any
more tuples, because key attributes are ordered.
If the high key is equal to the scan key, we will continue searching and
read the next page.
The code is not changed here; it is the same as the processing of duplicates
spreading over several pages. If you do not trust the PostgreSQL btree
changes to L&Y that make duplicates work, I don't know what to say,
but it's definitely not related to my patch.

Of course I do not mind if someone does more testing.
I did some tests and didn't find anything special. Besides, don't we
have special alpha and beta release stages to find tricky bugs?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2016 at 7:56 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> I would say, this is changed, but it doesn't matter.

Actually, I would now say that it hasn't really changed (see below),
based on my new understanding. The *choice* to go on one page or the
other still exists.

> Performing any search in the btree (including choosing a suitable page for
> insertion), we use only key attributes.
> We assume that included columns are stored in the index unordered.

The patch assumes no ordering for the non-indexed columns in the
index? While I knew that the patch was primarily motivated by enabling
index-only scans, I didn't realize that at all. The patch is much much
less like a general suffix truncation patch than I thought. I may have
been confused in part by the high key issue that you only recently
fixed, but you should have corrected me about suffix truncation
earlier. Obviously, this was a significant misunderstanding; we have
been "talking at cross purposes" this whole time.

There seems to have been significant misunderstanding about this before now:

http://www.postgresql.org/message-id/CAKJS1f9W0aB-g7H6yYgNBq7hJsOKF3UwHU7-Q5jobbaTyK9f4g@mail.gmail.com

My new understanding: The extra "included" columns are stored in the
index, but do not affect its sort order at all. They are no more part
of the key than, say, the heap TID that the key points to. They are
just "payload".

> "just matching on constrained attributes" is the core idea of the whole
> patch. Included columns just provide us possibility to use index-only scan.
> Nothing more. We assume use case where index-only-scan is faster than
> index-scan + heap fetch. For example, in queries like "select data from tbl
> where id = 1;" we have no scan condition on data. Maybe you afraid of long
> linear scan when we have enormous index bloat even on unique index. It will
> happen anyway, whether we have index-only scan on covering index or
> index-scan on unique index + heap fetch. The only difference is that the
> covering index is faster.

My concern about performance when that happens is very much secondary.
I really only mentioned it to help explain my primary concern.

> At the very beginning of the proposal discussion, I suggested implementing
> a third kind of column: not constrained, but usable in a scankey.
> Such columns must have an opclass, and they are not truncated. But it was
> decided to abandon this feature.

I must have missed that. Obviously, I wasn't paying enough attention
to earlier discussion. Earlier versions of the patch did fail to
recognize that the sort order was not the entire indexed order, but
that isn't the case with V8. That that was ever possible was only a
bug, it turns out.

>> The more I think about it, the more I doubt that it's okay to not
>> ensure downlinks are always distinct with their siblings, by sometimes
>> including non-constrained (truncatable) attributes within internal
>> pages, as needed to *distinguish* downlinks (also, we must
>> occasionally have *all* attributes including truncatable attributes in
>> internal pages -- we must truncate nothing to keep the key space sane
>> in the parent).

> Frankly, I still do not understand what you're worried about.
> If the high key is greater than the scan key, we definitely cannot find any
> more tuples, because key attributes are ordered.
> If the high key is equal to the scan key, we will continue searching and
> read the next page.

I thought, because of the emphasis on unique indexes, that this patch
was mostly to offer a way of getting an index with uniqueness only
enforced on certain columns, but otherwise just the same as having a
non-unique index on those same columns. Plus, some suffix truncation,
because point-lookups involving later attributes are unlikely to be
useful when this is scoped to just unique indexes (which were
emphasized by you), because truncating key columns is not helpful
unless bloat is terrible.

I now understand that it was quite wrong to link this to suffix
truncation at all. The two are really not the same. That does make the
patch seem significantly simpler, at least as far as nbtree goes; a
tool like amcheck is not likely to detect problems in this patch that
a human tester could not catch. That was the kind of problem that I
feared.

> The code is not changed here; it is the same as the processing of duplicates
> spreading over several pages. If you do not trust the PostgreSQL btree
> changes to L&Y that make duplicates work, I don't know what to say, but it's
> definitely not related to my patch.

My point about the postgres btree changes to L&Y to make duplicates
work is that I think it makes the patch work, but perhaps not
absolutely reliably. I don't have any specific misgivings about it on
its own. Again, my earlier remarks were based on a misguided
understanding of the patch, so it doesn't matter now.

Communication is hard. There may be a lesson here for both of us about that.

> Of course I do not mind if someone does more testing.
> I did some tests and didn't find anything special. Besides, don't we have
> special alpha and beta release stages to find tricky bugs?

Our history of committing performance improvements to the B-Tree code
is limited, particularly in the last 5 years. That's definitely a
problem, and one that I have tried to make smaller, but it is the
reality.

BTW, I can see why you used index_reform_tuple(), rather than trying
to modify an existing tuple in place. NULL bitmaps have a storage
overhead in IndexTuples (presumably an alternative approach would make
truncated IndexTuples have NULL attributes to represent truncation),
whereas the cost of index_reform_tuple() only has to be paid when
there is a leaf page split. It's important that truncation is 100%
guaranteed to produce a tuple smaller than the inserted tuple,
otherwise the user could get a non-recoverable "1/3 of page size
exceeded" when they were not the one to insert the big IndexTuple. I
should try to see if this could be possible due to some
index_reform_tuple() edge-case.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
> My new understanding: The extra "included" columns are stored in the
> index, but do not affect its sort order at all. They are no more part
> of the key than, say, the heap TID that the key points to. They are
> just "payload".

Noticed a few issues following another pass:

* tuplesort.c should handle the CLUSTER case in the same way as the
btree case. No?

* Why have a RelationGetNumberOfAttributes(indexRel) call in
tuplesort_begin_index_btree() at all now?

* This critical section is unnecessary, because this happens during
index builds:

+               if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+               {
+                       /*
+                        * It's essential to truncate High key here.
+                        * The purpose is not just to save more space on this particular page,
+                        * but to keep whole b-tree structure consistent. Subsequent insertions
+                        * assume that hikey is already truncated, and so they should not
+                        * worry about it, when copying the high key into the parent page
+                        * as a downlink.
+                        * NOTE It is not crucial for reliability at present,
+                        * but maybe it will be in the future.
+                        * NOTE this code will be changed by the "btree compression" patch,
+                        * which is in progress now.
+                        */
+                       keytup = index_reform_tuple(wstate->index, oitup,
+                                                   indnatts, indnkeyatts);
+
+                       /* delete "wrong" high key, insert keytup as P_HIKEY */
+                       START_CRIT_SECTION();
+                       PageIndexTupleDelete(opage, P_HIKEY);
+
+                       if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
+                               elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+                                       RelationGetRelationName(wstate->index));
+                       END_CRIT_SECTION();
+               }

Note that START_CRIT_SECTION() promotes any ERROR to PANIC, which
isn't useful here, because we have no buffer lock held, and nothing
must be WAL-logged.

* Think you forgot to update spghandler(). (You did not add a test for
just that one AM, either)

* I wonder why this restriction needs to exist:

+               else
+                       elog(ERROR, "Expressions are not supported in
included columns.");

What does not supporting it buy us? Was it just that the pg_index
representation is more complicated, and you wanted to put it off?

An error like this should use ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED ..., btw.

* I would like to see index_reform_tuple() assert that the new,
truncated index tuple is definitely <= the original (I worry about the
1/3 page restriction issue). Maybe you should also change the name of
index_reform_tuple(), per David.

* There is some stray whitespace within RelationGetIndexAttrBitmap().
I think you should have updated it with code, though. I don't think
it's necessary for HOT updates to work, but I think it could be
necessary so that we don't need to get a row lock that blocks
non-conflict foreign key locking (see heap_update() callers). I think
you need to be careful for non-key columns within the loop in
RelationGetIndexAttrBitmap(), basically, because it seems to still go
through all columns. UPSERT also must call this code, FWIW.

* I think that a similar omission is also made for the replica
identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed
on how this patch interacts with logical decoding, I guess.

* Valgrind shows an error with an aggregate statement I tried:

2016-04-05 17:01:31.129 PDT 12310 LOG:  statement: explain analyze
select count(*) from ab  where b > 5 group by a, b;
==12310== Invalid read of size 4
==12310==    at 0x656615: match_clause_to_indexcol (indxpath.c:2226)
==12310==    by 0x656615: match_clause_to_index (indxpath.c:2144)
==12310==    by 0x656DBC: match_clauses_to_index (indxpath.c:2115)
==12310==    by 0x658054: match_restriction_clauses_to_index (indxpath.c:2026)
==12310==    by 0x658054: create_index_paths (indxpath.c:269)
==12310==    by 0x64D1DB: set_plain_rel_pathlist (allpaths.c:649)
==12310==    by 0x64D1DB: set_rel_pathlist (allpaths.c:427)
==12310==    by 0x64D93B: set_base_rel_pathlists (allpaths.c:299)
==12310==    by 0x64D93B: make_one_rel (allpaths.c:170)
==12310==    by 0x66876C: query_planner (planmain.c:246)
==12310==    by 0x669FBA: grouping_planner (planner.c:1666)
==12310==    by 0x66D0C9: subquery_planner (planner.c:751)
==12310==    by 0x66D3DA: standard_planner (planner.c:300)
==12310==    by 0x66D714: planner (planner.c:170)
==12310==    by 0x6FD692: pg_plan_query (postgres.c:798)
==12310==    by 0x59082D: ExplainOneQuery (explain.c:350)
==12310==  Address 0xbff290c is 2,508 bytes inside a block of size 8,192 alloc'd
==12310==    at 0x4C2AB80: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12310==    by 0x81B7FA: AllocSetAlloc (aset.c:853)
==12310==    by 0x81D257: palloc (mcxt.c:907)
==12310==    by 0x4B6F65: RelationGetIndexScan (genam.c:94)
==12310==    by 0x4C135D: btbeginscan (nbtree.c:431)
==12310==    by 0x4B7A5C: index_beginscan_internal (indexam.c:279)
==12310==    by 0x4B7C5A: index_beginscan (indexam.c:222)
==12310==    by 0x4B73D1: systable_beginscan (genam.c:379)
==12310==    by 0x7E8CF9: ScanPgRelation (relcache.c:341)
==12310==    by 0x7EB3C4: RelationBuildDesc (relcache.c:951)
==12310==    by 0x7ECD35: RelationIdGetRelation (relcache.c:1800)
==12310==    by 0x4A4D37: relation_open (heapam.c:1118)
==12310==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr4
   fun:match_clause_to_indexcol
   fun:match_clause_to_index
   fun:match_clauses_to_index
   fun:match_restriction_clauses_to_index
   fun:create_index_paths
   fun:set_plain_rel_pathlist
   fun:set_rel_pathlist
   fun:set_base_rel_pathlists
   fun:make_one_rel
   fun:query_planner
   fun:grouping_planner
   fun:subquery_planner
   fun:standard_planner
   fun:planner
   fun:pg_plan_query
   fun:ExplainOneQuery
}

Separately, I tried "make installcheck-tests TESTS=index_including"
from Postgres + Valgrind, with Valgrind's --track-origins option
enabled (as it was above). I recommend installing Valgrind, and making
sure that the patch shows no errors. I didn't actually find a Valgrind
issue from just using your regression tests (nor did I find an issue
from separately running the regression tests with
CLOBBER_CACHE_ALWAYS, FWIW).

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
06.04.2016 03:05, Peter Geoghegan:
> On Tue, Apr 5, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> My new understanding: The extra "included" columns are stored in the
>> index, but do not affect its sort order at all. They are no more part
>> of the key than, say, the heap TID that the key points to. They are
>> just "payload".

It was a really long and complicated discussion. I'm glad that we finally
agree about the patch.
Anyway, I think all the questions mentioned will be very helpful for
future work on the b-tree.

> Noticed a few issues following another pass:
>
> * tuplesort.c should handle the CLUSTER case in the same way as the
> btree case. No?
Yes, I just missed that CLUSTER uses the index sort. Fixed.

> * Why have a RelationGetNumberOfAttributes(indexRel) call in
> tuplesort_begin_index_btree() at all now?
Fixed.
> * This critical section is unnecessary, because this happens during
> index builds:
>
> +               if (indnkeyatts != indnatts && P_ISLEAF(opageop))
> +               {
> +                       /*
> +                        * It's essential to truncate High key here.
> +                        * The purpose is not just to save more space on this particular page,
> +                        * but to keep whole b-tree structure consistent. Subsequent insertions
> +                        * assume that hikey is already truncated, and so they should not
> +                        * worry about it, when copying the high key into the parent page
> +                        * as a downlink.
> +                        * NOTE It is not crucial for reliability at present,
> +                        * but maybe it will be in the future.
> +                        * NOTE this code will be changed by the "btree compression" patch,
> +                        * which is in progress now.
> +                        */
> +                       keytup = index_reform_tuple(wstate->index, oitup,
> +                                                   indnatts, indnkeyatts);
> +
> +                       /* delete "wrong" high key, insert keytup as P_HIKEY */
> +                       START_CRIT_SECTION();
> +                       PageIndexTupleDelete(opage, P_HIKEY);
> +
> +                       if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
> +                               elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
> +                                       RelationGetRelationName(wstate->index));
> +                       END_CRIT_SECTION();
> +               }
>
> Note that START_CRIT_SECTION() promotes any ERROR to PANIC, which
> isn't useful here, because we have no buffer lock held, and nothing
> must be WAL-logged.
>
> * Think you forgot to update spghandler(). (You did not add a test for
> just that one AM, either)
Fixed.
> * I wonder why this restriction needs to exist:
>
> +               else
> +                       elog(ERROR, "Expressions are not supported in
> included columns.");
>
> What does not supporting it buy us? Was it just that the pg_index
> representation is more complicated, and you wanted to put it off?
>
> An error like this should use ereport(ERROR,
> (errcode(ERRCODE_FEATURE_NOT_SUPPORTED ..., btw.
Yes, you got it right. It was a bit complicated to implement, and I
decided to postpone it to a follow-up patch.
The errmsg is fixed.

> * I would like to see index_reform_tuple() assert that the new,
> truncated index tuple is definitely <= the original (I worry about the
> 1/3 page restriction issue). Maybe you should also change the name of
> index_reform_tuple(), per David.
Is it possible that the new tuple, containing fewer attributes than the
old one, will have a greater size?
Maybe you can give an example?
I think that Assert(indnkeyatts <= indnatts); covers this kind of error.
I don't mind renaming this function, but what name would be better?
index_truncate_tuple()?

> * There is some stray whitespace within RelationGetIndexAttrBitmap().
> I think you should have updated it with code, though. I don't think
> it's necessary for HOT updates to work, but I think it could be
> necessary so that we don't need to get a row lock that blocks
> non-conflict foreign key locking (see heap_update() callers). I think
> you need to be careful for non-key columns within the loop in
> RelationGetIndexAttrBitmap(), basically, because it seems to still go
> through all columns. UPSERT also must call this code, FWIW.
>
> * I think that a similar omission is also made for the replica
> identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed
> on how this patch interacts with logical decoding, I guess.

Good point. Indexes are everywhere in the code.
I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX.
I'll discuss it with Teodor and send an updated patch tomorrow.

> * Valgrind shows an error with an aggregate statement I tried:
>
> 2016-04-05 17:01:31.129 PDT 12310 LOG:  statement: explain analyze
> select count(*) from ab  where b > 5 group by a, b;
> ==12310== Invalid read of size 4
> ==12310==    at 0x656615: match_clause_to_indexcol (indxpath.c:2226)
> ==12310==    by 0x656615: match_clause_to_index (indxpath.c:2144)
> ==12310==    by 0x656DBC: match_clauses_to_index (indxpath.c:2115)
> ==12310==    by 0x658054: match_restriction_clauses_to_index (indxpath.c:2026)
> ==12310==    by 0x658054: create_index_paths (indxpath.c:269)
> ==12310==    by 0x64D1DB: set_plain_rel_pathlist (allpaths.c:649)
> ==12310==    by 0x64D1DB: set_rel_pathlist (allpaths.c:427)
> ==12310==    by 0x64D93B: set_base_rel_pathlists (allpaths.c:299)
> ==12310==    by 0x64D93B: make_one_rel (allpaths.c:170)
> ==12310==    by 0x66876C: query_planner (planmain.c:246)
> ==12310==    by 0x669FBA: grouping_planner (planner.c:1666)
> ==12310==    by 0x66D0C9: subquery_planner (planner.c:751)
> ==12310==    by 0x66D3DA: standard_planner (planner.c:300)
> ==12310==    by 0x66D714: planner (planner.c:170)
> ==12310==    by 0x6FD692: pg_plan_query (postgres.c:798)
> ==12310==    by 0x59082D: ExplainOneQuery (explain.c:350)
> ==12310==  Address 0xbff290c is 2,508 bytes inside a block of size 8,192 alloc'd
> ==12310==    at 0x4C2AB80: malloc (in
> /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==12310==    by 0x81B7FA: AllocSetAlloc (aset.c:853)
> ==12310==    by 0x81D257: palloc (mcxt.c:907)
> ==12310==    by 0x4B6F65: RelationGetIndexScan (genam.c:94)
> ==12310==    by 0x4C135D: btbeginscan (nbtree.c:431)
> ==12310==    by 0x4B7A5C: index_beginscan_internal (indexam.c:279)
> ==12310==    by 0x4B7C5A: index_beginscan (indexam.c:222)
> ==12310==    by 0x4B73D1: systable_beginscan (genam.c:379)
> ==12310==    by 0x7E8CF9: ScanPgRelation (relcache.c:341)
> ==12310==    by 0x7EB3C4: RelationBuildDesc (relcache.c:951)
> ==12310==    by 0x7ECD35: RelationIdGetRelation (relcache.c:1800)
> ==12310==    by 0x4A4D37: relation_open (heapam.c:1118)
> ==12310==
> {
>     <insert_a_suppression_name_here>
>     Memcheck:Addr4
>     fun:match_clause_to_indexcol
>     fun:match_clause_to_index
>     fun:match_clauses_to_index
>     fun:match_restriction_clauses_to_index
>     fun:create_index_paths
>     fun:set_plain_rel_pathlist
>     fun:set_rel_pathlist
>     fun:set_base_rel_pathlists
>     fun:make_one_rel
>     fun:query_planner
>     fun:grouping_planner
>     fun:subquery_planner
>     fun:standard_planner
>     fun:planner
>     fun:pg_plan_query
>     fun:ExplainOneQuery
> }
>
> Separately, I tried "make installcheck-tests TESTS=index_including"
> from Postgres + Valgrind, with Valgrind's --track-origins option
> enabled (as it was above). I recommend installing Valgrind, and making
> sure that the patch shows no errors. I didn't actually find a Valgrind
> issue from just using your regression tests (nor did I find an issue
> from separately running the regression tests with
> CLOBBER_CACHE_ALWAYS, FWIW).
>
Thank you for the advice.
Another missed replacement of index->ncolumns with index->nkeycolumns, this
time in match_clause_to_index. Fixed.
I also fixed a couple of typos in the documentation.

Thank you again for the detailed review.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
06.04.2016 16:15, Anastasia Lubennikova:
> 06.04.2016 03:05, Peter Geoghegan:
>> * There is some stray whitespace within RelationGetIndexAttrBitmap().
>> I think you should have updated it with code, though. I don't think
>> it's necessary for HOT updates to work, but I think it could be
>> necessary so that we don't need to get a row lock that blocks
>> non-conflict foreign key locking (see heap_update() callers). I think
>> you need to be careful for non-key columns within the loop in
>> RelationGetIndexAttrBitmap(), basically, because it seems to still go
>> through all columns. UPSERT also must call this code, FWIW.
>>
>> * I think that a similar omission is also made for the replica
>> identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed
>> on how this patch interacts with logical decoding, I guess.
>
> Good point. Indexes are everywhere in the code.
> I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX.
> I'll discuss it with Teodor and send an updated patch tomorrow.

As promised, the updated patch is attached.
But I'm not an expert in this area, so it needs a critical look.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 6, 2016 at 6:15 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> * I would like to see index_reform_tuple() assert that the new,
>> truncated index tuple is definitely <= the original (I worry about the
>> 1/3 page restriction issue). Maybe you should also change the name of
>> index_reform_tuple(), per David.
>
> Is it possible that the new tuple, containing fewer attributes than the old
> one, will have a greater size?
> Maybe you can give an example?
> I think that Assert(indnkeyatts <= indnatts); covers this kind of error.

I don't think it is possible, because you aren't e.g. making an
attribute's value NULL where it wasn't NULL before (making the
IndexTuple contain a NULL bitmap where it didn't before). But that's
kind of subtle, and it certainly seems worth an assertion. It could
change tomorrow, when someone optimizes heap_deform_tuple(), which has
been proposed more than once.

Personally, I like documenting assertions, and will sometimes write
assertions that the compiler could easily optimize away. Maybe going
*that* far is more a matter of personal style, but I think an
assertion about the new index tuple size being <= the old one is just
a good idea. It's not about a problem in your code at all.

> I don't mind renaming this function, but what name would be better?
> index_truncate_tuple()?

That seems better, yes.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Personally, I like documenting assertions, and will sometimes write
> assertions that the compiler could easily optimize away. Maybe going
> *that* far is more a matter of personal style, but I think an
> assertion about the new index tuple size being <= the old one is just
> a good idea. It's not about a problem in your code at all.

You should make index_truncate_tuple()/index_reform_tuple() promise to
always do this in its comments/contract with caller as part of this,
IMV.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
06.04.2016 23:52, Peter Geoghegan:
> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Personally, I like documenting assertions, and will sometimes write
>> assertions that the compiler could easily optimize away. Maybe going
>> *that* far is more a matter of personal style, but I think an
>> assertion about the new index tuple size being <= the old one is just
>> a good idea. It's not about a problem in your code at all.
> You should make index_truncate_tuple()/index_reform_tuple() promise to
> always do this in its comments/contract with caller as part of this,
> IMV.
>

The mentioned issues are fixed. The patch is attached.

I'd like to remind you that the commitfest will be closed very
soon, so I'd like to get your final resolution on the patch.
Not having it in the 9.6 release would be very disappointing.

I agree that b-tree is a crucial subsystem. But it seems to me that we
have a lack of improvements in this area not only because of the
algorithm's complexity, but also because of a lack of enthusiasts to work
on it and struggle through endless discussions. But that's off-topic here.
Attention to these development difficulties will be one of the messages
of my pgcon talk.

You know, we lost a lot of time discussing various b-tree problems.
Besides that, I am sure that the patch is really in good shape. It
has no open problems to fix.
And possible subtle bugs can be found at the testing stage of the release.

Looking forward to your reply.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Personally, I like documenting assertions, and will sometimes write
>> assertions that the compiler could easily optimize away. Maybe going
>> *that* far is more a matter of personal style, but I think an
>> assertion about the new index tuple size being <= the old one is just
>> a good idea. It's not about a problem in your code at all.
>
> You should make index_truncate_tuple()/index_reform_tuple() promise to
> always do this in its comments/contract with caller as part of this,
> IMV.
>
Some notices:
- index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts,
                       int indnkeyatts)
  Why do we need indnatts/indnkeyatts? They are present in the idxrel
  struct already.
- Follow the code where index_truncate_tuple() is called: it should never be
  called in the case where indnatts == indnkeyatts. So indnkeyatts should be
  strictly less than indnatts; please change the assertion. If they are
  equal, this function becomes a complicated variant of CopyIndexTuple().
 
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
08.04.2016 15:06, Teodor Sigaev:
>> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Personally, I like documenting assertions, and will sometimes write
>>> assertions that the compiler could easily optimize away. Maybe going
>>> *that* far is more a matter of personal style, but I think an
>>> assertion about the new index tuple size being <= the old one is just
>>> a good idea. It's not about a problem in your code at all.
>>
>> You should make index_truncate_tuple()/index_reform_tuple() promise to
>> always do this in its comments/contract with caller as part of this,
>> IMV.
>>
> Some notices:
> - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts,
>                        int indnkeyatts)
>   Why do we need indnatts/indnkeyatts? They are present in the idxrel
>   struct already.
> - Follow the code where index_truncate_tuple() is called: it should never
>   be called in the case where indnatts == indnkeyatts. So indnkeyatts
>   should be strictly less than indnatts; please change the assertion. If
>   they are equal, this function becomes a complicated variant of
>   CopyIndexTuple().

Good point. These attributes seem to have been there since previous versions
of the function.
But now they are definitely unnecessary. Updated patch is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
08.04.2016 15:45, Anastasia Lubennikova:
> 08.04.2016 15:06, Teodor Sigaev:
>>> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Personally, I like documenting assertions, and will sometimes write
>>>> assertions that the compiler could easily optimize away. Maybe going
>>>> *that* far is more a matter of personal style, but I think an
>>>> assertion about the new index tuple size being <= the old one is just
>>>> a good idea. It's not about a problem in your code at all.
>>>
>>> You should make index_truncate_tuple()/index_reform_tuple() promise to
>>> always do this in its comments/contract with caller as part of this,
>>> IMV.
>>>
>> Some notices:
>> - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts,
>>                        int indnkeyatts)
>>   Why do we need indnatts/indnkeyatts? They are present in the idxrel
>>   struct already.
>> - Follow the code where index_truncate_tuple() is called: it should never
>>   be called in the case where indnatts == indnkeyatts. So indnkeyatts
>>   should be strictly less than indnatts; please change the assertion. If
>>   they are equal, this function becomes a complicated variant of
>>   CopyIndexTuple().
>
> Good point. These attributes seem to be there since previous versions
> of the function.
> But now they are definitely unnecessary. Updated patch is attached

One more improvement: a note about expressions in the documentation.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
The attached version has the pg_dump fix suggested by Stephen Frost in the -committers thread:
http://postgresql.nabble.com/pgsql-CREATE-INDEX-INCLUDING-column-td5897653.html
Sooner or later, I'd like to see this patch finished.

For now, it has two complaints:
- Support for expressions as included columns.
Frankly, I don't understand why this is a problem of the patch.
The patch is already big enough, and it will be much easier to add expression support in a follow-up patch, after the first one is stable (see the sketch below).
Does anyone object to that?
Yes, it's a kind of deferred feature. But should we hold every patch until it is entirely complete?

- Lack of review and testing.
Obviously, I did as much testing as I could.
So, if reviewers have any concerns about the patch, I'm looking forward to seeing them.
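To illustrate the first complaint, a minimal sketch (table and index names
are hypothetical; the second statement shows the deferred part, which the
patch as described would presumably reject until the follow-up lands):

CREATE TABLE tbl (a int, b int, c int, d text);
-- plain columns as included columns work in the current patch:
CREATE UNIQUE INDEX tbl_idx ON tbl USING btree (a, b) INCLUDING (c);
-- an expression as an included column is the deferred feature:
CREATE UNIQUE INDEX tbl_idx_expr ON tbl USING btree (a, b)
    INCLUDING (lower(d));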
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 12, 2016 at 9:14 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Sooner or later, I'd like to see this patch finished.

Me, too.

> For now, it has two complaints:
> - support of expressions as included columns.
> Frankly, I don't understand, why it's a problem of the patch.
> The patch is  already big enough and it will be much easier to add
> expressions support in the following patch, after the first one will be
> stable.
> I wonder, if someone has objections to that?

Probably. If we limit the scope of something, it's always in a way
that limits the functionality available to users, rather than limits
how generalized the new functionality is, and so cutting scope
sometimes isn't possible. There is a very high value placed on
features working well together. A user ought to be able to rely on the
intuition that features work well together. Preserving that general
ability for users to guess correctly what will work based on what they
already know is seen as important.

For example, notice that the INSERT documentation allows UPSERT unique
index inference to optionally accept an opclass or collation. So far,
the need for this functionality is totally theoretical (in practice
all B-Tree opclasses have the same idea about equality across a given
type, and we have no case insensitive collations), but it's still
there. Making that work was not a small effort (there was a follow-up
bugfix commit just for that, too). This approach is mostly about
making the implementation theoretically sound (or demonstrating that
it is) by considering edge-cases up-front. Often, there will be
benefits to a maximally generalized approach that were not initially
anticipated by the patch author, or anyone else.
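For reference, the inference syntax mentioned above looks roughly like this
(a sketch; the table, index, and opclass here are illustrative, not from the
patch):

create table words (w text);
create unique index words_w_key on words (w text_pattern_ops);
-- the conflict target may name an opclass (and optionally a collation):
insert into words (w) values ('foo')
on conflict (w text_pattern_ops) do nothing;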

I agree that it is difficult to uphold this standard at all times, but
there is something to be said for it. Postgres development must have a
very long term outlook, and this approach tends to make things easier
for future patch authors by making the code more maintainable. Even if
this is the wrong thing in specific cases, it's sometimes easier to
just do it than to convince others that their concern is misplaced in
this one instance.

> Yes, it's a kind of delayed feature. But should we wait for every patch when
> it will be entirely completed?

I think that knowing where and how to cut scope is an important skill.
If this question is asked as a general question, then the answer must
be "yes". I suggest asking a more specific question. :-)

> - lack of review and testing
> Obviously I did as much testing as I could.
> So, if reviewers have any concerns about the patch, I'm waiting forward to
> see them.

For what it's worth, I agree that you put a great deal of effort into
this patch, and it did not get into 9.6 because of a collective
failure to focus minds on the patch. Your patch was a credible
attempt, which is impressive when you consider that the B-Tree code is
so complicated. There is also the fact that there is now a very small
list of credible reviewers for B-Tree patches; you must have noticed
that not even amcheck was committed, even though I was asked to
produce a polished version in February during the FOSDEM dev meeting,
and even though it's just a contrib module that is totally orientated
around finding bugs and so on. I'm not happy about that either, but
that's just something I have to swallow.

I fancy myself an expert on the B-Tree code, but I've never managed
to make an impact in improving its performance at all (I've never made
a serious effort, but have had many ideas). So, in case it needs to be
said, I'll say it: You've chosen a very ambitious set of projects to
work on, by any standard. I think it's a good thing that you've been
ambitious, and I don't suggest changing that, since I think that you
have commensurate skill. But, in order to be successful in these
projects, patience and resolve are very important.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
David Steele
Date:
On 4/27/16 5:08 PM, Peter Geoghegan wrote:

> So, in case it needs to be
> said, I'll say it: You've chosen a very ambitious set of projects to
> work on, by any standard. I think it's a good thing that you've been
> ambitious, and I don't suggest changing that, since I think that you
> have commensurate skill. But, in order to be successful in these
> projects, patience and resolve are very important.

+1.

This is very exciting work and I look forward to seeing it continue.
The patch was perhaps not a good fit for the last CF of 9.6 but that
doesn't mean it can't have a bright future.

-- 
-David
david@pgmasters.net



Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Wed, Apr 27, 2016 at 5:47 PM, David Steele <david@pgmasters.net> wrote:
> On 4/27/16 5:08 PM, Peter Geoghegan wrote:
>> So, in case it needs to be
>> said, I'll say it: You've chosen a very ambitious set of projects to
>> work on, by any standard. I think it's a good thing that you've been
>> ambitious, and I don't suggest changing that, since I think that you
>> have commensurate skill. But, in order to be successful in these
>> projects, patience and resolve are very important.
>
> +1.
>
> This is very exciting work and I look forward to seeing it continue.
> The patch was perhaps not a good fit for the last CF of 9.6 but that
> doesn't mean it can't have a bright future.

+1.  Totally agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, failed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi hackers!

I've read the patch and here is my code review.

==========PURPOSE============
I've used this feature from time to time with MS SQL. From my experience, INCLUDE is a 'sugar on top' feature.
Some MS SQL classes do not even mention INCLUDE despite it being there since 2005 (though classes omit lots of
important things, so that's not a particularly valuable indicator).

But those who use it, use it whenever possible. For example, the system view with recommended indexes rarely lists
one without INCLUDE columns.

So, this feature is very important from the perspective of converting MS SQL DBAs to PostgreSQL. This is how I see it.

========SUGGESTIONS==========
0. Index build is broken. This script
https://github.com/x4m/pggistopt/blob/8ad65d2e305e98c836388a07909af5983dba9c73/test.sql SEGFAULTs and may cause a
situation where you cannot insert anything into the table (I think dropping the index would help, but I didn't test
this).
1. I think the MS SQL syntax INCLUDE instead of INCLUDING would be better (for the purpose listed above).
2. An empty line was added in ruleutils.c. Is it there for a reason?
3. Now we have indnatts and indnkeyatts instead of indnatts. I think it is worth considering renaming indnatts to
something different from the old name. Someone somewhere could still suppose it's the number of keys.
 

========PERFORMANCE==========
Due to suggestion number 0 I could not measure the performance of index build. The index crashes when there are more
than 1.1 million rows in a table.

The performance test script is here:
https://github.com/x4m/pggistopt/blob/f206b4395baa15a2fa42897eeb27bd555619119a/test.sql
The test scenario is the following:
1. Create table, then create index, then add data.
2. Make a query touching data in INCLUDING columns.
This scenario is tested against a table with (see the sketch below):
A. An index that does not contain the touched columns, just the PK.
B. An index with all columns as keys.
C. An index with the PK as the key and INCLUDING all other columns.
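Roughly, the three variants look like this (a sketch with hypothetical
names; the real definitions are in the linked script):

create table perf (id int, c2 int, c3 int);
create index idx_a on perf (id);                     -- A: key column only
create index idx_b on perf (id, c2, c3);             -- B: all columns as keys
create index idx_c on perf (id) including (c2, c3);  -- C: key plus INCLUDING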

Tests were executed 5 times on Ubuntu VM under Hyper-V i5 2500 CPU, 16 Gb of RAM, SSD disk.
Time to insert 10M rows:
A. AVG 110 seconds STD 4.8
B. AVG 121 seconds STD 2.0
C. AVG 111 seconds STD 5.7
Inserts into the INCLUDING index are almost as fast as inserts into the index without extra columns.

Time to run SELECT query:
A. AVG 2864 ms STD 794
B. AVG 2329 ms STD 84
C. AVG 2293 ms STD 58
Selects with INCLUDING columns are almost as fast as with the full index.

Index size (deterministic measure, STD = 0)
A. 317 MB
B. 509 MB
C. 399 MB
The index size is in the middle, between the full index and the minimal index.

I think these numbers agree with the expectations for the feature.

========CONCLUSION==========
This patch brings a useful and important feature. The build shall be repaired; my other suggestions are only suggestions.



Best regards, Andrey Borodin, Octonica & Ural Federal University.

The new status of this patch is: Waiting on Author

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
14.08.2016 20:11, Andrey Borodin:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:       tested, failed
> Spec compliant:           tested, passed
> Documentation:            tested, passed
>
> Hi hackers!
>
> I've read the patch and here is my code review.
>
> ==========PURPOSE============
> I've used this feature from time to time with MS SQL. From my experience INCLUDE is a 'sugar on top' feature.
> Some MS SQL classes do not even mention INCLUDE despite it being there since 2005 (though classes omit lots of
> important things, so that's not a particularly valuable indicator).
> But those who use it, use it whenever possible. For example, the system view with recommended indexes rarely lists
> one without INCLUDE columns.
> So, this feature is very important from perspective of converting MS SQL DBAs to PostgreSQL. This is how I see it.

Thank you for the review, I hope this feature will be useful for many
people.

> ========SUGGESTIONS==========
> 0. Index build is broken. This script
> https://github.com/x4m/pggistopt/blob/8ad65d2e305e98c836388a07909af5983dba9c73/test.sql SEGFAULTs and may cause a
> situation where you cannot insert anything into the table (I think dropping the index would help, but I didn't test this).

Thank you for reporting. That was a bug caused by high key truncation,
which occurs when the index has more than 3 levels.
Fixed. See the attached file.

> 1. I think the MS SQL syntax INCLUDE instead of INCLUDING would be better (for the purpose listed above)

I've chosen this particular name to avoid introducing a new keyword. We
already have INCLUDING in Postgres in the context of inheritance, which
will never intersect with covering indexes.
I'm sure it won't be a big problem for migration from MS SQL.
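For reference, the existing use of the keyword lives in a completely
different parser context (a sketch; table names are hypothetical):

create table parent (id int default 0);
create table child (like parent including defaults);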

> 2. Empty line added in ruleutils.c. Is it for a reason?

No, just a missed line.
Fixed.

> 3. Now we have indnatts and indnkeyatts instead of indnatts. I think it is worth considering renaming indnatts to
> something different from the old name. Someone somewhere could still suppose it's the number of keys.

I agree that the naming became vague after this patch.
I've already suggested replacing "indkeys[]" with a more specific name, and
AFAIR there was no reaction, so I didn't do that.
But I'm not sure about your suggestion regarding indnatts. Old queries
(and old indexes) can still use it correctly. I don't see a reason to
break compatibility for all users.
Those who use this new feature should ensure that their queries against
pg_index behave as expected.
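For example, a query like this (a sketch; it assumes the patch's new
pg_index column and a hypothetical index name) shows both counters:

select indnatts, indnkeyatts
from pg_index
where indexrelid = 'newidx'::regclass;

For a two-key index with two included columns, this should return
indnatts = 4 and indnkeyatts = 2.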

> ========PERFORMANCE==========
> Due to suggestion number 0 I could not measure the performance of index build. The index crashes when there are more
> than 1.1 million rows in a table.
> Performance test script is here
https://github.com/x4m/pggistopt/blob/f206b4395baa15a2fa42897eeb27bd555619119a/test.sql
> Test scenario is following:
> 1. Create table, then create index, then add data.
> 2. Make a query touching data in INCLUDING columns.
> This scenario is tested against table with:
> A. Table with index, that do not contain touched columns, just PK.
> B. Index with all columns in index.
> C. Index with PK in keys and INCLUDING all other columns.
>
> Tests were executed 5 times on Ubuntu VM under Hyper-V i5 2500 CPU, 16 Gb of RAM, SSD disk.
> Time to insert 10M rows:
> A. AVG 110 seconds STD 4.8
> B. AVG 121 seconds STD 2.0
> C. AVG 111 seconds STD 5.7
> Inserts to INCLUDING index is almost as fast as inserts to index without extra columns.
>
> Time to run SELECT query:
> A. AVG 2864 ms STD 794
> B. AVG 2329 ms STD 84
> C. AVG 2293 ms STD 58
> Selects with INCLUDING columns is almost as fast as with full index.
>
> Index size (deterministic measure, STD = 0)
> A. 317 MB
> B. 509 MB
> C. 399 MB
> Index size is in the middle between full index and minimal index.
>
> I think this numbers agree with expectation from the feature.
>
> ========CONCLUSION==========
> This patch brings useful and important feature. Build shall be repaired; other my suggestions are only suggestions.
>
>
>
> Best regards, Andrey Borodin, Octonica & Ural Federal University.
>
> The new status of this patch is: Waiting on Author
>

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Andrew Borodin
Date:
> That was a bug caused by high key truncation, which occurs when the index has more than 3 levels. Fixed.
Affirmative. I've tested index construction with 100M rows and
subsequent execution of SELECT queries using the index; it works fine.

Best regards, Andrey Borodin, Octonica & Ural Federal University.



Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
@@ -590,7 +622,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
  if (last_off == P_HIKEY)
  {
      Assert(state->btps_minkey == NULL);
-     state->btps_minkey = CopyIndexTuple(itup);
+     /*
+      * Truncate the tuple that we're going to insert
+      * into the parent page as a downlink
+      */
+     if (indnkeyatts != indnatts && P_ISLEAF(pageop))
+         state->btps_minkey = index_truncate_tuple(wstate->index, itup);
+     else
+         state->btps_minkey = CopyIndexTuple(itup);

It seems that the above code always ensures that for leaf pages the high key
is a truncated tuple.  What is less clear is, if that is true, why you
need to re-ensure it again for the old page in the code below:

@@ -510,6 +513,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
{
..
+     if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+     {
+         /*
+          * It's essential to truncate the high key here.
+          * The purpose is not just to save more space on this particular page,
+          * but to keep the whole b-tree structure consistent. Subsequent
+          * insertions assume that the hikey is already truncated, and so they
+          * should not worry about it when copying the high key into the
+          * parent page as a downlink.
+          * NOTE: it is not crucial for reliability at present,
+          * but maybe it will be in the future.
+          */
+         keytup = index_truncate_tuple(wstate->index, oitup);
+
+         /* delete "wrong" high key, insert keytup as P_HIKEY. */
+         PageIndexTupleDelete(opage, P_HIKEY);
+
+         if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
+             elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+                  RelationGetRelationName(wstate->index));
+     }
+
..
..

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
28.08.2016 09:13, Amit Kapila:
> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
> @@ -590,7 +622,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>   if (last_off == P_HIKEY)
>   {
>       Assert(state->btps_minkey == NULL);
> -     state->btps_minkey = CopyIndexTuple(itup);
> +     /*
> +      * Truncate the tuple that we're going to insert
> +      * into the parent page as a downlink
> +      */
> +     if (indnkeyatts != indnatts && P_ISLEAF(pageop))
> +         state->btps_minkey = index_truncate_tuple(wstate->index, itup);
> +     else
> +         state->btps_minkey = CopyIndexTuple(itup);
>
> It seems that the above code always ensures that for leaf pages the high key
> is a truncated tuple.  What is less clear is, if that is true, why you
> need to re-ensure it again for the old page in the code below:

Thank you for the question. Investigation took a long time)

As far as I understand, the code above only applies to the first tuple
of each level, while the code you have quoted below truncates high keys
for all other pages.

There is a comment that clarifies the situation:
    /*
     * If the new item is the first for its page, stash a copy for later. Note
     * this will only happen for the first item on a level; on later pages,
     * the first item for a page is copied from the prior page in the code
     * above.
     */

So the patch is correct.
We can go further and remove this index_truncate_tuple() call, because
the first key of any inner (or root) page doesn't need any key at all.
It simply points to the leftmost page of the level below.
But it's not a bug, because truncation of one tuple per level doesn't
add any considerable overhead. So I want to leave the patch in its
current state.

> @@ -510,6 +513,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
> {
> ..
> +     if (indnkeyatts != indnatts && P_ISLEAF(opageop))
> +     {
> +         /*
> +          * It's essential to truncate the high key here.
> +          * The purpose is not just to save more space on this particular page,
> +          * but to keep the whole b-tree structure consistent. Subsequent
> +          * insertions assume that the hikey is already truncated, and so they
> +          * should not worry about it when copying the high key into the
> +          * parent page as a downlink.
> +          * NOTE: it is not crucial for reliability at present,
> +          * but maybe it will be in the future.
> +          */
> +         keytup = index_truncate_tuple(wstate->index, oitup);
> +
> +         /* delete "wrong" high key, insert keytup as P_HIKEY. */
> +         PageIndexTupleDelete(opage, P_HIKEY);
> +
> +         if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
> +             elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
> +                  RelationGetRelationName(wstate->index));
> +     }
> +
> ..
> ..

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
One more update.

I added an ORDER BY clause to the regression tests.
It was done as a separate bugfix patch by Tom Lane some time ago,
but it definitely should be included in this patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 28.08.2016 09:13, Amit Kapila:
>
> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
>
> So the patch is correct.
> We can go further and remove this index_truncate_tuple() call, because
> the first key of any inner (or root) page doesn't need any key at all.
>

Anyway, I think truncation happens if the page is at leaf level and
that is ensured by the check, so I think we can't remove this:
+ if (indnkeyatts != indnatts && P_ISLEAF(pageop))


-- I have one more question regarding this truncate high-key concept.
I think if the high key is truncated, then during insertion, for cases
like the one below it moves to the next page, whereas the current page
needs to be split.

Assume an index on c1, c2, c3 where c2, c3 are included columns.

Actual high key on leaf Page X -
3, 2, 2
Truncated high key on leaf Page X
3

New insertion key
3, 1, 2

Now, I think for such cases during insertion if the page X doesn't
have enough space, it will move to the next page whereas ideally it
should split the current page.  Refer to _bt_findinsertloc() for
this logic.

Is this high-key truncation concept needed for the correctness of the
patch, or is it just to save space in the index?  If you need this,
then I think nbtree/README needs to be updated.


-- I am getting an assertion failure when I use this patch with a database
created with a build before this patch.  However, if I create a fresh
database it works fine.  Assertion failure details are as below:

LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/heaptuple.c", Line: 532)
LOG:  server process (PID 1404) was terminated by exception 0x80000003
HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
LOG:  terminating any other active server processes

--
@@ -1260,14 +1262,14 @@ RelationInitIndexAccessInfo(Relation relation)
  * Allocate arrays to hold data
  */
  relation->rd_opfamily = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
  relation->rd_opcintype = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
  amsupport = relation->rd_amroutine->amsupport;
  if (amsupport > 0)
  {
- int nsupport = natts * amsupport;
+ int nsupport = indnatts * amsupport;
  relation->rd_support = (RegProcedure *)
  MemoryContextAllocZero(indexcxt, nsupport * sizeof(RegProcedure));
@@ -1281,10 +1283,10 @@ RelationInitIndexAccessInfo(Relation relation)
  }
  relation->rd_indcollation = (Oid *)
- MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
+ MemoryContextAllocZero(indexcxt, indnatts * sizeof(Oid));

Can you add a comment in the above code, or some other related place, as to
why you need some attributes in the relcache entry of size indnkeyatts and
others of size indnatts?

--
@@ -63,17 +63,26 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
{
  ScanKey skey;
  TupleDesc itupdesc;
- int natts;
+ int indnatts,
+     indnkeyatts;
  int16 *indoption;
  int i;

  itupdesc = RelationGetDescr(rel);
- natts = RelationGetNumberOfAttributes(rel);
+ indnatts = IndexRelationGetNumberOfAttributes(rel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
  indoption = rel->rd_indoption;

- skey = (ScanKey) palloc(natts * sizeof(ScanKeyData));
+ Assert(indnkeyatts != 0);
+ Assert(indnkeyatts <= indnatts);

Here I think you need to declare indnatts as PG_USED_FOR_ASSERTS_ONLY,
otherwise it will give a warning on some platforms.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Tue, Sep 20, 2016 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 28.08.2016 09:13, Amit Kapila:
>>
>> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>
>>
>> So the patch is correct.
>> We can go further and remove this index_truncate_tuple() call, because
>> the first key of any inner (or root) page doesn't need any key at all.
>>
>
> Anyway, I think truncation happens if the page is at leaf level and
> that is ensured by check, so I think we can't remove this:
> + if (indnkeyatts != indnatts && P_ISLEAF(pageop))
>
>
> -- I have one more question regarding this truncate high-key concept.
> I think if high key is truncated, then during insertion, for cases
> like below it move to next page, whereas current page needs to be
> splitted.
>
> Assume index on c1,c2,c3 and c2,c3 are including columns.
>
> Actual high key on leaf Page X -
> 3, 2 , 2
> Truncated high key on leaf Page X
> 3
>
> New insertion key
> 3, 1, 2
>
> Now, I think for such cases during insertion if the page X doesn't
> have enough space, it will move to next page whereas ideally, it
> should split current page.  Refer function _bt_findinsertloc() for
> this logic.
>

Basically, here I wanted to know whether we maintain ordering for
keys with respect to included columns while storing them (in the above
example, do we ensure that 3,1,2 is always stored before 3,2,2)?

>
>
> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>

I have tried this test on my Windows machine only.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
20.09.2016 08:21, Amit Kapila:
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 28.08.2016 09:13, Amit Kapila:
>>
>> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>
>>
>> So the patch is correct.
>> We can go further and remove this index_truncate_tuple() call, because
>> the first key of any inner (or root) page doesn't need any key at all.
>
> Anyway, I think truncation happens if the page is at leaf level and
> that is ensured by the check, so I think we can't remove this:
> + if (indnkeyatts != indnatts && P_ISLEAF(pageop))
>
> -- I have one more question regarding this truncate high-key concept.
> I think if the high key is truncated, then during insertion, for cases
> like the one below it moves to the next page, whereas the current page
> needs to be split.
>
> Assume an index on c1, c2, c3 where c2, c3 are included columns.
>
> Actual high key on leaf Page X -
> 3, 2, 2
> Truncated high key on leaf Page X
> 3
>
> New insertion key
> 3, 1, 2
>
> Now, I think for such cases during insertion if the page X doesn't
> have enough space, it will move to the next page whereas ideally it
> should split the current page.  Refer to _bt_findinsertloc() for
> this logic.

Thank you again for the review.

The problem seems really tricky, but the answer is simple.
We store included columns unordered. It was mentioned somewhere in
this thread. Let me give you an example:

create table t (i int, p point);
create index on t (i) including (p);

The "point" data type doesn't have any opclass for btree.
Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
We have no idea what the "correct order" for this attribute is.
So the answer is "it doesn't matter". When searching the index,
we know that only key attrs are ordered, so only they can be used
in the scan key. Other columns are filtered after retrieving the data.

explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
                            QUERY PLAN                            
-------------------------------------------------------------------
 Index Only Scan using idx on t  (cost=0.14..4.20 rows=1 width=20)
   Index Cond: (i = 0)
   Filter: (p <@ '<(0,0),2>'::circle)


The same approach is used for included columns of any type, even if
their data type has a btree opclass.
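A sketch of the same behavior with a type that does have a btree opclass
(hypothetical table; the plan is the shape I'd expect, not verbatim output):

create table t2 (i int, i2 int);
create index on t2 (i) including (i2);
explain select i, i2 from t2 where i = 0 and i2 = 5;
-- expected shape:
--  Index Only Scan using t2_i_i2_idx on t2
--    Index Cond: (i = 0)
--    Filter: (i2 = 5)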

> Is this high-key truncation concept needed for the correctness of the
> patch, or is it just to save space in the index?  If you need this,
> then I think nbtree/README needs to be updated.

Now it's done only for space saving. We never check included attributes
in non-leaf pages, so why store them? Especially if we assume that
included attributes can be quite long.
There is already a note in the documentation:

+        It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+        also be used for non-unique indexes, as any columns which are not required
+        for the searching or ordering of records can be included in the
+        <literal>INCLUDING</> clause, which can slightly reduce the size of the index,
+        due to storing included attributes only in leaf index pages.
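A quick way to observe the leaf-only storage effect is to compare index
sizes directly (a sketch; the index names are hypothetical, assumed to
cover the same columns as keys vs. as included columns):

select relname, pg_size_pretty(pg_relation_size(oid))
from pg_class
where relname in ('idx_all_keys', 'idx_with_including');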

What should I add to the README (or to the documentation)
to make it more understandable?

> -- I am getting an assertion failure when I use this patch with a database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/heaptuple.c", Line: 532)
> LOG:  server process (PID 1404) was terminated by exception 0x80000003
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal value.
> LOG:  terminating any other active server processes

That is expected behavior, because the catalog versions are not compatible.
But I wonder why there was no message about that?
I suppose that's because CATALOG_VERSION_NO was outdated in my
patch. As far as I know, the committer will change it before the commit.
Try the new patch with the updated value. It should fail with a message
about incompatible versions.

If that is not the reason for your assertion failure, please provide
more information to reproduce the situation.

> --
> @@ -1260,14 +1262,14 @@ RelationInitIndexAccessInfo(Relation relation)
>   * Allocate arrays to hold data
>   */
>   relation->rd_opfamily = (Oid *)
> - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> + MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
>   relation->rd_opcintype = (Oid *)
> - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> + MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
>   amsupport = relation->rd_amroutine->amsupport;
>   if (amsupport > 0)
>   {
> - int nsupport = natts * amsupport;
> + int nsupport = indnatts * amsupport;
>   relation->rd_support = (RegProcedure *)
>   MemoryContextAllocZero(indexcxt, nsupport * sizeof(RegProcedure));
> @@ -1281,10 +1283,10 @@ RelationInitIndexAccessInfo(Relation relation)
>   }
>   relation->rd_indcollation = (Oid *)
> - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> + MemoryContextAllocZero(indexcxt, indnatts * sizeof(Oid));
>
> Can you add a comment in the above code, or some other related place, as to
> why you need some attributes in the relcache entry of size indnkeyatts and
> others of size indnatts?

Done. I hope that's enough.
The same logic is used in DefineIndex(), which already has comments.

> --
> @@ -63,17 +63,26 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
> {
>   ScanKey skey;
>   TupleDesc itupdesc;
> - int natts;
> + int indnatts,
> +     indnkeyatts;
>   int16 *indoption;
>   int i;
>
>   itupdesc = RelationGetDescr(rel);
> - natts = RelationGetNumberOfAttributes(rel);
> + indnatts = IndexRelationGetNumberOfAttributes(rel);
> + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
>   indoption = rel->rd_indoption;
>
> - skey = (ScanKey) palloc(natts * sizeof(ScanKeyData));
> + Assert(indnkeyatts != 0);
> + Assert(indnkeyatts <= indnatts);
>
> Here I think you need to declare indnatts as PG_USED_FOR_ASSERTS_ONLY,
> otherwise it will give a warning on some platforms.
Fixed. Thank you for the advice; I didn't know about this macro before.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 20.09.2016 08:21, Amit Kapila:
>
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> 28.08.2016 09:13, Amit Kapila:
>
>
> The problem seems really tricky, but the answer is simple.
> We store included columns unordered. It was mentioned somewhere in
> this thread.
>

Is there any fundamental problem in storing them in an ordered way?  I
mean to say, you need to store all the column values on the leaf
page anyway, so why can't we find the exact location for the complete key?
Basically, use the truncated key to reach the leaf level and then use the
complete key to find the exact location to store the key.  I might be
missing something here, but if we can store them in an ordered fashion,
we can use them even for queries containing ORDER BY (where the ORDER BY
contains included columns).
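To make the suggestion concrete, a sketch of the query shape that ordered
storage would enable (hypothetical names):

create table t (a int, b int);
create index on t (a) including (b);
-- with ordered INCLUDING storage this could be satisfied by the index
-- alone; under the current patch the planner must add a Sort node for b:
select a, b from t where a = 0 order by a, b;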

> Let me give you an example:
>
> create table t (i int, p point);
> create index on t (i) including (p);
> "point" data type doesn't have any opclass for btree.
> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
> We have no idea what is the "correct order" for this attribute.
> So the answer is "it doesn't matter". When searching in index,
> we know that only key attrs are ordered, so only them can be used
> in scankey. Other columns are filtered after retrieving data.
>
> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
>                             QUERY PLAN
> -------------------------------------------------------------------
>  Index Only Scan using idx on t  (cost=0.14..4.20 rows=1 width=20)
>    Index Cond: (i = 0)
>    Filter: (p <@ '<(0,0),2>'::circle)
>

I think the reason for using Filter here is that we don't keep
included columns in scan keys. Can't we think of having them in scan
keys, but use only key columns in the scan key to reach the leaf level,
and then use the complete scan key at the leaf level?

>
> The same approach is used for included columns of any type, even if
> their data types have opclass.
>
> Is this truncation concept of high key needed for correctness of patch
> or is it just to save space in index?   If you need this, then I think
> nbtree/Readme needs to be updated.
>
>
> Now it's done only for space saving. We never check included attributes
> in non-leaf pages, so why store them? Especially if we assume that included
> attributes can be quite long.
> There is already a note in documentation:
>
> +        It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
> +        also be used for non-unique indexes, as any columns which are not required
> +        for the searching or ordering of records can be included in the
> +        <literal>INCLUDING</> clause, which can slightly reduce the size of
> the index,
> +        due to storing included attributes only in leaf index pages.
>

Okay, thanks for clarification.

> What should I add to README (or to documentation),
> to make it more understandable?
>

Maybe add the data representation, i.e. that only leaf pages contain all
the columns, and how the scan works.  I think you can see if you can
extend the "Notes About Data Representation" and/or "Other Things That Are
Handy to Know" sections in the existing README.

> -- I am getting Assertion failure when I use this patch with database
> created with a build before this patch.  However, if I create a fresh
> database it works fine.  Assertion failure details are as below:
>
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started
> TRAP: unrecognized TOAST vartag("((bool) 1)", File:
> "src/backend/access/common/h
> eaptuple.c", Line: 532)
> LOG:  server process (PID 1404) was terminated by exception 0x80000003
> HINT:  See C include file "ntstatus.h" for a description of the hexadecimal
> valu
> e.
> LOG:  terminating any other active server processes
>
>
> That is expected behavior, because catalog versions are not compatible.
> But I wonder why there was no message about that?
> I suppose, that's because CATALOG_VERSION_NO was outdated in my
> patch. As well as I know, committer will change it before the commit.
> Try new patch with updated value. It should fail with a message about
> incompatible versions.
>

Yeah, that must be the reason, but let's not change it now, otherwise we
will face conflicts while applying the patch.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
24.09.2016 15:36, Amit Kapila:
> On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 20.09.2016 08:21, Amit Kapila:
>>
>> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>
>> 28.08.2016 09:13, Amit Kapila:
>>
>>
>> The problem seems really tricky, but the answer is simple.
>> We store included columns unordered. It was mentioned somewhere in
>> this thread.
>>
> Is there any fundamental problem in storing them in ordered way?  I
> mean to say, you need to anyway store all the column values on leaf
> page, so why can't we find the exact location for the complete key.
> Basically use truncated key to reach to leaf level and then use the
> complete key to find the exact location to store the key.  I might be
> missing some thing here, but if we can store them in ordered fashion,
> we can use them even for queries containing ORDER BY (where ORDER BY
> contains included columns).
>

I'd say that the reason for not using included columns in any
operations which require comparisons is that they don't have an opclass.
Let's go back to the example of points.
This data type doesn't have any opclass for B-tree, for fundamental
reasons.
And we cannot apply _bt_compare() and others to this attribute, so
we don't include it in the scan key.

create table t (i int, i2 int, p point);
create index idx1 on t (i) including (i2);
create index idx2 on t (i) including (p);
create index idx3 on t (i) including (i2, p);
create index idx4 on t (i) including (p, i2);

You can keep tuples ordered in idx1, but not in idx2; partially ordered
in idx3, but not in idx4.

At the very beginning of this thread [1], I suggested using opclasses
where possible.
Exactly the same idea you're thinking about. But after a short
discussion, we came to the conclusion that it would require many
additional checks and would be too complicated,
at least for the initial patch.

>> Let me give you an example:
>>
>> create table t (i int, p point);
>> create index on t (i) including (p);
>> "point" data type doesn't have any opclass for btree.
>> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
>> We have no idea what is the "correct order" for this attribute.
>> So the answer is "it doesn't matter". When searching in index,
>> we know that only key attrs are ordered, so only them can be used
>> in scankey. Other columns are filtered after retrieving data.
>>
>> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
>>                              QUERY PLAN
>> -------------------------------------------------------------------
>>   Index Only Scan using idx on t  (cost=0.14..4.20 rows=1 width=20)
>>     Index Cond: (i = 0)
>>     Filter: (p <@ '<(0,0),2>'::circle)
>>
> I think here reason for using Filter is that because we don't keep
> included columns in scan keys, can't we think of having them in scan
> keys, but use only key columns in scan key to reach till leaf level
> and then use complete scan key at leaf level.

>> What should I add to README (or to documentation),
>> to make it more understandable?
>>
> May be add the data representation like only leaf pages contains all
> the columns and how the scan works.  I think you can see if you can
> extend "Notes About Data Representation" and or "Other Things That Are
> Handy to Know" sections in existing README.

Ok, I'll write it in a few days.


[1] https://www.postgresql.org/message-id/55F84DF4.5030207@postgrespro.ru

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> Is there any fundamental problem in storing them in ordered way?  I
>> mean to say, you need to anyway store all the column values on leaf
>> page, so why can't we find the exact location for the complete key.
>> Basically use truncated key to reach to leaf level and then use the
>> complete key to find the exact location to store the key.  I might be
>> missing some thing here, but if we can store them in ordered fashion,
>> we can use them even for queries containing ORDER BY (where ORDER BY
>> contains included columns).
>
> I'd say that the reason for not using included columns in any
> operations which require comparisons, is that they don't have opclass.
> Let's go back to the example of points.
> This data type don't have any opclass for B-tree, because of fundamental
> reasons.
> And we can not apply _bt_compare() and others to this attribute, so
> we don't include it to scan key.
>
> create table t (i int, i2 int, p point);
> create index idx1 on t (i) including (i2);
> create index idx2 on t (i) including (p);
> create index idx3 on t (i) including (i2, p);
> create index idx4 on t (i) including (p, i2);
>
> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
> idx3, but not for idx4.

Yeah, I think we shouldn't go there.  I mean, once you start ordering
by INCLUDING columns, then you're going to need to include them in
leaf pages because otherwise you can't actually guarantee that they
are in the right order.  And then you have to wonder why an INCLUDING
column is any different from a non-INCLUDING column.  It seems best to
make a firm rule that INCLUDING columns are there only for the values,
not for ordering.  That rule is simple and clear, which is a good
thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Michael Paquier
Date:
On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Ok, I'll write it in a few days.

Marked as returned with feedback per last emails exchanged.
-- 
Michael



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
03.10.2016 05:22, Michael Paquier:
> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Ok, I'll write it in a few days.
> Marked as returned with feedback per last emails exchanged.

The only complaint about this patch was the lack of a README,
which is fixed now (see the attachment). So I added it to the new CF,
marked as ready for committer.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Tue, Sep 27, 2016 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>>> Is there any fundamental problem in storing them in ordered way?  I
>>> mean to say, you need to anyway store all the column values on leaf
>>> page, so why can't we find the exact location for the complete key.
>>> Basically use truncated key to reach to leaf level and then use the
>>> complete key to find the exact location to store the key.  I might be
>>> missing some thing here, but if we can store them in ordered fashion,
>>> we can use them even for queries containing ORDER BY (where ORDER BY
>>> contains included columns).
>>
>> I'd say that the reason for not using included columns in any
>> operations which require comparisons, is that they don't have opclass.
>> Let's go back to the example of points.
>> This data type don't have any opclass for B-tree, because of fundamental
>> reasons.
>> And we can not apply _bt_compare() and others to this attribute, so
>> we don't include it to scan key.
>>
>> create table t (i int, i2 int, p point);
>> create index idx1 on t (i) including (i2);
>> create index idx2 on t (i) including (p);
>> create index idx3 on t (i) including (i2, p);
>> create index idx4 on t (i) including (p, i2);
>>
>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>> idx3, but not for idx4.
>
> Yeah, I think we shouldn't go there.  I mean, once you start ordering
> by INCLUDING columns, then you're going to need to include them in
> leaf pages because otherwise you can't actually guarantee that they
> are in the right order.
>

I am not sure what you mean by the above, because the patch already stores
INCLUDING columns in leaf pages.

>  And then you have to wonder why an INCLUDING
> column is any different from a non-INCLUDING column.  It seems best to
> make a firm rule that INCLUDING columns are there only for the values,
> not for ordering.  That rule is simple and clear, which is a good
> thing.
>

Okay, we can make that a firm rule, but I think the reasoning behind it
should be clear.  As far as I get it from reading some of the mails in
this thread, it is because some of the other databases don't seem to
support ordering for included columns, or because supporting it would
complicate the code.  One point we should keep in mind is that the
suggestion from other databases to include many columns in the INCLUDING
clause to get index-only scans might not hold equally well for
PostgreSQL, because it can turn many HOT updates into non-HOT updates.
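To spell out the HOT point (a sketch; names are hypothetical):

create table orders (id int, customer int, note text);
create unique index on orders (id) including (note);
-- updating only a non-indexed column can be HOT:
update orders set customer = 42 where id = 1;
-- updating the INCLUDING column cannot be HOT, because the new row
-- version needs a new index entry:
update orders set note = 'changed' where id = 1;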


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I'd say that the reason for not using included columns in any
>>> operations which require comparisons, is that they don't have opclass.
>>> Let's go back to the example of points.
>>> This data type don't have any opclass for B-tree, because of fundamental
>>> reasons.
>>> And we can not apply _bt_compare() and others to this attribute, so
>>> we don't include it to scan key.
>>>
>>> create table t (i int, i2 int, p point);
>>> create index idx1 on t (i) including (i2);
>>> create index idx2 on t (i) including (p);
>>> create index idx3 on t (i) including (i2, p);
>>> create index idx4 on t (i) including (p, i2);
>>>
>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>>> idx3, but not for idx4.
>>
>> Yeah, I think we shouldn't go there.  I mean, once you start ordering
>> by INCLUDING columns, then you're going to need to include them in
>> leaf pages because otherwise you can't actually guarantee that they
>> are in the right order.
>
> I am not sure what you mean by above, because patch already stores
> INCLUDING columns in leaf pages.

Sorry, I meant non-leaf pages.

>>  And then you have to wonder why an INCLUDING
>> column is any different from a non-INCLUDING column.  It seems best to
>> make a firm rule that INCLUDING columns are there only for the values,
>> not for ordering.  That rule is simple and clear, which is a good
>> thing.
>
> Okay, we can make that firm rule, but I think reasoning behind that
> should be clear.  As far as I get it by reading some of the mails in
> this thread, it is because some of the other databases doesn't seem to
> support ordering for included columns or supporting the same can
> complicate the code.  One point, we should keep in mind that
> suggestion for including many other columns in INCLUDING clause to use
> Index Only scans by other databases might not hold equally good for
> PostgreSQL because it can lead to many HOT updates as non-HOT updates.

Right.  Looking back, the originally articulated rationale for this
patch was that you might want a single index that is UNIQUE ON (a) but
also INCLUDING (b) rather than two indexes, a unique index on (a) and
a non-unique index on (a, b).  In that case, the patch is a
straight-up win: you get the same number of HOT updates either way,
but you don't use as much disk space, or spend as much CPU time and
WAL updating your indexes.
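In SQL terms, a sketch restating that rationale (names are hypothetical):

-- the old way: two indexes with repeated data
create unique index t_a_key on t (a);
create index t_a_b_idx on t (a, b);
-- the new way: one covering unique index
create unique index t_a_incl_b on t (a) including (b);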

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
03.10.2016 15:29, Anastasia Lubennikova:
> 03.10.2016 05:22, Michael Paquier:
>> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>> Ok, I'll write it in a few days.
>> Marked as returned with feedback per last emails exchanged.
>
> The only complaint about this patch was a lack of README,
> which is fixed now (see the attachment). So, I added it to new CF,
> marked as ready for committer.

One more fix for pg_upgrade.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> I'd say that the reason for not using included columns in any
>>>> operations which require comparisons, is that they don't have opclass.
>>>> Let's go back to the example of points.
>>>> This data type don't have any opclass for B-tree, because of fundamental
>>>> reasons.
>>>> And we can not apply _bt_compare() and others to this attribute, so
>>>> we don't include it to scan key.
>>>>
>>>> create table t (i int, i2 int, p point);
>>>> create index idx1 on t (i) including (i2);
>>>> create index idx2 on t (i) including (p);
>>>> create index idx3 on t (i) including (i2, p);
>>>> create index idx4 on t (i) including (p, i2);
>>>>
>>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for
>>>> idx3, but not for idx4.
>>>
>>> Yeah, I think we shouldn't go there.  I mean, once you start ordering
>>> by INCLUDING columns, then you're going to need to include them in
>>> leaf pages because otherwise you can't actually guarantee that they
>>> are in the right order.
>>
>> I am not sure what you mean by above, because patch already stores
>> INCLUDING columns in leaf pages.
>
> Sorry, I meant non-leaf pages.
>

Okay, but in that case I think we don't need to store INCLUDING
columns in non-leaf pages to get the exact ordering.  As mentioned
upthread, we can use the truncated scan key to reach the leaf level and
then use the complete key to find the exact location to store the key.
This is only possible if there exists an opclass for the columns that are
covered as part of the INCLUDING clause.  So, we could allow ORDER BY to
use an index scan only if the columns covered in the INCLUDING clause
have a btree opclass.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Wed, Oct 5, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but in that case I think we don't need to store including
> columns in non-leaf pages to get the exact ordering.  As mentioned
> upthread, we can use the truncated scan key to reach the leaf level and
> then use the complete key to find the exact location to store the key.
> This is only possible if there exists an opclass for the columns that are
> covered as part of the including clause.  So, we can allow "order by" to
> use an index scan only if the columns covered in the included clause have
> an opclass for btree.

But what if there are many pages full of keys that have the same
values for the non-INCLUDING columns?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 5, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, but in that case I think we don't need to store including
>> columns in non-leaf pages to get the exact ordering.  As mentioned
>> upthread, we can use the truncated scan key to reach the leaf level and
>> then use the complete key to find the exact location to store the key.
>> This is only possible if there exists an opclass for the columns that are
>> covered as part of the including clause.  So, we can allow "order by" to
>> use an index scan only if the columns covered in the included clause have
>> an opclass for btree.

> But what if there are many pages full of keys that have the same
> values for the non-INCLUDING columns?

I concur with Robert that INCLUDING columns should be just dead weight
as far as the index is concerned.  Even if opclass information is
available for them, it's overcomplication for too little return.  We do
not need three classes of columns in an index.
        regards, tom lane



Re: WIP: Covering + unique indexes.

From
Peter Eisentraut
Date:
On 10/4/16 10:47 AM, Anastasia Lubennikova wrote:
> 03.10.2016 15:29, Anastasia Lubennikova:
>> 03.10.2016 05:22, Michael Paquier:
>>> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
>>> <a.lubennikova@postgrespro.ru> wrote:
>>>> Ok, I'll write it in a few days.
>>> Marked as returned with feedback per last emails exchanged.
>>
>> The only complaint about this patch was a lack of README,
>> which is fixed now (see the attachment). So, I added it to new CF,
>> marked as ready for committer.
> 
> One more fix for pg_upgrade.

Latest patch doesn't apply.  See also review by Brad DeJong.  I'm
setting it back to Waiting.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes.

From
Haribabu Kommi
Date:


On Sat, Nov 19, 2016 at 8:38 AM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 10/4/16 10:47 AM, Anastasia Lubennikova wrote:
> 03.10.2016 15:29, Anastasia Lubennikova:
>> 03.10.2016 05:22, Michael Paquier:
>>> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
>>> <a.lubennikova@postgrespro.ru> wrote:
>>>> Ok, I'll write it in a few days.
>>> Marked as returned with feedback per last emails exchanged.
>>
>> The only complaint about this patch was a lack of README,
>> which is fixed now (see the attachment). So, I added it to new CF,
>> marked as ready for committer.
>
> One more fix for pg_upgrade.

Latest patch doesn't apply.  See also review by Brad DeJong.  I'm
setting it back to Waiting.

Closed in 2016-11 commitfest with "returned with feedback" status.
Please feel free to update the status once you submit the updated patch.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
An updated version of the patch is attached. Besides the code itself, it contains a new regression test,
documentation updates, and a paragraph in nbtree/README.

The syntax was changed - the keyword is now INCLUDE, as in other databases.

Below you can see the answers to the latest review by Brad DeJong.

Given "create table foo (a int, b int, c int, d int)" and "create unique index foo_a_b on foo (a, b) including (c)".

                                                   index only?   heap tuple needed?
select a, b, c from foo where a = 1                    yes              no
select a, b, d from foo where a = 1                    no               yes
select a, b    from foo where a = 1 and c = 1          ?                ?

The answer for the last query is:
select a, b    from foo where a = 1 and c = 1          yes              no


As you can see in EXPLAIN, this query doesn't need a heap tuple. We can fetch the tuple using the index-only scan strategy,
because btree never uses a lossy data representation (i.e., it stores the same data as the heap). Afterwards we apply the
Filter (c = 1) to the fetched tuple.
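For reference, here is a setup that should reproduce the plan below (a sketch; the INSERT and VACUUM are my assumptions, needed to get an all-visible heap so that Heap Fetches can be 0):

```
CREATE TABLE foo (a int, b int, c int, d int);
CREATE UNIQUE INDEX foo_a_b ON foo (a, b) INCLUDE (c);
INSERT INTO foo SELECT g, g, g, g FROM generate_series(1, 10000) g;
VACUUM ANALYZE foo;  -- sets the visibility map, enabling index-only scans
```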

explain analyze select a, b    from foo where a = 1 and c = 1;
                                                    QUERY PLAN                                                   
------------------------------------------------------------------------------------------------------------------
 Index Only Scan using foo_a_b on foo  (cost=0.28..4.30 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=1)
   Index Cond: (a = 1)
   Filter: (c = 1)
   Heap Fetches: 0
 Planning time: 0.344 ms
 Execution time: 0.073 ms


Are included columns counted against the 32 column and 2712 byte index limits? I did not see either explicitly mentioned in the discussion or the documentation. I only ask because in SQL Server the limits are different for include columns.

These limits remain unchanged, since included attributes are stored in the very same way as regular index attributes.
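For example, here is a quick way to check the column limit empirically (a sketch; the table is generated dynamically, and the exact error text is from memory, so treat it as an assumption):

```
DO $$
DECLARE
    cols text;
    keys text;
BEGIN
    SELECT string_agg('c' || i || ' int', ', ' ORDER BY i),
           string_agg('c' || i, ', ' ORDER BY i) FILTER (WHERE i <= 32)
      INTO cols, keys
      FROM generate_series(1, 33) AS i;
    EXECUTE format('CREATE TABLE wide_t (%s)', cols);
    -- 32 key columns plus 1 included column = 33 attributes in total,
    -- so this should fail with something like
    -- "cannot use more than 32 columns in an index".
    EXECUTE format('CREATE INDEX wide_idx ON wide_t (%s) INCLUDE (c33)', keys);
END $$;
```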

1. syntax - on 2016-08-14, Andrey Borodin wrote "I think MS SQL syntax INCLUDE instead of INCLUDING would be better". I would go further than that. This feature is already supported by 2 of the top 5 SQL databases and they both use INCLUDE. Using different syntax because of an internal implementation detail seems short sighted.

Done.
4. documentation - minor items (these are not actual diffs)
Thank you. All issues are fixed.

5. coding
    parse_utilcmd.c
        @@ -1334,6 +1334,38 @@ ...
        The loop is handling included columns separately.
        The loop adds the collation name for each included column if it is not the default.

        Q: Given that the create index/create constraint syntax does not allow a collation to be specified for included columns, how can you ever have a non-default collation?

        @@ -1776,6 +1816,7 @@
        The comment here says "NOTE that exclusion constraints don't support included nonkey attributes". However, the paragraph on INCLUDING in create_index.sgml says "It's the same for the other constraints (PRIMARY KEY and EXCLUDE)".

Good point.
In this version I added syntax for EXCLUDE constraints with INCLUDE.
Though the names look weird, it works just like the other constraints. So the documentation is correct now.
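For illustration, here is roughly what the constraint forms look like now (a sketch with made-up names):

```
-- UNIQUE constraint whose backing index carries a non-key column
CREATE TABLE ct (
    a int,
    b int,
    c int,
    UNIQUE (a, b) INCLUDE (c)   -- uniqueness enforced on (a, b) only
);

-- the same clause on a PRIMARY KEY
CREATE TABLE pt (
    a int,
    b int,
    PRIMARY KEY (a) INCLUDE (b)
);
```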
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Erik Rijkers
Date:
On 2017-01-09 16:02, Anastasia Lubennikova wrote:
>  include_columns_10.0_v1.patch

The patch applies, compiles, and make check is OK.

It yields nice perfomance gains and I haven't been able to break 
anything (yet).

Some edits of the sgml-changes are attached.

Thank you for this very useful improvement.

Erik Rijkers







Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Mon, Jan 9, 2017 at 8:32 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> An updated version of the patch is attached. Besides the code itself, it
> contains a new regression test, documentation updates, and a paragraph
> in nbtree/README.
>

The latest patch doesn't apply cleanly.

Few assorted comments:
1.
@@ -4806,16 +4810,25 @@ RelationGetIndexAttrBitmap(Relation relation,
IndexAttrBitmapKind attrKind)
{
..
+ /*
+ * Since we have covering indexes with non-key columns,
+ * we must handle them accurately here. non-key columns
+ * must be added into indexattrs, since they are in index,
+ * and HOT-update shouldn't miss them.
+ * Obviously, non-key columns couldn't be referenced by
+ * foreign key or identity key. Hence we do not include
+ * them into uindexattrs and idindexattrs bitmaps.
+ */
  if (attrnum != 0)
  {
      indexattrs = bms_add_member(indexattrs,
          attrnum - FirstLowInvalidHeapAttributeNumber);

-     if (isKey)
+     if (isKey && i < indexInfo->ii_NumIndexKeyAttrs)
          uindexattrs = bms_add_member(uindexattrs,
              attrnum - FirstLowInvalidHeapAttributeNumber);

-     if (isIDKey)
+     if (isIDKey && i < indexInfo->ii_NumIndexKeyAttrs)
          idindexattrs = bms_add_member(idindexattrs,
              attrnum - FirstLowInvalidHeapAttributeNumber);
..
}

Can included columns be part of primary key?  If not, then won't you
need a check similar to above for Primary keys?


2.
+ int indnkeyattrs; /* number of index key attributes*/
+ int indnattrs; /* total number of index attributes*/
+ Oid   *indkeys; /* In spite of the name 'indkeys' this field
+ * contains both key and nonkey attributes*/

Before the end of the comment, one space is needed.

3.
}
-
  /*
   * For UNIQUE and PRIMARY KEY, we just have a list of column names.
   *

Looks like spurious line removal.

4.
+ IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
  INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
  INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
  INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
 
@@ -3431,17 +3433,18 @@ ConstraintElem:
  n->initially_valid = !n->skip_validation;
  $$ = (Node *)n;
  }
- | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
+ | UNIQUE '(' columnList ')' opt_c_including opt_definition OptConsTableSpace

If we want to use INCLUDE in syntax, then it might be better to keep
the naming reflect the same.  For ex. instead of opt_c_including we
should use opt_c_include.

5.
+opt_c_including: INCLUDE optcincluding { $$ = $2; }
+ | /* EMPTY */ { $$
= NIL; }
+ ;
+
+optcincluding : '(' columnList ')' { $$ = $2; }
+ ;
+

It seems optcincluding is redundant, why can't we directly specify
along with INCLUDE?  If there was some other use of optcincluding or
if there is a complicated definition of the same then it would have
made sense to define it separately.  We have a lot of similar usage in
gram.y, refer opt_in_database.

6.
+optincluding : '(' index_including_params ')' { $$ = $2; }
+ ;
+opt_including: INCLUDE optincluding { $$ = $2; }
+ | /* EMPTY */ { $$ = NIL; }
+ ;

Here the ordering of above clauses seems to be another way.  Also, the
naming of both seems to be confusing. I think either we can eliminate
*optincluding* by following suggestion similar to the previous point
or name them somewhat clearly (like opt_include_clause and
opt_include_params/opt_include_list).

7. Can you include doc fixes suggested by Erik Rijkers [1]?  I have
checked them and they seem to be better than what is there in the
patch.


[1] - https://www.postgresql.org/message-id/3863bca17face15c6acd507e0173a6dc%40xs4all.nl

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
14.02.2017 15:46, Amit Kapila:
> On Mon, Jan 9, 2017 at 8:32 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> An updated version of the patch is attached. Besides the code itself, it
>> contains a new regression test, documentation updates, and a paragraph
>> in nbtree/README.
>>
> The latest patch doesn't apply cleanly.
Fixed.
> Few assorted comments:
> 1.
> @@ -4806,16 +4810,25 @@ RelationGetIndexAttrBitmap(Relation relation,
> IndexAttrBitmapKind attrKind)
> {
> ..
> + /*
> + * Since we have covering indexes with non-key columns,
> + * we must handle them accurately here. non-key columns
> + * must be added into indexattrs, since they are in index,
> + * and HOT-update shouldn't miss them.
> + * Obviously, non-key columns couldn't be referenced by
> + * foreign key or identity key. Hence we do not include
> + * them into uindexattrs and idindexattrs bitmaps.
> + */
>    if (attrnum != 0)
>    {
>    indexattrs = bms_add_member(indexattrs,
>      attrnum -
> FirstLowInvalidHeapAttributeNumber);
>
> - if (isKey)
> + if (isKey && i < indexInfo->ii_NumIndexKeyAttrs)
>    uindexattrs = bms_add_member(uindexattrs,
>      attrnum -
> FirstLowInvalidHeapAttributeNumber);
>
> - if (isIDKey)
> + if (isIDKey && i < indexInfo->ii_NumIndexKeyAttrs)
>    idindexattrs = bms_add_member(idindexattrs,
>      attrnum -
> FirstLowInvalidHeapAttributeNumber);
> ..
> }
>
> Can included columns be part of primary key?  If not, then won't you
> need a check similar to above for Primary keys?
No, they cannot be a part of any constraint, so I fixed the check.

> 2.
> + int indnkeyattrs; /* number of index key attributes*/
> + int indnattrs; /* total number of index attributes*/
> + Oid   *indkeys; /* In spite of the name 'indkeys' this field
> + * contains both key and nonkey
> attributes*/
>
> Before the end of the comment, one space is needed.
>
> 3.
> }
> -
>    /*
>    * For UNIQUE and PR
> IMARY KEY, we just have a list of column names.
>    *
>
> Looks like spurious line removal.
Both are fixed.
> 4.
> + IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
>    INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
>    INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
>    INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
> @@ -3431,17 +3433,18 @@ ConstraintElem:
>    n->initially_valid = !n->skip_validation;
>    $$ = (Node *)n;
>    }
> - | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
> + | UNIQUE '(' columnList ')' opt_c_including opt_definition OptConsTableSpace
>
> If we want to use INCLUDE in syntax, then it might be better to keep
> the naming reflect the same.  For ex. instead of opt_c_including we
> should use opt_c_include.
>
> 5.
> +opt_c_including: INCLUDE optcincluding { $$ = $2; }
> + | /* EMPTY */ { $$
> = NIL; }
> + ;
> +
> +optcincluding : '(' columnList ')' { $$ = $2; }
> + ;
> +
>
> It seems optcincluding is redundant, why can't we directly specify
> along with INCLUDE?  If there was some other use of optcincluding or
> if there is a complicated definition of the same then it would have
> made sense to define it separately.  We have a lot of similar usage in
> gram.y, refer opt_in_database.
>
> 6.
> +optincluding : '(' index_including_params ')' { $$ = $2; }
> + ;
> +opt_including: INCLUDE optincluding { $$ = $2; }
> + | /* EMPTY */ { $$ = NIL; }
> + ;
>
> Here the ordering of above clauses seems to be another way.  Also, the
> naming of both seems to be confusing. I think either we can eliminate
> *optincluding* by following suggestion similar to the previous point
> or name them somewhat clearly (like opt_include_clause and
> opt_include_params/opt_include_list).

Thank you for this suggestion. I had just written the code looking at 
the examples around, but the optincluding and optcincluding clauses 
do seem to be redundant. I've cleaned up the code.

> 7. Can you include doc fixes suggested by Erik Rijkers [1]?  I have
> checked them and they seem to be better than what is there in the
> patch.

Yes, I've included them in the last version of the patch.
> [1] - https://www.postgresql.org/message-id/3863bca17face15c6acd507e0173a6dc%40xs4all.nl

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Amit Kapila
Date:
On Thu, Feb 16, 2017 at 6:43 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 14.02.2017 15:46, Amit Kapila:
>
>
>> 4.
>> + IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
>> INCLUDE
>>    INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
>>    INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
>>    INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
>> @@ -3431,17 +3433,18 @@ ConstraintElem:
>>    n->initially_valid = !n->skip_validation;
>>    $$ = (Node *)n;
>>    }
>> - | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
>> + | UNIQUE '(' columnList ')' opt_c_including opt_definition
>> OptConsTableSpace
>>
>> If we want to use INCLUDE in syntax, then it might be better to keep
>> the naming reflect the same.  For ex. instead of opt_c_including we
>> should use opt_c_include.
>>
>
>
> Thank you for this suggestion. I had just written the code looking at the
> examples around, but the optincluding and optcincluding clauses do seem
> to be redundant. I've cleaned up the code.
>

I think you have cleaned only in gram.y as I could see the references
to 'including' in other parts of code.  For ex, see below code:
@@ -2667,6 +2667,7 @@ _copyConstraint(const Constraint *from)
  COPY_NODE_FIELD(raw_expr);
  COPY_STRING_FIELD(cooked_expr);
  COPY_NODE_FIELD(keys);
+ COPY_NODE_FIELD(including);
  COPY_NODE_FIELD(exclusions);
  COPY_NODE_FIELD(options);
  COPY_STRING_FIELD(indexname);
@@ -3187,6 +3188,7 @@ _copyIndexStmt(const IndexStmt *from)
  COPY_STRING_FIELD(accessMethod);
  COPY_STRING_FIELD(tableSpace);
  COPY_NODE_FIELD(indexParams);
+ COPY_NODE_FIELD(indexIncludingParams);


@@ -425,6 +425,13 @@ ConstructTupleDescriptor(Relation heapRelation,
+ /*
+ * Code below is concerned to the opclasses which are not used
+ * with the included columns.
+ */
+ if (i >= indexInfo->ii_NumIndexKeyAttrs)
+ continue;
+

There seems to be code below the above check which is not directly
related to opclasses, so not sure if you have missed that or is there
any other reason to ignore that.  I am referring to following code in
the same function after the above check:
/*
 * If a key type different from the heap value is specified, update
 * the type-related fields in the index tupdesc.
 */
if (OidIsValid(keyType) &&
    keyType != to->atttypid)


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WIP: Covering + unique indexes.

From
Peter Eisentraut
Date:
On 2/16/17 08:13, Anastasia Lubennikova wrote:
> @@ -629,7 +630,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>  
>      HANDLER HAVING HEADER_P HOLD HOUR_P
>  
> -    IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
> +    IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
>      INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
>      INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
>      INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION

I think your syntax would read no worse, possibly even better, if you
just used the existing INCLUDING keyword.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
26.02.2017 06:09, Amit Kapila:
On Thu, Feb 16, 2017 at 6:43 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
14.02.2017 15:46, Amit Kapila:


4.
+ IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
  INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
  INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
  INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
@@ -3431,17 +3433,18 @@ ConstraintElem:
  n->initially_valid = !n->skip_validation;
  $$ = (Node *)n;
  }
- | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
+ | UNIQUE '(' columnList ')' opt_c_including opt_definition OptConsTableSpace

If we want to use INCLUDE in syntax, then it might be better to keep
the naming reflect the same.  For ex. instead of opt_c_including we
should use opt_c_include.

Thank you for this suggestion. I had just written the code looking at the
examples around, but the optincluding and optcincluding clauses do seem
to be redundant. I've cleaned up the code.

I think you have cleaned only in gram.y as I could see the references
to 'including' in other parts of code.  For ex, see below code:
@@ -2667,6 +2667,7 @@ _copyConstraint(const Constraint *from)
  COPY_NODE_FIELD(raw_expr);
  COPY_STRING_FIELD(cooked_expr);
  COPY_NODE_FIELD(keys);
+ COPY_NODE_FIELD(including);
  COPY_NODE_FIELD(exclusions);
  COPY_NODE_FIELD(options);
  COPY_STRING_FIELD(indexname);
@@ -3187,6 +3188,7 @@ _copyIndexStmt(const IndexStmt *from)
  COPY_STRING_FIELD(accessMethod);
  COPY_STRING_FIELD(tableSpace);
  COPY_NODE_FIELD(indexParams);
+ COPY_NODE_FIELD(indexIncludingParams);


There are a lot of variables like 'including*' in the patch.
Frankly, I don't see a reason to rename them. It's clear that they
refer to included attributes, whatever we call them: "include", "included" or "including".

@@ -425,6 +425,13 @@ ConstructTupleDescriptor(Relation heapRelation,
+ /*
+ * Code below is concerned to the opclasses which are not used
+ * with the included columns.
+ */
+ if (i >= indexInfo->ii_NumIndexKeyAttrs)
+ continue;
+

There seems to be code below the above check which is not directly
related to opclasses, so not sure if you have missed that or is there
any other reason to ignore that.  I am referring to following code in
the same function after the above check:
/*
 * If a key type different from the heap value is specified, update
 * the type-related fields in the index tupdesc.
 */
if (OidIsValid(keyType) &&
    keyType != to->atttypid)

Good point,
I skipped some steps that should be executed for all attributes.
It is harmless, though, since for btree (and the other access methods except hash) amkeytype is always invalid.
But I agree that the code can be clarified.

New patch with minor changes is attached.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
Patch rebased to the current master is in attachments.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Attachment

Re: WIP: Covering + unique indexes.

From
Aleksander Alekseev
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

This patch looks good to me. As I understand it, we have both a complete feature and a consensus in the thread here. If there
are no objections, I'm marking this patch as "Ready for Committer".

The new status of this patch is: Ready for Committer

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
>> -    IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
>> +    IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
> I think your syntax would read no worse, possibly even better, if you
> just used the existing INCLUDING keyword.
There was a discussion in this thread about naming, and both databases which 
support covering indexes use the INCLUDE keyword.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
I had a look at the patch and played with it; it seems fine. I split it into 
two patches: core changes (+bloom index fix) and btree itself. All docs are left 
in the first patch - I'm too lazy to rewrite the documentation which is changed in 
the second patch.
Any objection from reviewers to push both patches?


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: WIP: Covering + unique indexes.

From
Aleksander Alekseev
Date:
Hi Teodor,

> I had a look at the patch and played with it; it seems fine. I split it
> into two patches: core changes (+bloom index fix) and btree itself. All
> docs are left in the first patch - I'm too lazy to rewrite the documentation
> which is changed in the second patch.
> Any objection from reviewers to push both patches?

These patches look OK. Definitely no objections from me.

--
Best regards,
Aleksander Alekseev

Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Thu, Mar 30, 2017 at 11:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I had a look at the patch and played with it; it seems fine. I split it
> into two patches: core changes (+bloom index fix) and btree itself. All
> docs are left in the first patch - I'm too lazy to rewrite the documentation
> which is changed in the second patch.
> Any objection from reviewers to push both patches?

Has this really had enough review and testing?  The last time it was
pushed, it didn't go too well.  And laziness is not a very good excuse
for not dividing up patches properly.

It seems highly surprising to me that CheckIndexCompatible() only gets
a one line change in this patch.  That seems unlikely to be correct.

Has anybody done some testing of this patch with the WAL consistency
checker?  Like, create some tables with indexes that have INCLUDE
columns, set up a standby, enable consistency checking, pound the
master, and see if the standby bails?
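Concretely, I'm thinking of something like this (a sketch; it assumes a standby is attached and wal_consistency_checking is enabled):

```
-- On the master, with wal_consistency_checking = 'btree'
-- (or 'all') in postgresql.conf and a standby replaying the WAL:
CREATE TABLE wal_t (a int, b int, c int);
CREATE UNIQUE INDEX wal_t_idx ON wal_t (a) INCLUDE (b, c);
INSERT INTO wal_t SELECT g, g, g FROM generate_series(1, 100000) g;
UPDATE wal_t SET b = b + 1 WHERE a % 3 = 0;  -- b is in the index, so these are non-HOT updates
DELETE FROM wal_t WHERE a % 10 = 0;
VACUUM wal_t;
-- Then watch the standby's log for "inconsistent page found".
```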

Has anybody tested this patch with amcheck?  Does it break amcheck?
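For the record, exercising it is simple enough (a sketch; bt_index_check() and bt_index_parent_check() are amcheck's btree verification functions):

```
CREATE EXTENSION amcheck;
-- verify an index with INCLUDE columns, e.g. the one from the sketch above
SELECT bt_index_check('wal_t_idx'::regclass);
SELECT bt_index_parent_check('wal_t_idx'::regclass);
```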

A few minor comments:

-    foreach(lc, constraint->keys)
+    else foreach(lc, constraint->keys)

That doesn't look like a reasonable way of formatting the code.

+    /* Here is some code duplication. But we do need it. */

That is not a very informative comment.

+                        * NOTE It is not crutial for reliability in present,

Spelling, punctuation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Andres Freund
Date:
On 2017-03-30 18:26:05 +0300, Teodor Sigaev wrote:
> Any objection from reviewers to push both patches?


> diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> index f2eda67..59029b9 100644
> --- a/contrib/bloom/blutils.c
> +++ b/contrib/bloom/blutils.c
> @@ -120,6 +120,7 @@ blhandler(PG_FUNCTION_ARGS)
>      amroutine->amclusterable = false;
>      amroutine->ampredlocks = false;
>      amroutine->amcanparallel = false;
> +    amroutine->amcaninclude = false;

That name doesn't strike me as very descriptive.


> +      <term><literal>INCLUDE</literal></term>
> +      <listitem>
> +       <para>
> +        An optional <literal>INCLUDE</> clause allows a list of columns to be
> +        specified which will be included in the non-key portion of the index.
> +        Columns which are part of this clause cannot also exist in the
> +        key columns portion of the index, and vice versa. The
> +        <literal>INCLUDE</> columns exist solely to allow more queries to benefit
> +        from <firstterm>index-only scans</> by including certain columns in the
> +        index, the value of which would otherwise have to be obtained by reading
> +        the table's heap. Having these columns in the <literal>INCLUDE</> clause
> +        in some cases allows <productname>PostgreSQL</> to skip the heap read
> +        completely. This also allows <literal>UNIQUE</> indexes to be defined on
> +        one set of columns, which can include another set of columns in the
> +       <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
> +        It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
> +        also can be used for non-unique indexes as any columns which are not required
> +        for the searching or ordering of records can be used in the
> +        <literal>INCLUDE</> clause, which can slightly reduce the size of the index.
> +        Currently, only the B-tree access method supports this feature.
> +        Expressions as included columns are not supported since they cannot be used
> +        in index-only scans.
> +       </para>
> +      </listitem>
> +     </varlistentry>

This could use some polishing.


> +/*
> + * Reform index tuple. Truncate nonkey (INCLUDE) attributes.
> + */
> +IndexTuple
> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
> +{
> +    TupleDesc   itupdesc = RelationGetDescr(idxrel);
> +    Datum       values[INDEX_MAX_KEYS];
> +    bool        isnull[INDEX_MAX_KEYS];
> +    IndexTuple    newitup;
> +    int indnatts = IndexRelationGetNumberOfAttributes(idxrel);
> +    int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(idxrel);
> +
> +    Assert(indnatts <= INDEX_MAX_KEYS);
> +    Assert(indnkeyatts > 0);
> +    Assert(indnkeyatts < indnatts);
> +
> +    index_deform_tuple(olditup, itupdesc, values, isnull);
> +
> +    /* form new tuple that will contain only key attributes */
> +    itupdesc->natts = indnkeyatts;
> +    newitup = index_form_tuple(itupdesc, values, isnull);
> +    newitup->t_tid = olditup->t_tid;
> +
> +    itupdesc->natts = indnatts;

Uh, isn't this a *seriously* bad idea?  If index_form_tuple errors out,
this'll corrupt the tuple descriptor.


Maybe also rename the function to index_build_key_tuple()?

>   * Construct a string describing the contents of an index entry, in the
>   * form "(key_name, ...)=(key_value, ...)".  This is currently used
> - * for building unique-constraint and exclusion-constraint error messages.
> + * for building unique-constraint and exclusion-constraint error messages,
> + * so only key columns of index are checked and printed.

s/index/the index/


> @@ -368,7 +370,7 @@ systable_beginscan(Relation heapRelation,
>          {
>              int            j;
>  
> -            for (j = 0; j < irel->rd_index->indnatts; j++)
> +            for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)

>              {
>                  if (key[i].sk_attno == irel->rd_index->indkey.values[j])
>                  {
> @@ -376,7 +378,7 @@ systable_beginscan(Relation heapRelation,
>                      break;
>                  }
>              }
> -            if (j == irel->rd_index->indnatts)
> +            if (j == IndexRelationGetNumberOfAttributes(irel))
>                  elog(ERROR, "column is not in index");
>          }

Not that it matters overly much, but why are we doing this for all
attributes, rather than just key attributes?


> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -600,7 +600,7 @@ boot_openrel(char *relname)
>           relname, (int) ATTRIBUTE_FIXED_PART_SIZE);
>  
>      boot_reldesc = heap_openrv(makeRangeVar(NULL, relname, -1), NoLock);
> -    numattr = boot_reldesc->rd_rel->relnatts;
> +    numattr = RelationGetNumberOfAttributes(boot_reldesc);
>      for (i = 0; i < numattr; i++)
>      {
>          if (attrtypes[i] == NULL)

That seems a bit unrelated.


> @@ -2086,7 +2086,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
>                                is_validated,
>                                RelationGetRelid(rel),    /* relation */
>                                attNos,    /* attrs in the constraint */
> -                              keycount, /* # attrs in the constraint */
> +                              keycount, /* # key attrs in the constraint */
> +                              keycount, /* # total attrs in the constraint */
>                                InvalidOid,        /* not a domain constraint */
>                                InvalidOid,        /* no associated index */
>                                InvalidOid,        /* Foreign key fields */

It doesn't quite seem right to me to store this both in pg_index and
pg_constraint.



> @@ -340,14 +341,27 @@ DefineIndex(Oid relationId,
>      numberOfAttributes = list_length(stmt->indexParams);
> -    if (numberOfAttributes <= 0)
> -        ereport(ERROR,
> -                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -                 errmsg("must specify at least one column")));
> +

Huh, why's that check gone?

>  
> +opt_c_include:    INCLUDE '(' columnList ')'            { $$ = $3; }
> +             |        /* EMPTY */                        { $$ = NIL; }
> +        ;

> +opt_include:        INCLUDE '(' index_including_params ')'            { $$ = $3; }
> +             |        /* EMPTY */                        { $$ = NIL; }
> +        ;
> +
> +index_including_params:    index_elem                        { $$ = list_make1($1); }
> +            | index_including_params ',' index_elem        { $$ = lappend($1, $3); }
> +        ;
> +

Why do we have multiple different definitions of this?


> @@ -1979,6 +2017,48 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt)
>          index->indexParams = lappend(index->indexParams, iparam);
>      }
>  
> +    /* Here is some code duplication. But we do need it. */

Aha?


> +    foreach(lc, constraint->including)
> +    {
> +        char       *key = strVal(lfirst(lc));
> +        bool        found = false;
> +        ColumnDef  *column = NULL;
> +        ListCell   *columns;
> +        IndexElem  *iparam;
> +
> +        foreach(columns, cxt->columns)
> +        {
> +            column = (ColumnDef *) lfirst(columns);
> +            Assert(IsA(column, ColumnDef));
> +            if (strcmp(column->colname, key) == 0)
> +            {
> +                found = true;
> +                break;
> +            }
> +        }
> +
> +        /*
> +         * In the ALTER TABLE case, don't complain about index keys not
> +         * created in the command; they may well exist already. DefineIndex
> +         * will complain about them if not, and will also take care of marking
> +         * them NOT NULL.
> +         */

Uh. Why should they be marked as NOT NULL? ISTM the comment has been
copied here without adjustments.



> @@ -1275,6 +1275,21 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
>          Oid            keycoltype;
>          Oid            keycolcollation;
>  
> +        /*
> +         * attrsOnly flag is used for building unique-constraint and
> +         * exclusion-constraint error messages. Included attrs are
> +         * meaningless there, so do not include them in the message.
> +         */
> +        if (attrsOnly && keyno >= idxrec->indnkeyatts)
> +            break;

Sounds like the parameter should be renamed then.



> +Included attributes in B-tree indexes
> +-------------------------------------
> +
> +Since 10.0 there is an optional INCLUDE clause, that allows to add

10.0 isn't right, since that's the "patch" version now.


> +a portion of non-key attributes to index. They exist to allow more queries
> +to benefit from index-only scans. We never use included attributes in
> +ScanKeys, neither for search nor for inserts. That allows us to include
> +into B-tree any datatypes, even those which don't have suitable opclass.
> +Included columns only stored in regular items on leaf pages. All inner
> +keys and high keys are truncated and contain only key attributes.
> +That helps to reduce the size of index.

s/index/the index/



> @@ -537,6 +542,28 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>          ItemIdSetUnused(ii);    /* redundant */
>          ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
>  
> +        if (indnkeyatts != indnatts && P_ISLEAF(opageop))
> +        {
> +            /*
> +             * It's essential to truncate High key here.
> +             * The purpose is not just to save more space on this particular page,
> +             * but to keep whole b-tree structure consistent. Subsequent insertions
> +             * assume that hikey is already truncated, and so they should not
> +             * worry about it, when copying the high key into the parent page
> +             * as a downlink.

s/should/need/

> +             * NOTE It is not crutial for reliability in present,

s/crutial/crucial/

> +             * but maybe it will be that in the future.
> +             */

"it's essential" ... "it is not crutial" -- that's contradictory.

> +            keytup = index_truncate_tuple(wstate->index, oitup);

The code in _bt_split previously claimed that it's the only place doing
truncation...


> +            /*  delete "wrong" high key, insert keytup as P_HIKEY. */
> +            PageIndexTupleDelete(opage, P_HIKEY);

> +            if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
> +                elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
> +                    RelationGetRelationName(wstate->index));

Hm...


- Andres



Re: WIP: Covering + unique indexes.

From
Aleksander Alekseev
Date:
Hi Robert,

> Has anybody done some testing of this patch with the WAL consistency
> checker?  Like, create some tables with indexes that have INCLUDE
> columns, set up a standby, enable consistency checking, pound the
> master, and see if the standby bails?

I've decided to run such a test. It looks like there is a bug indeed.

Steps to reproduce:

0. Apply a patch.
1. Build PostgreSQL using quick-build.sh [1]
2. Install master and replica using install.sh [2]
3. Download test.sql [3]
4. Run: `cat test.sql | psql`
5. In replica's logfile:

```
FATAL:  inconsistent page found, rel 1663/16384/16396, forknum 0, blkno 1
```

> Has anybody tested this patch with amcheck?  Does it break amcheck?

Amcheck doesn't complain.

[1] https://github.com/afiskon/pgscripts/blob/master/quick-build.sh
[2] https://github.com/afiskon/pgscripts/blob/master/install.sh
[3] http://afiskon.ru/s/88/93c544e6cf_test.sql

--
Best regards,
Aleksander Alekseev

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
30.03.2017 19:49, Robert Haas:
> On Thu, Mar 30, 2017 at 11:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> I had a look on patch and played with it, seems, it looks fine. I splitted
>> it to two patches: core changes (+bloom index fix) and btree itself. All
>> docs are left in first patch - I'm too lazy to rewrite documentation which
>> is changed in second patch.
>> Any objection from reviewers to push both patches?
> Has this really had enough review and testing?  The last time it was
> pushed, it didn't go too well.  And laziness is not a very good excuse
> for not dividing up patches properly.

Well,
I don't know how we can estimate the quality of the review or testing.
The patch was reviewed by many people.
Here are those who marked themselves as reviewers on this and previous 
commitfests: Stephen Frost (sfrost), Andrew Dunstan (adunstan), 
Aleksander Alekseev (a.alekseev), Amit Kapila (amitkapila), Andrey 
Borodin (x4m), Peter Geoghegan (pgeoghegan), David Rowley (davidrowley).

That looks serious enough to me. These people, as well as many others, 
shared their thoughts on this topic and pointed out various mistakes.
I fixed all the issues as soon as I could. And I'm not going to 
disappear once it is committed. Personally, I always thought that 
we have Alpha and Beta releases for integration testing.

Speaking of the feature itself, it has been included in our fork of 
PostgreSQL 9.6 since it was released.
And as far as I know, there have been no complaints from users. That makes me 
believe that there are no critical bugs there,
though there may be conflicts with some other features of v10.0.

> It seems highly surprising to me that CheckIndexCompatible() only gets
> a one line change in this patch.  That seems unlikely to be correct.
What makes you think so? CheckIndexCompatible() only cares about 
possible opclasses' changes.
For covering indexes opclasses are only applicable to indnkeyatts. And 
that is exactly what was changed in this patch.
Do you think it needs some other changes?

> Has anybody done some testing of this patch with the WAL consistency
> checker?  Like, create some tables with indexes that have INCLUDE
> columns, set up a standby, enable consistency checking, pound the
> master, and see if the standby bails?
Good point. I missed this feature; I wish someone had mentioned this issue a 
bit earlier.
And as Aleksander's test shows, there is indeed a problem with my patch.
I'll fix it and send an updated patch.

> Has anybody tested this patch with amcheck?  Does it break amcheck?
Yes, it breaks amcheck. Amcheck should be patched in order to work with 
covering indexes.
We discussed it with Peter before, and I even wrote a small patch.
I'll attach it in the following message.

> A few minor comments:
>
> -    foreach(lc, constraint->keys)
> +    else foreach(lc, constraint->keys)
>
> That doesn't look like a reasonable way of formatting the code.
>
> +    /* Here is some code duplication. But we do need it. */
>
> That is not a very informative comment.
>
> +                        * NOTE It is not crutial for reliability in present,
>
> Spelling, punctuation.
>

Will be fixed as well.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Thu, Mar 30, 2017 at 5:22 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Well,
> I don't know how we can estimate the quality of the review or testing.
> The patch was reviewed by many people.
> Here are those who marked themselves as reviewers on this and previous
> commitfests: Stephen Frost (sfrost), Andrew Dunstan (adunstan), Aleksander
> Alekseev (a.alekseev), Amit Kapila (amitkapila), Andrey Borodin (x4m), Peter
> Geoghegan (pgeoghegan), David Rowley (davidrowley).

Sure, but the amount of in-depth review seems to have been limited.
Just because somebody put their name down in the CommitFest
application doesn't mean that they did a detailed review of all the
code.

>> It seems highly surprising to me that CheckIndexCompatible() only gets
>> a one line change in this patch.  That seems unlikely to be correct.
>
> What makes you think so? CheckIndexCompatible() only cares about possible
> opclasses' changes.
> For covering indexes opclasses are only applicable to indnkeyatts. And that
> is exactly what was changed in this patch.
> Do you think it needs some other changes?

Probably.  I mean, for an INCLUDE column, it wouldn't matter if a
collation or opclass change happened, but if the base data type had
changed, you'd still need to rebuild the index.  So presumably
CheckIndexCompatible() ought to be comparing some things, but not
everything, for INCLUDE columns.
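To make that concrete, a sketch (hypothetical objects; which of these preserves the index is exactly what CheckIndexCompatible() has to decide):

```
CREATE TABLE ct2 (k int, v varchar(10));
CREATE UNIQUE INDEX ct2_k ON ct2 (k) INCLUDE (v);

-- Widening a varchar is binary compatible, so the covering index could
-- be kept; for the INCLUDE column the opclass/collation checks shouldn't
-- matter either way.
ALTER TABLE ct2 ALTER COLUMN v TYPE varchar(20);

-- Changing the base data type must force a rebuild regardless.
ALTER TABLE ct2 ALTER COLUMN v TYPE numeric USING v::numeric;
```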

>> Has anybody tested this patch with amcheck?  Does it break amcheck?
>
> Yes, it breaks amcheck. Amcheck should be patched in order to work with
> covering indexes.
> We've discussed it with Peter before and I even wrote small patch.
> I'll attach it in the following message.

Great.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
30.03.2017 22:11, Andres Freund
Any objection from reviewers to push both patches?

First of all, I want to thank you and Robert for reviewing this patch.
Your expertise in postgres subsystems is really necessary for features like this.
I just wonder why you didn't share your thoughts and doubts until the "last call".

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f2eda67..59029b9 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -120,6 +120,7 @@ blhandler(PG_FUNCTION_ARGS)
	amroutine->amclusterable = false;
	amroutine->ampredlocks = false;
	amroutine->amcanparallel = false;
+	amroutine->amcaninclude = false;
That name doesn't strike me as very descriptive.

The feature is "index with included columns", and it uses the keyword "INCLUDE".
So the name looks good to me.
Any suggestions?
+      <term><literal>INCLUDE</literal></term>
+      <listitem>
+       <para>
+        An optional <literal>INCLUDE</> clause allows a list of columns to be
+        specified which will be included in the non-key portion of the index.
+        Columns which are part of this clause cannot also exist in the
+        key columns portion of the index, and vice versa. The
+        <literal>INCLUDE</> columns exist solely to allow more queries to benefit
+        from <firstterm>index-only scans</> by including certain columns in the
+        index, the value of which would otherwise have to be obtained by reading
+        the table's heap. Having these columns in the <literal>INCLUDE</> clause
+        in some cases allows <productname>PostgreSQL</> to skip the heap read
+        completely. This also allows <literal>UNIQUE</> indexes to be defined on
+        one set of columns, which can include another set of columns in the
+       <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
+        It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+        also can be used for non-unique indexes as any columns which are not required
+        for the searching or ordering of records can be used in the
+        <literal>INCLUDE</> clause, which can slightly reduce the size of the index.
+        Currently, only the B-tree access method supports this feature.
+        Expressions as included columns are not supported since they cannot be used
+        in index-only scans.
+       </para>
+      </listitem>
+     </varlistentry>
This could use some polishing.
Definitely. But do you have any specific proposals?

+/*
+ * Reform index tuple. Truncate nonkey (INCLUDE) attributes.
+ */
+IndexTuple
+index_truncate_tuple(Relation idxrel, IndexTuple olditup)
+{
+	TupleDesc   itupdesc = RelationGetDescr(idxrel);
+	Datum       values[INDEX_MAX_KEYS];
+	bool        isnull[INDEX_MAX_KEYS];
+	IndexTuple	newitup;
+	int indnatts = IndexRelationGetNumberOfAttributes(idxrel);
+	int indnkeyatts = IndexRelationGetNumberOfKeyAttributes(idxrel);
+
+	Assert(indnatts <= INDEX_MAX_KEYS);
+	Assert(indnkeyatts > 0);
+	Assert(indnkeyatts < indnatts);
+
+	index_deform_tuple(olditup, itupdesc, values, isnull);
+
+	/* form new tuple that will contain only key attributes */
+	itupdesc->natts = indnkeyatts;
+	newitup = index_form_tuple(itupdesc, values, isnull);
+	newitup->t_tid = olditup->t_tid;
+
+	itupdesc->natts = indnatts;
Uh, isn't this a *seriously* bad idea?  If index_form_tuple errors out,
this'll corrupt the tuple descriptor.
Initial reasoning was something like this:
> Maybe it would be better to modify index_form_tuple() to accept a new
> argument with a number of attributes, then you can just Assert that
> this number is never higher than the number of attributes in the
> TupleDesc.
Good point.
I agree that this function is a bit strange. I have to set 
tupdesc->nattrs to support compatibility with index_form_tuple().
I didn't want to add either a new field to tupledesc or a new 
parameter to index_form_tuple(), because they are used widely.

But I hadn't considered the possibility of index_form_tuple() failure.
Fixed in this version of the patch: now it creates a copy of the tupledesc to pass to index_form_tuple().

Maybe also rename the function to index_build_key_tuple()?
We discussed it with other reviewers; they suggested index_truncate_tuple() instead of index_reform_tuple().
I think that this name reflects the essence of the function clearly enough and don't feel like renaming it again.

@@ -368,7 +370,7 @@ systable_beginscan(Relation heapRelation,
		{
			int			j;

-			for (j = 0; j < irel->rd_index->indnatts; j++)
+			for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)
			{
				if (key[i].sk_attno == irel->rd_index->indkey.values[j])
				{
@@ -376,7 +378,7 @@ systable_beginscan(Relation heapRelation,
					break;
				}
			}
-			if (j == irel->rd_index->indnatts)
+			if (j == IndexRelationGetNumberOfAttributes(irel))
				elog(ERROR, "column is not in index");
		}
Not that it matters overly much, but why are we doing this for all
attributes, rather than just key attributes?
Since we don't use included columns for system indexes, there is no difference. I've just tried to minimize code changes here.
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -600,7 +600,7 @@ boot_openrel(char *relname)
		 relname, (int) ATTRIBUTE_FIXED_PART_SIZE);

	boot_reldesc = heap_openrv(makeRangeVar(NULL, relname, -1), NoLock);
-	numattr = boot_reldesc->rd_rel->relnatts;
+	numattr = RelationGetNumberOfAttributes(boot_reldesc);
	for (i = 0; i < numattr; i++)
	{
		if (attrtypes[i] == NULL)
That seems a bit unrelated.
I've replaced all the references to relnatts with the macro, primarily to ensure that I wouldn't miss anything that should use only key attributes.
@@ -2086,7 +2086,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
							  is_validated,
							  RelationGetRelid(rel),	/* relation */
							  attNos,	/* attrs in the constraint */
-							  keycount, /* # attrs in the constraint */
+							  keycount, /* # key attrs in the constraint */
+							  keycount, /* # total attrs in the constraint */
							  InvalidOid,		/* not a domain constraint */
							  InvalidOid,		/* no associated index */
							  InvalidOid,		/* Foreign key fields */
It doesn't quite seem right to me to store this both in pg_index and
pg_constraint.

Initially, I did it to provide pg_get_constraintdef_worker() with info about included columns.
Maybe it can be solved in some other way, but for now it is a tested and working implementation.

 
+opt_c_include:	INCLUDE '(' columnList ')'			{ $$ = $3; }
+			 |		/* EMPTY */						{ $$ = NIL; }
+		;
+opt_include:		INCLUDE '(' index_including_params ')'			{ $$ = $3; }
+			 |		/* EMPTY */						{ $$ = NIL; }
+		;
+
+index_including_params:	index_elem						{ $$ = list_make1($1); }
+			| index_including_params ',' index_elem		{ $$ = lappend($1, $3); }
+		;
+
Why do we have multiple different definitions of this?

Hm,
columnList contains entries of columnElem type and index_including_params works with index_elem.
Is there a way they can be combined?

+			keytup = index_truncate_tuple(wstate->index, oitup);
The code in _bt_split previously claimed that it's the only place doing
truncation...
To be exact, it claimed that regarding the insertion of new values, not index builds.
> It's the only point in insertion process, where we perform truncation


Other comments about code format, spelling and comments are fixed in the attached patches.
Thank you again for reviewing.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
Andres Freund
Date:
On 2017-03-31 20:40:59 +0300, Anastasia Lubennikova wrote:
> 30.03.2017 22:11, Andres Freund
> > Any objection from reviewers to push both patches?
> 
> First of all, I want to thank you and Robert for reviewing this patch.
> Your expertise in postgres subsystems is really necessary for features like
> this.
> I just wonder why you didn't share your thoughts and doubts until the "last
> call".

Because there's a lot of other patches?  I only looked because Teodor
announced he was thinking about committing - I just don't have the
energy to look at all patches before they're ready to commit.
Unfortunately "ready-for-committer" is very frequently not actually that
:(


> > Maybe it would be better to modify index_form_tuple() to accept a new
> > argument with a number of attributes, then you can just Assert that
> > this number is never higher than the number of attributes in the
> > TupleDesc.
> Good point.
> I agree that this function is a bit strange. I have to set
> tupdesc->nattrs to support compatibility with index_form_tuple().
> I didn't want to add either a new field to tupledesc or a new
> parameter to index_form_tuple(), because they are used widely.
> 
> 
> But I hadn't considered the possibility of index_form_tuple() failure.
> Fixed in this version of the patch: now it creates a copy of the tupledesc
> to pass to index_form_tuple().

That seems like it'd actually be a noticeable increase in memory
allocator overhead.  I think we should just add (as just proposed in a
separate thread) an _extended version of it that allows specifying the
number of columns.

- Andres



Re: WIP: Covering + unique indexes.

From
Robert Haas
Date:
On Fri, Mar 31, 2017 at 1:40 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> First of all, I want to thank you and Robert for reviewing this patch.
> Your expertise in postgres subsystems is really necessary for features like
> this.
> I just wonder why you didn't share your thoughts and doubts until the "last
> call".

I haven't done any significant technical work other than review
patches in 14 months, and in the last several months I've often worked
10 and 12 hour days to get more review done.

I think at one level you've got a fair complaint here - it's hard to
get things committed, and this patch probably didn't get as much
attention as it deserved.  It's not so easy to know how to fix that.
I'm pretty sure "tell Andres and Robert to work harder" isn't it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
31.03.2017 20:47, Andres Freund:
Maybe it would be better to modify index_form_tuple() to accept a new
argument with a number of attributes, then you can just Assert that
this number is never higher than the number of attributes in the
TupleDesc.
Good point.
I agree that this function is a bit strange. I have to set
tupdesc->nattrs to support compatibility with index_form_tuple().
I didn't want to add either a new field to tupledesc or a new
parameter to index_form_tuple(), because they are used widely.


But I hadn't considered the possibility of index_form_tuple() failure.
Fixed in this version of the patch: now it creates a copy of the tupledesc to
pass to index_form_tuple().
That seems like it'd actually be a noticeable increase in memory
allocator overhead.  I think we should just add (as just proposed in a
separate thread) an _extended version of it that allows specifying the
number of columns.

The function is not called that often, only once per page split for indexes with included columns.
That doesn't look like dramatic overhead, so I decided that a wrapper function would be more appropriate than refactoring all the index_form_tuple() calls.
But index_form_tuple_extended() looks like a better solution.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
>
> Other comments about code format, spelling and comments are fixed in 
> the attached patches.

One more version; I missed the parse_utilcmd.c comment cleanup in the previous 
0001 patch.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
31.03.2017 20:57, Robert Haas:
> On Fri, Mar 31, 2017 at 1:40 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> First of all, I want to thank you and Robert for reviewing this patch.
>> Your expertise in postgres subsystems is really necessary for features like
>> this.
>> I just wonder why you didn't share your thoughts and doubts until the "last
>> call".
> I haven't done any significant technical work other than review
> patches in 14 months, and in the last several months I've often worked
> 10 and 12 hour days to get more review done.
>
> I think at one level you've got a fair complaint here - it's hard to
> get things committed, and this patch probably didn't get as much
> attention as it deserved.  It's not so easy to know how to fix that.
> I'm pretty sure "tell Andres and Robert to work harder" isn't it.
>

*off-topic*
No complaints from me; I understand how difficult reviewing is and 
highly appreciate your work.
The problem is that not all developers are qualified enough to do a review.
I've tried to make a course about postgres internals, something like 
"Deep dive into the postgres codebase for hackers",
and it turned out to be really helpful for new developers. So I wonder 
whether we could write some tips for new reviewers and testers as well.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
I had a quick look at this on the flight back from PGConf.US.

On Fri, Mar 31, 2017 at 10:40 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> But I haven't considered the possibility of index_form_tuple() failure.
> Fixed in this version of the patch. Now it creates a copy of tupledesc to
> pass it to index_form_tuple.

I think that we need to be 100% sure that index_truncate_tuple() will
not generate an IndexTuple that is larger than the original.
Otherwise, you could violate the "1/3 of page size exceeded" thing. We
need to catch that when the user actually inserts an oversized value.
After that, it's always too late. (See my remarks to Tom on other
thread about this, too.)
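
To make that concrete, here's a minimal sketch of the enforcement that
truncation must not defeat (the table name is mine; incompressible text is
used so that TOAST compression doesn't mask the limit):

    CREATE TABLE oversize_test (t text);
    CREATE INDEX oversize_idx ON oversize_test (t);

    -- ~3200 bytes of incompressible text: expected to be rejected at
    -- insert time with an "index row size ... exceeds maximum" error,
    -- because a btree entry may not exceed roughly 1/3 of a page.
    INSERT INTO oversize_test
    SELECT string_agg(md5(random()::text), '') FROM generate_series(1, 100);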

> We'd discussed with other reviewers, they suggested index_truncate_tuple()
> instead of index_reform_tuple().
> I think that this name reflects the essence of the function clear enough and
> don't feel like renaming it again.

+1.

Feedback so far:

* index_truncate_tuple() should have as an argument the number of
attributes. No need to "#include utils/rel.h" that way.

* I think that we should store this (the number of attributes), and
use it directly when comparing, per my remarks to Tom over on that
other thread. We should also use the free bit within
IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
just to make it clear to everyone that might care that that's how
these truncated IndexTuples need to be represented.

Doing this would have no real impact on your patch, because for you
this will be 100% redundant. It will help external tools, and perhaps
another, more general suffix truncation patch that comes in the
future. We should try very hard to have a future-proof on-disk
representation. I think that this is quite possible.

* I suggest adding a "can't happen" defensive check + error that
checks that the tuple returned by index_truncate_tuple() is sized <=
the original. This cannot be allowed to ever happen. (Not that I think
it will.)

* I see a small bug. You forgot to teach _bt_findsplitloc() about
truncation. It does this currently, which you did not update:
   /*
    * The first item on the right page becomes the high key of the left page;
    * therefore it counts against left space as well as right space.
    */
   leftfree -= firstrightitemsz;

I think that this accounting needs to be fixed.

* Not sure about one thing. What's the reason for this change?

> -       /* Log left page */
> -       if (!isleaf)
> -       {
> -           /*
> -            * We must also log the left page's high key, because the right
> -            * page's leftmost key is suppressed on non-leaf levels.  Show it
> -            * as belonging to the left page buffer, so that it is not stored
> -            * if XLogInsert decides it needs a full-page image of the left
> -            * page.
> -            */
> -           itemid = PageGetItemId(origpage, P_HIKEY);
> -           item = (IndexTuple) PageGetItem(origpage, itemid);
> -           XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -       }
> +       /*
> +        * We must also log the left page's high key, because the right
> +        * page's leftmost key is suppressed on non-leaf levels.  Show it
> +        * as belonging to the left page buffer, so that it is not stored
> +        * if XLogInsert decides it needs a full-page image of the left
> +        * page.
> +        */
> +       itemid = PageGetItemId(origpage, P_HIKEY);
> +       item = (IndexTuple) PageGetItem(origpage, itemid);
> +       XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

Is this related to the problem that you mentioned to me that you'd
fixed when we spoke in person earlier today? You said something about
WAL logging, but I don't recall any details. I don't remember seeing
this change in prior versions.

Anyway, whatever the reason for doing this on the leaf level now, the
comments should be updated to explain it.

* Speaking of WAL-logging, I think that there is another bug in
btree_xlog_split(). You didn't change this existing code at all:
   /*
    * On leaf level, the high key of the left page is equal to the first key
    * on the right page.
    */
   if (isleaf)
   {
       ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

       left_hikey = PageGetItem(rpage, hiItemId);
       left_hikeysz = ItemIdGetLength(hiItemId);
   }

It seems like this was missed when you changed WAL-logging, since you
do something for this on the logging side, but not here, on the replay
side. No?

That's all I have for now. Maybe I can look again later, or tomorrow.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Mar 31, 2017 at 4:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> That's all I have for now. Maybe I can look again later, or tomorrow.

I took another look, this time at code used during CREATE INDEX. More feedback:

* I see no reason to expose _bt_pgaddtup() (to modify it to not be
static, so it can be called during CREATE INDEX for truncated high
key). You could call PageAddItem() directly, just as _bt_pgaddtup()
itself does, and lose nothing. This is the case because the special
steps within _bt_pgaddtup() are only when inserting the first real
item (and only on an internal page). You're only ever using
_bt_pgaddtup() for the high key offset. Would a raw PageAddItem() call
lose anything?

I think I see why you've done this -- the existing CREATE INDEX
_bt_sortaddtup() routine (which is very similar to _bt_pgaddtup(), a
routine used for *insertion*) doesn't do the correct thing were you to
use it, because it assumes that the page is always right most (i.e.,
always has no high key yet).

The reason _bt_sortaddtup() exists is explained here:
 * This is almost like nbtinsert.c's _bt_pgaddtup(), but we can't use
 * that because it assumes that P_RIGHTMOST() will return the correct
 * answer for the page.  Here, we don't know yet if the page will be
 * rightmost.  Offset P_FIRSTKEY is always the first data key.
 */
static void
_bt_sortaddtup(Page page,
               Size itemsize,
               IndexTuple itup,
               OffsetNumber itup_off)
{
    ...
}

(...thinks some more...)

So, this difference only matters when you have a non-leaf item, which
is never subject to truncation in your patch. So, in fact, it doesn't
matter at all. I guess you should just use _bt_pgaddtup() after all,
rather than bothering with a raw PageAddItem(), even. But, don't
forget to note why this is okay above _bt_sortaddtup().

* Calling PageIndexTupleDelete() within _bt_buildadd(), which
memmove()s all other items on the leaf page, seems wasteful in the
context of CREATE INDEX. Can we do better?

* I also think that calling PageIndexTupleDelete() has a page space
accounting bug, because the following thing happens a second time for
highkey ItemId when new code does this call:

phdr->pd_lower -= sizeof(ItemIdData);

(The first time this happens is within _bt_buildadd() itself, just
before your patch calls PageIndexTupleDelete().)

* I don't think it's okay to let index_truncate_tuple() leak memory
within _bt_buildadd(). It's probably okay for nbtinsert.c callers to
index_truncate_tuple() to not be too careful, though, since those
calls occur in a per-tuple memory context. The same cannot be said for
_bt_buildadd()/CREATE INDEX calls.

* Speaking of memory management: is this really needed?

> @@ -554,7 +580,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>          * Save a copy of the minimum key for the new page.  We have to copy
>          * it off the old page, not the new one, in case we are not at leaf
>          * level.
> +        * Despite oitup is already initialized, it's important to get high
> +        * key from the page, since we could have replaced it with truncated
> +        * copy. See comment above.
>          */
> +       oitup = (IndexTuple) PageGetItem(opage,PageGetItemId(opage, P_HIKEY));
>         state->btps_minkey = CopyIndexTuple(oitup);

You didn't modify/truncate oitup in-place -- you effectively made a
(truncated) copy by calling index_truncate_tuple(). Maybe you can
manage the memory by assigning keytup to state->btps_minkey, in place
of a CopyIndexTuple(), just for the truncation case?

I haven't studied this in enough detail to be sure that that would be
correct, but it seems clear that a better strategy is needed for
managing memory within _bt_buildadd().

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Thu, Mar 30, 2017 at 8:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Any objection from reviewers to push both patches?

I object.

Unfortunately, it seems very unlikely that we'll be able to get the
patch into shape in the allotted time before feature-freeze, even with
the 1 week extension.

-- 
Peter Geoghegan



Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
01.04.2017 02:31, Peter Geoghegan:
>
> * index_truncate_tuple() should have as an argument the number of
> attributes. No need to "#include utils/rel.h" that way.
Will fix.
>
> * I think that we should store this (the number of attributes), and
> use it directly when comparing, per my remarks to Tom over on that
> other thread. We should also use the free bit within
> IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
> just to make it clear to everyone that might care that that's how
> these truncated IndexTuples need to be represented.
>
> Doing this would have no real impact on your patch, because for you
> this will be 100% redundant. It will help external tools, and perhaps
> another, more general suffix truncation patch that comes in the
> future. We should try very hard to have a future-proof on-disk
> representation. I think that this is quite possible.
To be honest, I think that it'll make the patch overcomplicated, because 
this exact patch has nothing to do with suffix truncation.
Although, we can add any necessary flags if this work is continued 
in the future.
> * I suggest adding a "can't happen" defensive check + error that
> checks that the tuple returned by index_truncate_tuple() is sized <=
> the original. This cannot be allowed to ever happen. (Not that I think
> it will.)
There is already an assertion:
    Assert(IndexTupleSize(newitup) <= IndexTupleSize(olditup));
Do you think it is not enough?
> * I see a small bug. You forgot to teach _bt_findsplitloc() about
> truncation. It does this currently, which you did not update:
>
>      /*
>       * The first item on the right page becomes the high key of the left page;
>       * therefore it counts against left space as well as right space.
>       */
>      leftfree -= firstrightitemsz;
>
> I think that this accounting needs to be fixed.
Could you explain what's wrong with this accounting? We may expect to 
take more space on the left page than will actually be taken after high key 
truncation, but I don't see any problem here.

> * Note sure about one thing. What's the reason for this change?
>
>> -       /* Log left page */
>> -       if (!isleaf)
>> -       {
>> -           /*
>> -            * We must also log the left page's high key, because the right
>> -            * page's leftmost key is suppressed on non-leaf levels.  Show it
>> -            * as belonging to the left page buffer, so that it is not stored
>> -            * if XLogInsert decides it needs a full-page image of the left
>> -            * page.
>> -            */
>> -           itemid = PageGetItemId(origpage, P_HIKEY);
>> -           item = (IndexTuple) PageGetItem(origpage, itemid);
>> -           XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
>> -       }
>> +       /*
>> +        * We must also log the left page's high key, because the right
>> +        * page's leftmost key is suppressed on non-leaf levels.  Show it
>> +        * as belonging to the left page buffer, so that it is not stored
>> +        * if XLogInsert decides it needs a full-page image of the left
>> +        * page.
>> +        */
>> +       itemid = PageGetItemId(origpage, P_HIKEY);
>> +       item = (IndexTuple) PageGetItem(origpage, itemid);
>> +       XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> Is this related to the problem that you mentioned to me that you'd
> fixed when we spoke in person earlier today? You said something about
> WAL logging, but I don't recall any details. I don't remember seeing
> this change in prior versions.
>
> Anyway, whatever the reason for doing this on the leaf level now, the
> comments should be updated to explain it.
This change is related to the bug described in this message:
https://www.postgresql.org/message-id/20170330192706.GA2565%40e733.localdomain
After the fix it is not reproducible. I will update the comments in the next patch.
> * Speaking of WAL-logging, I think that there is another bug in
> btree_xlog_split(). You didn't change this existing code at all:
>
>      /*
>       * On leaf level, the high key of the left page is equal to the first key
>       * on the right page.
>       */
>      if (isleaf)
>      {
>          ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
>
>          left_hikey = PageGetItem(rpage, hiItemId);
>          left_hikeysz = ItemIdGetLength(hiItemId);
>      }
>
> It seems like this was missed when you changed WAL-logging, since you
> do something for this on the logging side, but not here, on the replay
> side. No?
>
I changed it. Now we always use the high key saved in the xlog record.
This code isn't needed anymore and can be deleted. Thank you for the 
notice. I will send an updated patch today.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 4, 2017 at 3:07 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> * I think that we should store this (the number of attributes), and
>> use it directly when comparing, per my remarks to Tom over on that
>> other thread. We should also use the free bit within
>> IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
>> just to make it clear to everyone that might care that that's how
>> these truncated IndexTuples need to be represented.
>>
>> Doing this would have no real impact on your patch, because for you
>> this will be 100% redundant. It will help external tools, and perhaps
>> another, more general suffix truncation patch that comes in the
>> future. We should try very hard to have a future-proof on-disk
>> representation. I think that this is quite possible.
>
> To be honest, I think that it'll make the patch overcomplicated, because this
> exact patch has nothing to do with suffix truncation.
> Although, we can add any necessary flags if this work is continued in
> the future.

Yes, doing things that way would mean adding a bit more complexity to
your patch, but IMV it would be worth it to have the on-disk format be
compatible with what a full suffix truncation patch will eventually
require.

Obviously I disagree with what you say here -- I think that your patch
*does* have plenty in common with suffix truncation. But, you don't
have to even agree with me on that to see why what I propose is still
a good idea. Tom Lane had a specific objection to this patch --
catalog metadata is currently necessary to interpret internal page
IndexTuples [1]. However, by storing the true number of columns in the
case of truncated tuples, we can make the situation with IndexTuples
similar enough to the existing situation with heap tuples, where the
number of attributes is available right in the header as "natts". We
don't have to rely on something like catalog metadata from a great
distance, where some caller may forget to pass through the metadata to
a lower level.

So, presumably doing things this way addresses Tom's exact objection
to the truncation aspect of this patch [2]. We have the capacity to
store something like natts "for free" -- let's use it. The lack of any
direct source of metadata was called "dangerous". As much as anything
else, I want to remove any danger.

> There is already an assertion:
>     Assert(IndexTupleSize(newitup) <= IndexTupleSize(olditup));
> Do you think it is not enough?

I think that a "can't happen" check will work better in the future,
when user defined code could be involved in truncation. Any extra
overhead will be paid relatively infrequently, and will be very low.

>> * I see a small bug. You forgot to teach _bt_findsplitloc() about
>> truncation. It does this currently, which you did not update:
>>
>>      /*
>>       * The first item on the right page becomes the high key of the left
>> page;
>>       * therefore it counts against left space as well as right space.
>>       */
>>      leftfree -= firstrightitemsz;
>>
>> I think that this accounting needs to be fixed.
>
> Could you explain what's wrong with this accounting? We may expect to take
> more space on the left page than will actually be taken after high key
> truncation, but I don't see any problem here.

Obviously it would at least be slightly better to have the actual
truncated high key size where that's expected -- not the would-be
untruncated high key size. The code as it stands might lead to a bad
choice of split point in edge cases.

At the very least, you should change comments to note the issue. I
think it's highly unlikely that this could ever result in a failure to
find a split point, which there are many defenses against already, but
I think I would find that difficult to prove. The intent of the code
is almost as important as the code, at least in my opinion.

[1] postgr.es/m/CAH2-Wz=VMDH8pFAZX9WAH9Bn5Ast5vrnA0xSz+GsfRs12bp_sg@mail.gmail.com
[2] postgr.es/m/11895.1490983884%40sss.pgh.pa.us
-- 
Peter Geoghegan



Re: [HACKERS] WIP: Covering + unique indexes.

From
David Steele
Date:
On 4/4/17 2:47 PM, Peter Geoghegan wrote:
> 
> At the very least, you should change comments to note the issue. I
> think it's highly unlikely that this could ever result in a failure to
> find a split point, which there are many defenses against already, but
> I think I would find that difficult to prove. The intent of the code
> is almost as important as the code, at least in my opinion.

This submission has been Returned with Feedback.  Please feel free to
resubmit to a future commitfest.

-- 
-David
david@pgmasters.net



Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
Brief reminder of the idea behind the patch:

Use case:
- We have a table (c1, c2, c3, c4);
- We need to have a unique index on (c1, c2).
- We would like to have a covering index on all columns to avoid reading heap pages.

Old way:
CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2);
CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4);

What's wrong?
Two indexes contain repeated data. Overhead to data manipulation operations and database size.

New way:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDE (c3, c4);

To find more about the syntax you can read related documentation patches and also take a look
at the new test - src/test/regress/sql/index_including.sql.
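
For illustration, a self-contained sketch of the semantics (the sample
rows are mine; the syntax matches the example above):

    CREATE TABLE newt (c1 int, c2 int, c3 int, c4 int);
    CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDE (c3, c4);

    INSERT INTO newt VALUES (1, 1, 10, 10);  -- ok
    INSERT INTO newt VALUES (1, 2, 10, 10);  -- ok: (c1, c2) differs
    INSERT INTO newt VALUES (1, 1, 20, 20);  -- fails: uniqueness is enforced
                                             -- on (c1, c2) only; the INCLUDE
                                             -- columns do not participate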

Updated version is attached. It applies to the commit e4fbf22831c2bbcf032ee60a327b871d2364b3f5.
The first patch contains changes in general index routines,
and the second one contains btree-specific changes.

This version contains fixes for the issues mentioned in the thread above and passes all existing tests.
But it still requires review and testing, because the merge was not easy.
I especially worry about integration with partitioning. I'll add some more tests in the next message.
-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi!
+1 for pushing this. I'm really looking forward to seeing this in 11.

> 31 Oct 2017, 13:21, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
> Updated version is attached. It applies to the commit e4fbf22831c2bbcf032ee60a327b871d2364b3f5.
> The first patch contains changes in general index routines
> and the second one contains btree-specific changes.
>
> This version contains fixes for the issues mentioned in the thread above and passes all existing tests.
> But it still requires review and testing, because the merge was not easy.
> I especially worry about integration with partitioning. I'll add some more tests in the next message.

I did benchmark tests a year ago, so I skip this part in this review.

I've done some stress tests with pgbench, replication, etc. Everything was fine until I plugged in amcheck.
If I create a cluster with this [0] and then
./pgbench -i -s 50

create index on pgbench_accounts (abalance) include (bid,aid,filler);
create extension amcheck;

--and finally
SELECT bt_index_check(c.oid), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
AND c.relpersistence != 't'
AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 100;
--just copypasted from amcheck docs with minor corrections

Postgres crashes:
TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)

Maybe I'm doing something wrong, or amcheck support will go in a different patch?

Few minor nitpicks:
0. PgAdmin fails to understand what is going on [1]. It is clearly a problem of PgAdmin; pg_dump works as expected.
1. ISTM index_truncate_tuple() can be optimized. We only need to reset the tuple length and infomask. But this should not
be a hot path anyway, so I propose ignoring this for the current version.
2. I've done Grammarly checking :) This comma seems redundant [2]
I don't think any of these items require fixing.

Thanks for working on this, I believe it is important.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/pgscripts/blob/master/install.sh
[1] https://yadi.sk/i/ro9YKFqo3PcwFT
[2]
https://github.com/x4m/postgres_g/commit/657c28952d923d8c150e6cabb3bdcbbc44a641b6?diff=unified#diff-640baf2937029728a8d51cccd554c2eeR1291


Re: [HACKERS] WIP: Covering + unique indexes.

From
Michael Paquier
Date:
On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Postgres crashes:
> TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>
> Maybe I'm doing something wrong, or amcheck support will go in a different patch?

Usually amcheck complaining is a sign of other problems. I am marking
this patch as returned with feedback for now as no updates have been
provided after two weeks.
-- 
Michael


Re: [HACKERS] WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Nov 28, 2017 at 6:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Postgres crashes:
>> TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>>
>> Maybe I'm doing something wrong, or amcheck support will go in a different patch?
>
> Usually amcheck complaining is a sign of other problems. I am marking
> this patch as returned with feedback for now as no updates have been
> provided after two weeks.

It looks like amcheck needs to be patched -- a simple oversight.
amcheck is probably calling _bt_compare() without realizing that
internal pages don't have the extra attributes (just leaf pages,
although they should also not participate in comparisons in respect of
included/extra columns). There were changes to amcheck at one point in
the past. That must have slipped through again. I don't think it's
that complicated.

BTW, it would probably be a good idea to use the new Github version's
"heapallindexed" verification [1] for testing this patch. Anastasia
will need to patch the externally maintained amcheck to do this, but
it's probably no extra work, because this is already needed for
contrib/amcheck, and because the heapallindexed check doesn't actually
care about index structure at all.
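
Invoking it is cheap to script, by the way (a sketch, assuming the
two-argument bt_index_check() signature of the Github version; the index
name is illustrative):

    CREATE EXTENSION amcheck_next;
    -- heapallindexed = true additionally verifies that every heap tuple
    -- has a matching tuple in the index
    SELECT bt_index_check('newidx'::regclass, true);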

[1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification
-- 
Peter Geoghegan


Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
> 29 Nov 2017, 8:45, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Nov 28, 2017 at 6:16 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>> Postgres crashes:
>>> TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>>>
>>> Maybe I'm doing something wrong, or amcheck support will go in a different patch?
>>
>> Usually amcheck complaining is a sign of other problems. I am marking
>> this patch as returned with feedback for now as no updates have been
>> provided after two weeks.
>
> It looks like amcheck needs to be patched -- a simple oversight.
> amcheck is probably calling _bt_compare() without realizing that
> internal pages don't have the extra attributes (just leaf pages,
> although they should also not participate in comparisons in respect of
> included/extra columns). There were changes to amcheck at one point in
> the past. That must have slipped through again. I don't think it's
> that complicated.
>


There is no doubt that this will be fixed. Therefore I propose moving it to the next CF with Waiting on Author status.

Best regards, Andrey Borodin.

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi, Peter!
> 29 Nov 2017, 8:45, Peter Geoghegan <pg@bowt.ie> wrote:
>
> It looks like amcheck needs to be patched -- a simple oversight.
> amcheck is probably calling _bt_compare() without realizing that
> internal pages don't have the extra attributes (just leaf pages,
> although they should also not participate in comparisons in respect of
> included/extra columns). There were changes to amcheck at one point in
> the past. That must have slipped through again. I don't think it's
> that complicated.
>
> BTW, it would probably be a good idea to use the new Github version's
> "heapallindexed" verification [1] for testing this patch. Anastasia
> will need to patch the externally maintained amcheck to do this, but
> it's probably no extra work, because this is already needed for
> contrib/amcheck, and because the heapallindexed check doesn't actually
> care about index structure at all.

Seems like it was not a big deal to patch; I've fixed those bits (see attachment).
I've done only simple tests for now, but I'm planning to do more thorough testing before the next CF.
Thanks for mentioning "heapallindexed", I'll use it too.

Best regards, Andrey Borodin.

Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
> 30 Nov 2017, 23:07, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Seems like it was not a big deal to patch; I've fixed those bits (see attachment).
> I've done only simple tests for now, but I'm planning to do more thorough testing before the next CF.
> Thanks for mentioning "heapallindexed", I'll use it too.

I've tested the patch with the fixed amcheck (including the "heapallindexed" feature); tests included bulk index creation,
pgbenching, and amcheck of the index itself and of the WAL-replicated index.
Everything worked fine.

Spotted one more typo:
> Since 10.0 there is an optional INCLUDE clause
should be
> Since 11.0 there is an optional INCLUDE clause

I think the patch set (two patches + 1 amcheck diff) is ready for committer.

Best regards, Andrey Borodin.

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hello!

The patch does not apply currently.
Anastasia, can you please rebase the patch?

Best regards, Andrey Borodin.


Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
Updated patches are attached.

Thank you for your interest in this patch and sorry for the slow reply.


08.01.2018 21:08, Andrey Borodin wrote:
> Hello!
>
> The patch does not apply currently.
> Anastasia, can you, please, rebase the patch?
>
> Best regards, Andrey Borodin.
>

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi!
> 16 Jan 2018, 21:50, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
> Updated patches are attached.
>
Cool, thanks!

I've looked into the code, but haven't found anything broken.
Since I tried to rebase the patch myself and failed on the parse utils, I've spent some cycles trying to break parsing.
One minor complaint (no need to fix).
This is fine
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,upper(filler));
ERROR:  expressions are not supported in included columns
But why not the same error here? The previous message is very descriptive.
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,aid+1);
ERROR:  syntax error at or near "+"
This works, but should not, IMHO
x4mmm=# create index on pgbench_accounts (bid) include (aid,aid,aid);
CREATE INDEX
Don't know what this is...
# create index on pgbench_accounts (bid) include (aid desc, aid asc);
CREATE INDEX

All these things allow foot-shooting with a small caliber, but do not break big things.

Unfortunately, amcheck_next does not currently work on HEAD (there are problems with the AllocSetContextCreate()
signature), but I've tested bt_index_check() before, during and after pgbench, on primary and on slave. Also, I've
checked bt_index_parent_check() on master.

During bt_index_check()  test from time to time I was observing
ERROR:  canceling statement due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.

[install]check[-world] passed :)

From my POV, the patch is in good shape.
I think it is time to make the patch ready for committer again.

Best regards, Andrey Borodin.

Re: [HACKERS] WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
17.01.2018 11:45, Andrey Borodin:
> Hi!
>> 16 Jan 2018, 21:50, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>
>> Updated patches are attached.
>>
> Cool, thanks!
>
> I've looked into the code, but haven't found anything broken.
> Since I tried to rebase the patch myself and failed on the parse utils, I've spent some cycles trying to break parsing.
> One minor complaint (no need to fix).
> This is fine
> x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,upper(filler));
> ERROR:  expressions are not supported in included columns
> But why not the same error here? The previous message is very descriptive.
> x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,aid+1);
> ERROR:  syntax error at or near "+"
> This works, but should not, IMHO
> x4mmm=# create index on pgbench_accounts (bid) include (aid,aid,aid);
> CREATE INDEX
> Don't know what this is...
> # create index on pgbench_accounts (bid) include (aid desc, aid asc);
> CREATE INDEX
>
> All these things allow foot-shooting with a small caliber, but do not break big things.
>
> Unfortunately, amcheck_next does not currently work on HEAD (there are problems with the AllocSetContextCreate()
> signature), but I've tested bt_index_check() before, during and after pgbench, on primary and on slave. Also, I've
> checked bt_index_parent_check() on master.

What is amcheck_next?
> During bt_index_check()  test from time to time I was observing
> ERROR:  canceling statement due to conflict with recovery
> DETAIL:  User query might have needed to see row versions that must be removed.
>

Sorry, I forgot to attach the amcheck fix to the previous message.
Now all the patches are in the attachment.
Could you recheck if the error is still there?



-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi!
> 18 Jan 2018, 18:57, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>
> What is amcheck_next?

amcheck_next is the external version of amcheck, maintained by Peter G. on his GitHub. It checks one more thing: that every heap tuple has a twin in the B-tree, the so-called heapallindexed check.
Version V3 of your patch was checked with heapallindexed and passed the test, both on master and on slave.

>> During bt_index_check() test from time to time I was observing
>> ERROR:  canceling statement due to conflict with recovery
>> DETAIL:  User query might have needed to see row versions that must be removed.
>
> Sorry, I forgot to attach the amcheck fix to the previous message.

No problem, surely I've fixed that before testing.

> Now all the patches are in the attachment.
> Could you recheck if the error is still there?

No need to do that; I was checking exactly the same codebase.
And that error has nothing to do with your patch; amcheck cannot always perform bt_index_parent_check() on a slave when the master is heavily loaded. It's OK. I reported this error just to be 100% precise about observed things.

Thanks for working on this feature, hope to see it in 11.

Best regards, Andrey Borodin.

Re: [HACKERS] WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Unfortunately, amcheck_next does not currently work on HEAD (there are problems with the AllocSetContextCreate()
> signature), but I've tested bt_index_check() before, during and after pgbench, on primary and on slave. Also, I've
> checked bt_index_parent_check() on master.

I fixed that recently. It should be fine now.

--
Peter Geoghegan


Re: [HACKERS] WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
> 21 Jan 2018, 3:36, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Unfortunately, amcheck_next does not currently work on HEAD (there are problems with the AllocSetContextCreate()
>> signature), but I've tested bt_index_check() before, during and after pgbench, on primary and on slave. Also, I've
>> checked bt_index_parent_check() on master.
>
> I fixed that recently. It should be fine now.
Oh, sorry, I missed that I was using a stale patched amcheck_next. Thanks!
Affirmative, amcheck_next works fine.

I ran pgbench against several covering indexes, checking before load, during and after, both on master and slave.
I do not observe any errors besides the infrequent "canceling statement due to conflict with recovery", which is not a sign
of any malfunction.

Best regards, Andrey Borodin.

Re: WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
I feel sorry for the noise, switching this patch back and forth. But the patch needs a rebase again. It still applies
with -3, but does not compile anymore.
 

Best regards, Andrey Borodin.

The new status of this patch is: Waiting on Author

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
Thanks for the reminder. Rebased patches are attached.


21.01.2018 17:45, Andrey Borodin wrote:
> I feel sorry for the noise, switching this patch back and forth. But the patch needs a rebase again. It still applies
> with -3, but does not compile anymore.
>
> Best regards, Andrey Borodin.
>
> The new status of this patch is: Waiting on Author

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Thomas Munro
Date:
On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Thanks for the reminder. Rebased patches are attached.

This is a really cool and also difficult feature.  Thanks for working
on it!  Here are a couple of quick comments on the documentation,
since I noticed it doesn't build:

SGML->XML change: (1) empty closing tags "</>" are no longer accepted,
(2) <xref ...> now needs to be written <xref .../> and (3) xref IDs
are now case-sensitive.

+  PRIMARY KEY ( <replaceable
class="parameter">column_name</replaceable> [, ... ] ) <replaceable
class="parameter">index_parameters</replaceable> <optional>INCLUDE
(<replaceable class="parameter">column_name</replaceable> [,
...])</optional> |

I hadn't seen that use of "<optional>" before.  Almost everywhere else
we use explicit [ and ] characters, but I see that there are other
examples, and it is rendered as [ and ] in the output.  OK, cool, but
I think there should be some extra whitespace so that it comes out as:

  [ INCLUDE ... ]

instead of:

  [INCLUDE ...]

to fit with the existing convention.

+        ... This also allows <literal>UNIQUE</> indexes to be defined on
+        one set of columns, which can include another set of columns in the
+       <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
+        It's the same with other constraints (PRIMARY KEY and
EXCLUDE). This can
+        also can be used for non-unique indexes as any columns which
are not required
+        for the searching or ordering of records can be used in the
+        <literal>INCLUDE</> clause, which can slightly reduce the
size of the index.

Can I suggest rewording these three sentences a bit?  Just an idea:

<literal>UNIQUE</literal> indexes, <literal>PRIMARY KEY</literal>
constraints and <literal>EXCLUDE</literal> constraints can be defined
with extra columns in an <literal>INCLUDE</literal> clause, in which
case uniqueness is not enforced for the extra columns.  Moving columns
that are not needed for searching, ordering or uniqueness into the
<literal>INCLUDE</literal> clause can sometimes reduce the size of the
index while retaining the possibility of using a faster index-only
scan.
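
To make that last sentence concrete, a quick sketch with illustrative names:

    CREATE TABLE t (c1 int, c2 int, c3 text);
    INSERT INTO t SELECT g, g, 'payload' FROM generate_series(1, 10000) g;
    CREATE UNIQUE INDEX t_idx ON t (c1, c2) INCLUDE (c3);
    VACUUM t;  -- set visibility-map bits so an index-only scan is possible

    -- c3 comes straight from the index, with no heap fetch:
    EXPLAIN (COSTS OFF) SELECT c1, c2, c3 FROM t WHERE c1 = 42;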

-- 
Thomas Munro
http://www.enterprisedb.com


Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
26.01.2018 07:19, Thomas Munro:
> On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thanks for the reminder. Rebased patches are attached.
> This is a really cool and also difficult feature.  Thanks for working
> on it!  Here are a couple of quick comments on the documentation,
> since I noticed it doesn't build:
>
> SGML->XML change: (1) empty closing tags "</>" are no longer accepted,
> (2) <xref ...> now needs to be written <xref .../> and (3) xref IDs
> are now case-sensitive.
>
> +  PRIMARY KEY ( <replaceable
> class="parameter">column_name</replaceable> [, ... ] ) <replaceable
> class="parameter">index_parameters</replaceable> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [,
> ...])</optional> |
>
> I hadn't seen that use of "<optional>" before.  Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.  OK, cool, but
> I think there should be some extra whitespace so that it comes out as:
>
>    [ INCLUDE ... ]
>
> instead of:
>
>    [INCLUDE ...]
>
> to fit with the existing convention.
>
> +        ... This also allows <literal>UNIQUE</> indexes to be defined on
> +        one set of columns, which can include another set of columns in the
> +       <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
> +        It's the same with other constraints (PRIMARY KEY and
> EXCLUDE). This can
> +        also can be used for non-unique indexes as any columns which
> are not required
> +        for the searching or ordering of records can be used in the
> +        <literal>INCLUDE</> clause, which can slightly reduce the
> size of the index.

Thank you for reviewing. All mentioned issues are fixed.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Thomas Munro
Date:
On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Thank you for reviewing. All mentioned issues are fixed.

== Applying patch 0002-covering-btree_v4.patch...
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/README.rej
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/nbtxlog.c.rej

Can we please have a new patch set?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
06.03.2018 11:52, Thomas Munro:
> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thank you for reviewing. All mentioned issues are fixed.
> == Applying patch 0002-covering-btree_v4.patch...
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/backend/access/nbtree/README.rej
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/backend/access/nbtree/nbtxlog.c.rej
>
> Can we please have a new patch set?

Here it is.
Many thanks to Andrey Borodin.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 06.03.2018 11:52, Thomas Munro:
>> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>> Thank you for reviewing. All mentioned issues are fixed.
>> == Applying patch 0002-covering-btree_v4.patch...
>> 1 out of 1 hunk FAILED -- saving rejects to file
>> src/backend/access/nbtree/README.rej
>> 1 out of 1 hunk FAILED -- saving rejects to file
>> src/backend/access/nbtree/nbtxlog.c.rej
>>
>> Can we please have a new patch set?
>
> Here it is.
> Many thanks to Andrey Borodin.

I took a look at this patchset.  I have some notes about it.

* I see the patch changes the dblink, amcheck and tcn contribs.  It would be nice to add corresponding
checks to the dblink and amcheck regression tests.  It would be good to do the same with the tcn contrib,
but tcn doesn't have regression tests at all, and it's out of scope of this patch to add regression
tests to tcn.  So it's OK to just check that it's working correctly with covering indexes (I hope that's
already been done by other reviewers).

* I think that the subscription regression tests in src/test/subscription should make some use
of covering indexes.  Logical decoding and subscription make heavy use of primary keys,
so they need to be tested to work correctly with covering indexes.

* I also think some isolation tests in src/test/isolation need to check covering indexes too.
In particular insert-conflict-*.spec and lock-*.spec and probably more.

*  pg_dump doesn't handle old PostgreSQL versions correctly.  If I try to dump a database
of PostgreSQL 9.6, pg_dump gives me the following error:

pg_dump: [archiver (db)] query failed: ERROR:  column i.indnkeyatts does not exist
LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya...
                                                             ^

In fact, there is a sequence of "if" ... "else if" blocks in getIndexes() which selects
the appropriate query depending on the remote server version.  For pre-11 servers we should
use indnatts instead of indnkeyatts.
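
Something like this version split, that is (illustrative queries, not the
actual pg_dump SQL):

    -- remote server version >= 11: indnkeyatts exists
    SELECT i.indexrelid::regclass AS index, i.indnatts, i.indnkeyatts
    FROM pg_index i;

    -- remote server version < 11: no indnkeyatts; every index column is a
    -- key column, so indnatts stands in for it
    SELECT i.indexrelid::regclass AS index, i.indnatts, i.indnatts AS indnkeyatts
    FROM pg_index i;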

* There is a minor formatting issue in this part of code.  Some spaces need to be replaced with tabs.
+IndexTuple
+index_truncate_tuple(Relation idxrel, IndexTuple olditup)
+{
+ TupleDesc   itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
+ Datum       values[INDEX_MAX_KEYS];
+ bool        isnull[INDEX_MAX_KEYS];
+ IndexTuple newitup;

* I think this comment needs to be rephrased.
+ /*
+  * Code below is concerned to the opclasses which are not used
+  * with the included columns.
+  */
I would write something like this: "Code below checks opclass key type.  Included columns
don't have opclasses, and this check is not required for them.".  Native English speakers
could provide even better phrasing though.

* I would also like all the patches in a patchset version to have the same version number.
I understand that "Covering-btree" and "Covering-amcheck" have fewer previous
versions than "Covering-core".  But it's way easier to identify patches belonging to
the same patchset version if they have the same version number.  For sure, some
patches would then skip some version numbers, but that doesn't seem to be a problem for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Wed, Mar 21, 2018 at 9:51 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

I have a few more notes regarding this patchset.

* indnkeyatts is described in the documentation, but I think that the description of indnatts
should also be updated to clarify that indnatts counts "included" columns.

+      <row>
+      <entry><structfield>indnkeyatts</structfield></entry>
+      <entry><type>int2</type></entry>
+      <entry></entry>
+      <entry>The number of key columns in the index. "Key columns" are ordinary
+      index columns (as opposed to "included" columns).</entry>
+     </row>
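
For example, for an index defined ON (c1, c2) INCLUDE (c3, c4), the
expected relationship between the two columns would be (sketch; the index
name is illustrative):

    SELECT indnatts, indnkeyatts
    FROM pg_index
    WHERE indexrelid = 'newidx'::regclass;
    -- indnatts    = 4  (all stored columns: key + included)
    -- indnkeyatts = 2  (key columns only)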

*  It seems like this paragraph appears in the patchset without any mention
in the thread.

+Notes to Operator Class Implementors
+------------------------------------

While I really appreciate it, it seems to be unrelated to covering indexes.
I'd like this to be extracted into a separate patch and committed separately.

* There is a typo here: brtee -> btree
+ * 7. Check various AMs. All but brtee must fail.

* This comment should be updated, given that we now put the left page's
high key into the WAL regardless of whether it's a leaf page split.

+ /*
+ * We must also log the left page's high key, because the right
+ * page's leftmost key is suppressed on non-leaf levels.  Show it
+ * as belonging to the left page buffer, so that it is not stored
+ * if XLogInsert decides it needs a full-page image of the left
+ * page.
+ */

* get_index_def() is adjusted to support covering indexes.  I think this support
deserves to be checked in regression tests.

* In PostgreSQL sentences are sometimes divided by single spacing, sometimes
by double spacing.  I think we should follow the general rule here: code should
look like its surroundings.  Could you please recheck that throughout the patch?
I noticed that, especially in the documentation, you frequently use single spacing while
the surroundings use double spacing.

The rest of it looks OK to me for now.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * There is minor formatting issue in this part of code.  Some spaces need
>> to be replaced with tabs.
>> +IndexTuple
>> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
>> +{
>> + TupleDesc   itupdesc =
>> CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
>> + Datum       values[INDEX_MAX_KEYS];
>> + bool        isnull[INDEX_MAX_KEYS];
>> + IndexTuple newitup;

The last time I looked at this patch, in April 2017, I made the point
that we should add something like an "nattributes" argument to
index_truncate_tuple() [1], rather than always using
IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
I can see that that change hasn't been made since that time.

With that approach, we can avoid relying on catalog metadata to the
same degree, which was a specific concern that Tom had around that
time. It's easy to do something with t_tid's offset, which is unused
in internal page IndexTuples. We do very similar things in GIN, where
alternative use of an IndexTuple's t_tid supports all kinds of
enhancements, some of which were not originally anticipated. Alexander
surely knows more about this than I do, since he wrote that code.

Having this index_truncate_tuple() "nattributes" argument, and storing
the number of attributes directly improves quite a lot of things:

* It makes diagnosing issues in the field quite a bit easier. Tools
like pg_filedump can do the right thing (Tom mentioned pg_filedump and
amcheck specifically). The nbtree IndexTuple format should not need to
be interpreted in a context-sensitive way, if we can avoid it.

* It lets you use index_truncate_tuple() for regular suffix truncation
in the future. These INCLUDE IndexTuples are really just a special
case of suffix truncation. At least, they should be, because otherwise
an eventual suffix truncation feature is going to be incompatible with
the INCLUDE tuple format. *Not* doing this makes suffix truncation
harder. Suffix truncation is a classic technique, first described by
Bayer in 1977, and we are very probably going to add it someday.

* Once you can tell a truncated IndexTuple from a non-truncated one
with little or no context, you can add defensive assertions in various
places where they're helpful. Similarly, amcheck can use and expect
this as a cross-check against IndexRelationGetNumberOfKeyAttributes().
This will increase confidence in the design, both initially and over
time.

I must say that I am disappointed that nothing has happened here,
especially because this really wasn't very much additional work, and
has essentially no downside. I can see that it doesn't work that way
in the Postgres Pro fork [2], and diverging from that may
inconvenience Postgres Pro, but that's a downside of forking. I don't
think that the community should have to absorb that cost.

> +Notes to Operator Class Implementors
> +------------------------------------
>
> While I really appreciate it, it seems to be unrelated to covering
> indexes.
> I'd like this to be extracted into a separate patch and committed
> separately.

Commit 3785f7ee, from last month, moved the original "Notes to
Operator Class Implementors" section to the SGML docs. It looks like
that README section was accidentally reintroduced during rebasing. The
new information ("Included attributes in B-tree indexes") should be
moved over to the new section of the user docs -- the section that
3785f7ee added.

[1] https://postgr.es/m/CAH2-Wzm9y59h2m6iZjM4fpdUP5r4bsRVzGbN2gTRCO1j4nZmtw@mail.gmail.com
[2] https://github.com/postgrespro/postgrespro/blob/PGPRO9_5/src/backend/access/common/indextuple.c#L451
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Sat, Mar 24, 2018 at 5:21 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> * There is minor formatting issue in this part of code.  Some spaces need
>> to be replaced with tabs.
>> +IndexTuple
>> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
>> +{
>> + TupleDesc   itupdesc =
>> CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
>> + Datum       values[INDEX_MAX_KEYS];
>> + bool        isnull[INDEX_MAX_KEYS];
>> + IndexTuple newitup;

> The last time I looked at this patch, in April 2017, I made the point
> that we should add something like an "nattributes" argument to
> index_truncate_tuple() [1], rather than always using
> IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
> I can see that that change hasn't been made since that time.
+1, making "nattributes" an argument of index_truncate_tuple() would
make this function way more universal.

> With that approach, we can avoid relying on catalog metadata to the
> same degree, which was a specific concern that Tom had around that
> time. It's easy to do something with t_tid's offset, which is unused
> in internal page IndexTuples. We do very similar things in GIN, where
> alternative use of an IndexTuple's t_tid supports all kinds of
> enhancements, some of which were not originally anticipated. Alexander
> surely knows more about this than I do, since he wrote that code.
Originally that code was written by Teodor, but I also had my hands in it.
In the GIN entry tree, item pointers are stored in a posting list which is located
after the index tuple attributes.  So both the t_tid block number and offset are
used for GIN's needs.

> Having this index_truncate_tuple() "nattributes" argument, and storing
> the number of attributes directly improves quite a lot of things:
>
> * It makes diagnosing issues in the field quite a bit easier. Tools
> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
> amcheck specifically). The nbtree IndexTuple format should not need to
> be interpreted in a context-sensitive way, if we can avoid it.
>
> * It lets you use index_truncate_tuple() for regular suffix truncation
> in the future. These INCLUDE IndexTuples are really just a special
> case of suffix truncation. At least, they should be, because otherwise
> an eventual suffix truncation feature is going to be incompatible with
> the INCLUDE tuple format. *Not* doing this makes suffix truncation
> harder. Suffix truncation is a classic technique, first described by
> Bayer in 1977, and we are very probably going to add it someday.
>
> * Once you can tell a truncated IndexTuple from a non-truncated one
> with little or no context, you can add defensive assertions in various
> places where they're helpful. Similarly, amcheck can use and expect
> this as a cross-check against IndexRelationGetNumberOfKeyAttributes().
> This will increase confidence in the design, both initially and over
> time.
That makes sense.  Let's store the number of tuple attributes in t_tid.
Assuming that our INDEX_MAX_KEYS is quite a small number, we will have
the higher bits of t_tid free for later use.

> I must say that I am disappointed that nothing has happened here,
> especially because this really wasn't very much additional work, and
> has essentially no downside. I can see that it doesn't work that way
> in the Postgres Pro fork [2], and diverging from that may
> inconvenience Postgres Pro, but that's a downside of forking. I don't
> think that the community should have to absorb that cost.
Sure, the community shouldn't have to care about the Postgres Pro fork.  If we find
that something is better done differently, then let's do it that way.

> +Notes to Operator Class Implementors
> +------------------------------------
>
> While I really appreciate it, it seems to be unrelated to the covering
> indexes.
> I'd like this to be extracted into a separate patch and committed
> separately.

Commit 3785f7ee, from last month, moved the original "Notes to
Operator Class Implementors" section to the SGML docs. It looks like
that README section was accidentally reintroduced during rebasing. The
new information ("Included attributes in B-tree indexes") should be
moved over to the new section of the user docs -- the section that
3785f7ee added.

Thank you for noticing that.  I'd overlooked it.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sat, Mar 24, 2018 at 12:39 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> +1, adding an "nattributes" argument to index_truncate_tuple() would
> make this function much more universal.

Great.

> Originally that code was written by Teodor, but I had my hands in it too.
> In the GIN entry tree, item pointers are stored in a posting list located
> after the index tuple attributes.  So, both the t_tid block number and the
> offset are used for GIN's needs.

Well, you worked on the posting list compression stuff, at least.  :-)

> That makes sense.  Let's store the number of tuple attributes in t_tid.
> Since INDEX_MAX_KEYS is quite a small number, we will have the higher
> bits of t_tid free for later use.

I was going to say that you could just treat the low bit in the t_tid
offset as representing "see catalog entry". My first idea was that
nothing would have to change about the existing format, since internal
page items already have only the low bit set within their offset.
However, I now see that that won't really work, because we don't
change the offset in high keys when they're copied from a real item
during a page split. Whatever we do, it has to work equally well for
all "separator keys" -- that is, it must work for both downlinks in
internal pages, and all high keys (including high keys at the leaf
level).

A good solution is to use the unused 13th t_bit. If hash can have a
INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
This avoids a BTREE_VERSION bump, and allows us to deal with the
highkey offset issue. Actually, it's even more flexible than that --
it can work with ordinary leaf tuples in the future, too. That is, we
can eventually implement prefix truncation and deduplication at the
leaf level using this representation, since there is nothing that
limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".

The main difference between this approach to leaf prefix
truncation/compression/deduplication, and the GIN entry tree's posting
list representation would be that it wouldn't have to be
super-optimized for duplicates, at the expense of the more common case for
regular nbtree indexes -- having few or no duplicates. A btree_gin
index on pgbench_accounts(aid) looks very similar to an equivalent
nbtree index if you just compare internal pages from each, but they
look quite different at the leaf level, where GIN has 24 byte
IndexTuples instead of 16 byte IndexTuples. Of course, this is
because the leaf pages have posting lists that can never be simple
heap pointer TIDs.

A secondary goal of this INDEX_ALT_TID_MASK representation should be
that it won't even be necessary to know that an IndexTuple is
contained within a leaf page rather than an internal page (again, unlike
GIN). I'm pretty confident that we can have a truly universal
IndexTuple representation for nbtree, while supporting all of these
standard optimizations.

Sorry for going off on a tangent, but I think it's somewhat necessary
to have a strategy here. Of course, we don't have to get everything
right now, but we should be looking in this direction whenever we talk
about on-disk nbtree changes.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Sun, Mar 25, 2018 at 1:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
I was going to say that you could just treat the low bit in the t_tid
offset as representing "see catalog entry". My first idea was that
nothing would have to change about the existing format, since internal
page items already have only the low bit set within their offset.
However, I now see that that won't really work, because we don't
change the offset in high keys when they're copied from a real item
during a page split. Whatever we do, it has to work equally well for
all "separator keys" -- that is, it must work for both downlinks in
internal pages, and all high keys (including high keys at the leaf
level).

OK.
 
A good solution is to use the unused 13th t_bit. If hash can have a
INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
This avoids a BTREE_VERSION bump, and allows us to deal with the
highkey offset issue. Actually, it's even more flexible than that --
it can work with ordinary leaf tuples in the future, too. That is, we
can eventually implement prefix truncation and deduplication at the
leaf level using this representation, since there is nothing that
limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".

The main difference between this approach to leaf prefix
truncation/compression/deduplication, and the GIN entry tree's posting
list representation would be that it wouldn't have to be
super-optimized for duplicates, at the expense of the more common case for
regular nbtree indexes -- having few or no duplicates. A btree_gin
index on pgbench_accounts(aid) looks very similar to an equivalent
nbtree index if you just compare internal pages from each, but they
look quite different at the leaf level, where GIN has 24 byte
IndexTuples instead of 16 byte IndexTuples. Of course, this is
because the leaf pages have posting lists that can never be simple
heap pointer TIDs.
 
Right, btree_gin is much smaller than a regular btree when there are a lot
of duplicates.  When there are no duplicates, btree_gin becomes larger
than a regular btree, because GIN stores a single item pointer less compactly
than btree does.

A secondary goal of this INDEX_ALT_TID_MASK representation should be
that it won't even be necessary to know that an IndexTuple is
contained within a leaf page rather than an internal page (again, unlike
GIN). I'm pretty confident that we can have a truly universal
IndexTuple representation for nbtree, while supporting all of these
standard optimizations.

Sorry for going off on a tangent, but I think it's somewhat necessary
to have a strategy here. Of course, we don't have to get everything
right now, but we should be looking in this direction whenever we talk
about on-disk nbtree changes.

So, as I understand it, you're proposing to introduce an INDEX_ALT_TID_MASK
flag which would indicate that we're storing something special in the t_tid
offset.  And that should help us not only with covering indexes, but also with
further btree enhancements, including suffix truncation.  What exactly do
you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK flag
is set?  Is it the number of attributes in this particular index tuple?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Re: WIP: Covering + unique indexes.

From
David Steele
Date:
On 3/26/18 6:10 AM, Alexander Korotkov wrote:
> 
> So, as I understand it, you're proposing to introduce an INDEX_ALT_TID_MASK
> flag which would indicate that we're storing something special in the t_tid
> offset.  And that should help us not only with covering indexes, but also for
> further btree enhancements, including suffix truncation.  What exactly do
> you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK flag
> is set?  Is it the number of attributes in this particular index tuple?

It appears that discussion and review of this patch is ongoing so it
should not be marked Ready for Committer.  I have changed it to Waiting
on Author since there are several pending reviews and at least one bug.

Regards,
-- 
-David
david@pgmasters.net


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Mon, Mar 26, 2018 at 3:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> So, as I understand it, you're proposing to introduce an INDEX_ALT_TID_MASK
> flag which would indicate that we're storing something special in the t_tid
> offset.  And that should help us not only with covering indexes, but also for
> further btree enhancements, including suffix truncation.  What exactly do
> you propose to store in the t_tid offset when the INDEX_ALT_TID_MASK flag
> is set?  Is it the number of attributes in this particular index tuple?

Yes. I think that once INDEX_ALT_TID_MASK is available, we should
store the number of attributes in that particular "separator key"
tuple (which has undergone suffix truncation), and always work off of
that. You could then have status bits in offset as follows:

* 1 bit that represents that this is a "separator key" IndexTuple
(high key or internal IndexTuple). Otherwise, it's a leaf IndexTuple
with an ordinary heap TID. (When INDEX_ALT_TID_MASK isn't set, it's
the same as today.)

* 3 reserved bits. I think that one of these bits can eventually be
used to indicate that the internal IndexTuple actually has a
"normalized key" representation [1], which seems like the best way to
do suffix truncation, long term. I think that we should support simple
suffix truncation, of the kind that this patch implements, alongside
normalized key suffix truncation. We need both for various reasons
[2].

Not sure what the other two flag bits might be used for, but they seem
worth having.

* 12 bits for the number of attributes, which should be more than
enough, even when INDEX_MAX_KEYS is significantly higher than 32. A
static assertion can keep this safe when INDEX_MAX_KEYS is set
ridiculously high.
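
Spelled out as a rough sketch, the layout could look like the following
(every name below is invented purely for illustration; none of this is
settled):

    #include "postgres.h"
    #include "access/itup.h"

    #define INDEX_ALT_TID_MASK      0x2000  /* the unused 13th t_info bit */
    #define BT_PIVOT_TUPLE          0x8000  /* offset bit: "separator key" */
    #define BT_RESERVED_OFFSET_BITS 0x7000  /* 3 offset bits held in reserve */
    #define BT_N_ATTS_MASK          0x0FFF  /* 12 offset bits: natts */

    /*
     * Sketch: return the number of attributes stored in a separator key
     * tuple, or -1 when the tuple uses the conventional representation
     * and the caller must fall back on catalog metadata.
     */
    static inline int
    bt_tuple_get_natts(IndexTuple itup)
    {
        OffsetNumber offset;

        if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
            return -1;          /* ordinary tuple; t_tid is a heap TID */

        offset = ItemPointerGetOffsetNumberNoCheck(&itup->t_tid);
        /* for now, only separator keys use the alternative format */
        Assert(offset & BT_PIVOT_TUPLE);
        return offset & BT_N_ATTS_MASK;
    }

A static assertion that INDEX_MAX_KEYS fits in BT_N_ATTS_MASK would live
next to these definitions.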

I think that this scheme is future-proof. Maybe you have additional
ideas on the representation. Please let me know what you think.

When we eventually add optimizations that affect IndexTuples on the
leaf level, we can start using the block number (bi_hi + bi_lo)
itself, much like GIN posting lists. No need to further consider that
(the leaf level optimizations) today, because using block number
provides us with many more bits.

In internal page items, the block number is always a block number, so
internal IndexTuples are rather like GIN posting tree pointers in the
main entry tree (its leaf level) -- a conventional item pointer block
number is used, alongside unconventional use of the offset field,
where there are 16 bits available because no real offset is required.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Optimizations_enabled_by_key_normalization
[2] https://wiki.postgresql.org/wiki/Key_normalization#How_big_can_normalized_keys_get.2C_and_is_it_worth_it.3F
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> The last time I looked at this patch, in April 2017, I made the point
> that we should add something like an "nattributes" argument to
> index_truncate_tuple() [1], rather than always using
> IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
Agreed, it looks logical because a) reading the code will be simpler, and b) the
function will be usable for any future purpose.

> Having this index_truncate_tuple() "nattributes" argument, and storing
> the number of attributes directly improves quite a lot of things:

Storing the number of attributes in the now-unused t_tid seems to me not such a good idea.
a) it could (and, I suppose, should) be a separate patch; at least it's not directly
connected to the covering patch, and it could be added even before the covering patch.
b) I don't like the idea of limiting usage of that field if we can avoid it.
Future work could use it, for example, for different compression techniques or
something else.

> 
> * It makes diagnosing issues in the field quite a bit easier. Tools
> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
> amcheck specifically). The nbtree IndexTuple format should not need to
> be interpreted in a context-sensitive way, if we can avoid it.
Both pg_filedump and amcheck can correctly parse any tuple based on the BTP_LEAF
flag and the length of the tuple.
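
For what it's worth, the context-sensitive interpretation in question is
roughly the following sketch (the IndexRelationGetNumberOf*Attributes()
macros are the ones the covering patch introduces; the high key handling is
my reading of the patch):

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "utils/rel.h"

    /* How many attributes a tuple is expected to have, from page context */
    static int
    expected_natts(Relation rel, Page page, OffsetNumber offnum)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* ordinary leaf tuples carry key columns plus included columns */
        if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
            return IndexRelationGetNumberOfAttributes(rel);

        /* high keys and internal items are truncated to key columns */
        return IndexRelationGetNumberOfKeyAttributes(rel);
    }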

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> b) I don't like the idea of limiting usage of that field if we can avoid it.
> Future work could use it, for example, for different compression techniques or
> something else.

Or even remove t_tid from inner tuples to save 2 bytes in IndexTupleData. Of
course, I remember about alignment, but that could be subject to change in the
future too.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Mar 27, 2018 at 10:07 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Storing the number of attributes in the now-unused t_tid seems to me not such
> a good idea. a) it could (and, I suppose, should) be a separate patch; at least
> it's not directly connected to the covering patch, and it could be added even
> before the covering patch.

I think that we should do that first. It's not very hard.

> b) I don't like the idea of limiting usage of that field if we can avoid it.
> Future work could use it, for example, for different compression techniques
> or something else.

The extra status bits that this would leave within the offset field
can be used for that in the future.

>> * It makes diagnosing issues in the field quite a bit easier. Tools
>> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
>> amcheck specifically). The nbtree IndexTuple format should not need to
>> be interpreted in a context-sensitive way, if we can avoid it.
>
> Both pg_filedump and amcheck can correctly parse any tuple based on the
> BTP_LEAF flag and the length of the tuple.

amcheck doesn't just care about the length of the tuple. It would have
to rely on catalog metadata about this being an INCLUDE index.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Mar 27, 2018 at 10:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> b) I don't like the idea of limiting usage of that field if we can avoid
>> it. Future work could use it, for example, for different compression
>> techniques or something else.
>
> Or even remove t_tid from inner tuples to save 2 bytes in IndexTupleData. Of
> course, I remember about alignment, but that could be subject to change in
> the future too.

This is contradictory. You seem to be arguing that we need to preserve
on-disk compatibility for an optimization that throws out on-disk
compatibility.

Saving a single byte per internal IndexTuple is not worth it. We could
actually save 2 bytes in *all* nbtree pages, by halving the size of
ItemId for nbtree -- we don't need lp_len, which is redundant, and we
could reclaim one of the status bits too, to get back a full 16 bits.
Also, we could use suffix truncation to save at least one byte in
almost all cases, even with the thinnest possible
single-integer-attribute IndexTuples. What you describe just isn't
going to happen.
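
For reference, this is the line pointer layout in question, from
src/include/storage/itemid.h (the redundancy claim above is that in nbtree
lp_len always duplicates what IndexTupleSize() reads from t_info):

    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;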

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi!

> On 21 March 2018, at 21:51, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
>
> I took a look at this patchset.  I have some notes about it.
>
> * I see the patch changes the dblink, amcheck and tcn contribs.  It would be nice to add corresponding
> checks to the dblink and amcheck regression tests.  It would be good to do the same with the tcn contrib.
> But tcn doesn't have regression tests at all, and it's out of scope for this patch to add regression
> tests to tcn.  So, it's OK to just check that it works correctly with covering indexes (I hope that's
> already been done by other reviewers).
>
I propose the attached tests for amcheck and dblink. They are not very extensive, but enough to keep things working.
> * I think that the subscription regression tests in src/test/subscription should make some use
> of covering indexes.  Logical decoding and subscriptions heavily use primary keys,
> so they need to be tested to work correctly with covering indexes.
I've attached the subscription tests. Unfortunately, they crash the publisher with:
2018-03-28 15:09:05.953 +05 [81805] 001_rep_changes.pl LOG:  statement: DELETE FROM tab_cov WHERE a > 20
2018-03-28 15:09:05.954 +05 [81691] LOG:  server process (PID 81805) was terminated by signal 11: Segmentation fault
Either of these commands triggers the crash:
$node_publisher->safe_psql('postgres', "DELETE FROM tab_cov WHERE a > 20");
$node_publisher->safe_psql('postgres', "UPDATE tab_cov SET a = -a");


I didn't succeed in debugging it. Maybe Anastasia can comment on whether it's a bug or something wrong with the tests?
>
> * I also think some isolation tests in src/test/isolation need to check covering indexes too.
> In particular insert-conflict-*.spec and lock-*.spec and probably more.
So far I haven't been able to compose good test scenarios, but I will think about it a bit more.

Best regards, Andrey Borodin.



Attachment

Re: WIP: Covering + unique indexes.

From
Anastasia Lubennikova
Date:
Here is the new version of the patch set.
All patches are rebased to apply without conflicts.

Besides that, they contain the following fixes:
- the pg_dump bug is fixed
- index_truncate_tuple() now has a 3rd argument, new_indnatts
- new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
- the info for opclass implementors about included columns is now in the SGML docs

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Attachment

Re: WIP: Covering + unique indexes.

From
Erik Rijkers
Date:
On 2018-03-28 16:59, Anastasia Lubennikova wrote:
> Here is the new version of the patch set.

I can't get these to apply:

patch -b -l -F 25 -p 1 < 
/home/aardvark/download/pgpatches/0110/covering_indexes/20180328/0001-Covering-core-v8.patch


1 out of 19 hunks FAILED -- saving rejects to file 
src/backend/utils/cache/relcache.c.rej


$ cat src/backend/utils/cache/relcache.c.rej
--- src/backend/utils/cache/relcache.c
+++ src/backend/utils/cache/relcache.c
@@ -542,7 +542,7 @@
                 attp = (Form_pg_attribute) GETSTRUCT(pg_attribute_tuple);

                 if (attp->attnum <= 0 ||
-                       attp->attnum > relation->rd_rel->relnatts)
+                       attp->attnum > RelationGetNumberOfAttributes(relation))
                         elog(ERROR, "invalid attribute number %d for %s",
                                  attp->attnum, RelationGetRelationName(relation));





Erik Rijkers



Re: WIP: Covering + unique indexes.

From
Peter Eisentraut
Date:
On 1/25/18 23:19, Thomas Munro wrote:
> +  PRIMARY KEY ( <replaceable
> class="parameter">column_name</replaceable> [, ... ] ) <replaceable
> class="parameter">index_parameters</replaceable> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [,
> ...])</optional> |
> 
> I hadn't seen that use of "<optional>" before.  Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql
help.  Check that please.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides that, they contain the following fixes:
> - the pg_dump bug is fixed
> - index_truncate_tuple() now has a 3rd argument, new_indnatts
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - the info for opclass implementors about included columns is now in the SGML docs

This only changes the arguments given to index_truncate_tuple(), which
is a superficial change. It does not actually change anything about
the on-disk representation, which is what I sought. Why is that a
problem? I don't think it's very complicated.

The patch needs a rebase, as Erik mentioned:

1 out of 19 hunks FAILED -- saving rejects to file
src/backend/utils/cache/relcache.c.rej
(Stripping trailing CRs from patch; use --binary to disable.)

I also noticed that you still haven't done anything differently with
this code in _bt_checksplitloc(), which I mentioned in April of last
year:

    /* Account for all the old tuples */
    leftfree = state->leftspace - olddataitemstoleft;
    rightfree = state->rightspace -
        (state->olddataitemstotal - olddataitemstoleft);

    /*
     * The first item on the right page becomes the high key of the left page;
     * therefore it counts against left space as well as right space.
     */
    leftfree -= firstrightitemsz;

    /* account for the new item */
    if (newitemonleft)
        leftfree -= (int) state->newitemsz;
    else
        rightfree -= (int) state->newitemsz;

With an extreme enough case, this could result in a failure to find a
split point. Or at least, if that isn't true then it's not clear why,
and I think it needs to be explained.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
Hi!

I've revised the patchset.  It has improved comments and documentation.

I also updated some tests:
 * I've fixed the checks on adding primary keys with included
columns in index_including.sql.  Previously all attempts to
add primary keys failed; I made some of them pass.
 * pg_index_def() is now covered by regression tests.
 * I made some use of covering indexes in isolation tests,
because covering indexes made some changes to row-level
locks.  Instead of adding extra tests, which could significantly
increase isolation check runtime, I just replaced some of the
indexes with covering indexes in existing tests.

This patchset also includes Anastasia's fix for logical subscription.

On Fri, Mar 30, 2018 at 2:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides that, they contain the following fixes:
> - the pg_dump bug is fixed
> - index_truncate_tuple() now has a 3rd argument, new_indnatts
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - the info for opclass implementors about included columns is now in the SGML docs

This only changes the arguments given to index_truncate_tuple(), which
is a superficial change. It does not actually change anything about
the on-disk representation, which is what I sought. Why is that a
problem? I don't think it's very complicated.

I'll try it.  But I'm afraid that it's not as easy as you expect.
 
The patch needs a rebase, as Erik mentioned:

1 out of 19 hunks FAILED -- saving rejects to file
src/backend/utils/cache/relcache.c.rej
(Stripping trailing CRs from patch; use --binary to disable.)

I also noticed that you still haven't done anything differently with
this code in _bt_checksplitloc(), which I mentioned in April of last
year:

    /* Account for all the old tuples */
    leftfree = state->leftspace - olddataitemstoleft;
    rightfree = state->rightspace -
        (state->olddataitemstotal - olddataitemstoleft);

    /*
     * The first item on the right page becomes the high key of the left page;
     * therefore it counts against left space as well as right space.
     */
    leftfree -= firstrightitemsz;

    /* account for the new item */
    if (newitemonleft)
        leftfree -= (int) state->newitemsz;
    else
        rightfree -= (int) state->newitemsz;

With an extreme enough case, this could result in a failure to find a
split point. Or at least, if that isn't true then it's not clear why,
and I think it needs to be explained.

I don't think this could result in a failure to find a split point.
Currently, it finds a split point without taking into account that the hikey
will be shorter.  If such a split point exists, then a split point with a
truncated hikey also exists.  If not, then it would be a failure even
without covering indexes.  I've updated the comment accordingly.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Fri, Mar 30, 2018 at 4:24 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Mar 30, 2018 at 2:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides that, they contain the following fixes:
> - the pg_dump bug is fixed
> - index_truncate_tuple() now has a 3rd argument, new_indnatts
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - the info for opclass implementors about included columns is now in the SGML docs

This only changes the arguments given to index_truncate_tuple(), which
is a superficial change. It does not actually change anything about
the on-disk representation, which is what I sought. Why is that a
problem? I don't think it's very complicated.

I'll try it.  But I'm afraid that it's not as easy as you expect.

So, I have an implementation of storing the number of attributes inside the
index tuple itself.  I made it an additional patch on top of the previous patchset.
I attach the whole patchset in order to make commitfest.cputube.org happy.

I decided not to use the 13th bit of the IndexTuple flags.  Instead I use only the high bit
of the offset, which is also always free in regular tuples.  In fact, we already rely on the
assumption that an index tuple offset has at most 11 significant bits in
GIN (see ginpostinglist.c).
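
For reference, the encoding that embodies this assumption looks roughly
like the following (paraphrased from ginpostinglist.c, so treat the details
as approximate):

    #define MaxHeapTuplesPerPageBits    11

    /* Pack a heap TID into 32 + 11 = 43 bits, ready for varbyte encoding */
    static uint64
    itemptr_to_uint64(const ItemPointer iptr)
    {
        uint64      val;

        Assert(ItemPointerGetOffsetNumber(iptr) <
               (1 << MaxHeapTuplesPerPageBits));

        val = ItemPointerGetBlockNumber(iptr);
        val <<= MaxHeapTuplesPerPageBits;
        val |= ItemPointerGetOffsetNumber(iptr);

        return val;
    }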

Anastasia also pointed out that if we're going to make on-disk changes, they
should be compatible not only with suffix truncation, but also with duplicate
compression (which was already posted in thread [1]).  However, I think
there is no problem.  We can use one of the 3 free bits in the offset as a flag
that the tuple has a posting list.  Duplicate compression needs to store the
number of posting list items and their offset within the tuple.  The free bits
left in the item pointer after reserving 2 bits (1 flag for an alternative meaning
of the offset and 1 flag for a posting list) are more than enough for that.
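
To make that bit budget concrete (these names are hypothetical):

    #define BT_OFFSET_IS_ALT     0x8000  /* offset holds metadata, not a position */
    #define BT_HAS_POSTING_LIST  0x4000  /* tuple ends with a posting list */
    #define BT_OFFSET_DATA_MASK  0x3FFF  /* 14 bits left for natts / item counts */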

However, I see the following arguments against implementing this feature
as part of covering indexes.

 * We write the number of attributes present into the tuple.  But how do we prove
that it's correct?  I've added appropriate checks to amcheck, but I don't think all
users run amcheck frequently enough.  Thus, in order to be sure that it's
correct, we would have to check that the number of attributes is written correctly
everywhere in the B-tree code.  Without that, we could face a situation where we've
introduced a new on-disk representation meant to enable further B-tree enhancements,
but it's actually broken.  And that would be much worse than nothing.
In order to check the number of attributes everywhere in the B-tree code, we
would need to actually implement a significant part of suffix truncation.  And I
really think we shouldn't do that as part of the covering indexes patch.
 * The offset number is now used for parent refind (see the BTEntrySame() macro).
In the attached patch, this condition is relaxed.  But I don't think I really like
that.  This should be thought out very carefully...
 * Right now, hikeys are copied together with their original t_tids.  That makes it
possible to find the origin of a hikey.  If we override the offset in t_tid, that is
no longer always possible.
 * When an index tuple is truncated, pageinspect probably shouldn't show an
offset for it, because it is meaningless.  Should it rather show the number of
attributes in a separate column?  Anyway, that should be part of the suffix truncation
patch, not part of the covering indexes patch, and especially not added at the last moment.
 * I don't really see how covering indexes without storing the number of
index tuple attributes in the tuple itself would block future work on suffix truncation.
The code we have after covering indexes doesn't expect more than nkeyatts
attributes in pivot tuples.  So, suffix truncation will make them
(sometimes) even shorter.  And that smaller number of attributes may be
stored in the tuple itself.  But a default pivot tuple would still be assumed to have
nkeyatts.  I see no problem there.

So, taking into account the arguments above, I propose to give up on the
idea of sticking the covering indexes and suffix truncation features together.
That wouldn't accelerate the appearance of one feature after the other, but
would rather likely RIP both of them...


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> I'll try it.  But I'm afraid that it's not as easy as you expect.
>
>
> So, I have an implementation of storing the number of attributes inside the
> index tuple itself.  I made it an additional patch on top of the previous
> patchset.
> I attach the whole patchset in order to make commitfest.cputube.org happy.

Looks like 0004-* became mangled. Can you send a version that is not
mangled, please?

> I decided not to use the 13th bit of the IndexTuple flags.  Instead I use
> only the high bit of the offset, which is also always free in regular tuples.
> In fact, we already rely on the assumption that an index tuple offset has at
> most 11 significant bits in GIN (see ginpostinglist.c).

So? GIN doesn't have the same legacy at all. The GIN posting lists
*don't* have regular heap TID pointers at all. They started out
without them, and still don't have them.

> Anastasia also pointed out that if we're going to make on-disk changes, they
> should be compatible not only with suffix truncation, but also with duplicate
> compression (which was already posted in thread [1]).

I definitely agree with that, and I think that Anastasia should push
for whatever will make future nbtree enhancements easier, especially
her own pending or planned enhancements.

> However, I think
> there is no problem.  We can use one of the 3 free bits in the offset as a
> flag that the tuple has a posting list.  Duplicate compression needs to store
> the number of posting list items and their offset within the tuple.  The free
> bits left in the item pointer after reserving 2 bits (1 flag for an
> alternative meaning of the offset and 1 flag for a posting list) are more
> than enough for that.

The issue that I see is that we could easily make this unambiguous,
free of any context, with a tiny bit more work. Why not just do it
that way?

Maybe it won't actually matter, but I see no reason not to do it, since we can.

> However, I see the following arguments against implementing this feature
> as part of covering indexes.
>
>  * We write the number of attributes present into the tuple.  But how do we
> prove that it's correct?  I've added appropriate checks to amcheck, but I
> don't think all users run amcheck frequently enough.  Thus, in order to be
> sure that it's correct, we would have to check that the number of attributes
> is written correctly everywhere in the B-tree code.

Use an assertion. Problem solved.

I agree that people aren't using amcheck all that much, but give it
time. Oracle and SQL Server have had tools like amcheck for 30+ years.
We have had amcheck for one year.

> Without that, we could face a situation where we've introduced a new on-disk
> representation meant to enable further B-tree enhancements, but it's actually
> broken.  And that would be much worse than nothing.
> In order to check the number of attributes everywhere in the B-tree code, we
> would need to actually implement a significant part of suffix truncation.  And
> I really think we shouldn't do that as part of the covering indexes patch.

I don't think that you need to do that, actually. I'm not asking you
to go to those lengths. I have only asked that you make the on-disk
representation *compatible* with a future Postgres version that has
full suffix truncation (and other such enhancements, too). I care
about the on-disk representation more than the code.

>  * The offset number is now used for parent refind (see the BTEntrySame()
> macro).  In the attached patch, this condition is relaxed.  But I don't think
> I really like that.  This should be thought out very carefully...

It's safe, although I admit that that's a bit hard to see.
Specifically, look at this code in _bt_insert_parent():

        /*
         * Find the parent buffer and get the parent page.
         *
         * Oops - if we were moved right then we need to change stack item! We
         * want to find parent pointing to where we are, right ?    - vadim
         * 05/27/97
         */
        ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
        pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

Vadim doesn't seem too sure of why he did it that way. What's clear is
that the offset on all internal pages is always P_HIKEY (that is, 1),
because this is the one and only place where new IndexTuples get
generated for internal pages. That's unambiguous. So how could
BTEntrySame() truly need to care about offset? How could there ever be
an internal page offset that wasn't just P_HIKEY? You can look
yourself, using pg_hexedit or pageinspect.

The comments above BTTidSame()/BTEntrySame() are actually wrong,
including "New Comments". Vadim wanted to make TIDs part of the
keyspace [1], beginning in around 1997. The idea was that we'd have
truly unique keys by including TID, as L&Y intended, but that never
happened. Instead, we got commit 9e85183bf in 2000, which among many
other things changed the L&Y invariant to deal with duplicates. I
think that Tom should have changed BTTidSame() to not care about
offset number in that same commit from 2000.

I actually think that Vadim was correct to want to make heap TID a
unique-ifier, and that that's the best long term solution [2].
Unfortunately, the code that he committed in the late 1990s didn't
really help -- how could it help without including the *entire* heap
TID? This BTTidSame() offset thing seems to be related to some weird
logic for duplicates that Tom killed in 9e85183bf, if it ever made
sense. Note that _bt_getstackbuf(), the only code that uses
BTEntrySame(), does not look at the offset directly -- because it's
always P_HIKEY.

Anyway...

>  * Right now, hikeys are copied together with their original t_tids.  That
> makes it possible to find the origin of a hikey.  If we override the offset
> in t_tid, that is no longer always possible.

....that just leaves the original high key at the leaf level, as you
say here. You're right that there is theoretically a loss of forensic
information from actually storing something in the offset at the leaf
level, and storing something interesting in the offset during the
first phase of a page split (not the second, where the aforementioned
_bt_insert_parent() function gets called). I don't think it's worth
worrying about, though.

The fact is that that information can go out of date almost
immediately, whereas high keys usually last forever. The only reason
that there is a heap TID in the high key is because we'd have to add
special code to remove it; not because it has any real value. I find
it very hard to imagine it being used in a forensic situation. If you
actually wanted to do this, the key itself is probably enough -- you
probably wouldn't need the TID.

> * When an index tuple is truncated, pageinspect probably shouldn't show an
> offset for it, because it is meaningless.  Should it rather show the number
> of attributes in a separate column?  Anyway, that should be part of the
> suffix truncation patch, not part of the covering indexes patch, and
> especially not added at the last moment.

Nobody asked you to write a suffix truncation patch. That has
complexity above and beyond what the covering index patch needs. I
just expect it to be compatible with an eventual suffix truncation
patch, which you've now shown is quite possible. It is clearly a
complementary technique.

> * I don't really see how covering indexes without storing the number of
> index tuple attributes in the tuple itself would block future work on suffix
> truncation.

It makes it harder. Your new version gives amcheck a way of
determining the expected number of attributes. That's the main reason
to have it, more so than the suffix truncation issue. Suffix
truncation matters a lot too, though.

> So, taking into account the arguments above, I propose to give up on the
> idea of sticking the covering indexes and suffix truncation features together.
> That wouldn't accelerate the appearance of one feature after the other, but
> would rather likely RIP both of them...

I think that the thing that's more likely to kill this patch is the
fact that after the first year, it only ever got discussed in the
final CF. That's not something that happened because of my choices. I
made several offers of my time. I did not create this urgency.

[1] https://www.postgresql.org/message-id/18788.963953289@sss.pgh.pa.us
[2]
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Mar 30, 2018 at 10:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It's safe, although I admit that that's a bit hard to see.
> Specifically, look at this code in _bt_insert_parent():
>
>         /*
>          * Find the parent buffer and get the parent page.
>          *
>          * Oops - if we were moved right then we need to change stack item! We
>          * want to find parent pointing to where we are, right ?    - vadim
>          * 05/27/97
>          */
>         ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
>         pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
> Vadim doesn't seem too sure of why he did it that way. What's clear is
> that the offset on all internal pages is always P_HIKEY (that is, 1),
> because this is the one and only place where new IndexTuples get
> generated for internal pages. That's unambiguous. So how could
> BTEntrySame() truly need to care about offset? How could there ever be
> an internal page offset that wasn't just P_HIKEY? You can look
> yourself, using pg_hexedit or pageinspect.

Sorry, I meant this code, right before:

        /* form an index tuple that points at the new right page */
        new_item = CopyIndexTuple(ritem);
        ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Mar 30, 2018 at 6:24 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> With an extreme enough case, this could result in a failure to find a
>> split point. Or at least, if that isn't true then it's not clear why,
>> and I think it needs to be explained.
>
>
> I don't think this could result in a failure to find a split point.
> Currently, it finds a split point without taking into account that the hikey
> will be shorter.  If such a split point exists, then a split point with a
> truncated hikey also exists.  If not, then it would be a failure even
> without covering indexes.  I've updated the comment accordingly.

You're right. We're going to truncate the unneeded trailing attributes
from whatever tuple is to the immediate right of the final split point
that we choose (that's the tuple that we'll copy to make a new high
key for the left page). Truncation already has to result in a tuple
that is less than or equal to the original tuple.

I also agree that it isn't worth trying harder to make sure that space
is distributed evenly when truncation will go ahead. It will only
matter in very rare cases, but the computational overhead of having an
accurate high key size for every candidate split point would be
significant.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
Hi, Peter!

On Sat, Mar 31, 2018 at 8:39 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> I'll try it.  But I'm afraid that it's not as easy as you expect.
>
>
> So, I have an implementation of storing the number of attributes inside the
> index tuple itself.  I made it an additional patch on top of the previous
> patchset.
> I attach the whole patchset in order to make commitfest.cputube.org happy.

Looks like 0004-* became mangled. Can you send a version that is not
mangled, please?
 
Oh, sorry about that.  I forgot to remove some .orig and .rej files, and they
accidentally appeared in the patch.  The correct version is attached.

> I decided not to use the 13th bit of the IndexTuple flags.  Instead I use
> only the high bit of the offset, which is also always free in regular tuples.
> In fact, we already rely on the assumption that an index tuple offset has at
> most 11 significant bits in GIN (see ginpostinglist.c).

So? GIN doesn't have the same legacy at all. The GIN posting lists
*don't* have regular heap TID pointers at all. They started out
without them, and still don't have them.

Yes, GIN has never stored heap TID pointers in the t_tid of an index tuple.  But GIN
assumes that a heap TID's offset number has at most 11 significant bits during
posting list encoding.

> However, I see the following arguments against implementing this feature
> as part of covering indexes.
>
>  * We write the number of attributes present into the tuple.  But how do we
> prove that it's correct?  I've added appropriate checks to amcheck, but I
> don't think all users run amcheck frequently enough.  Thus, in order to be
> sure that it's correct, we would have to check that the number of attributes
> is written correctly everywhere in the B-tree code.

Use an assertion. Problem solved.
 
I don't think we should use assertions, because they are typically disabled in
production PostgreSQL builds.  But we can have an explicit check on some
common path.  In the attached patch I've added such a check to _bt_compare().  Probably,
together with amcheck, that would be sufficient.

>  * The offset number is now used for parent refind (see the BTEntrySame()
> macro).  In the attached patch, this condition is relaxed.  But I don't think
> I really like that.  This should be thought out very carefully...

It's safe, although I admit that that's a bit hard to see.
Specifically, look at this code in _bt_insert_parent():

        /*
         * Find the parent buffer and get the parent page.
         *
         * Oops - if we were moved right then we need to change stack item! We
         * want to find parent pointing to where we are, right ?    - vadim
         * 05/27/97
         */
        ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
        pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

Vadim doesn't seem too sure of why he did it that way. What's clear is
that the offset on all internal pages is always P_HIKEY (that is, 1),
because this is the one and only place where new IndexTuples get
generated for internal pages. That's unambiguous. So how could
BTEntrySame() truly need to care about offset? How could there ever be
an internal page offset that wasn't just P_HIKEY? You can look
yourself, using pg_hexedit or pageinspect.

The comments above BTTidSame()/BTEntrySame() are actually wrong,
including "New Comments". Vadim wanted to make TIDs part of the
keyspace [1], beginning in around 1997. The idea was that we'd have
truly unique keys by including TID, as L&Y intended, but that never
happened. Instead, we got commit 9e85183bf in 2000, which among many
other things changed the L&Y invariant to deal with duplicates. I
think that Tom should have changed BTTidSame() to not care about
offset number in that same commit from 2000.

I actually think that Vadim was correct to want to make heap TID a
unique-ifier, and that that's the best long term solution [2].
Unfortunately, the code that he committed in the late 1990s didn't
really help -- how could it help without including the *entire* heap
TID? This BTTidSame() offset thing seems to be related to some weird
logic for duplicates that Tom killed in 9e85183bf, if it ever made
sense. Note that _bt_getstackbuf(), the only code that uses
BTEntrySame(), does not look at the offset directly -- because it's
always P_HIKEY.

Anyway...

OK, thanks for the explanation.  I agree that the check of the offset is redundant here.
 
>  * Right now, hikeys are copied together with their original t_tids.  That
> makes it possible to find the origin of a hikey.  If we override the offset
> in t_tid, that is no longer always possible.

....that just leaves the original high key at the leaf level, as you
say here. You're right that there is theoretically a loss of forensic
information from actually storing something in the offset at the leaf
level, and storing something interesting in the offset during the
first phase of a page split (not the second, where the aforementioned
_bt_insert_parent() function gets called). I don't think it's worth
worrying about, though.

The fact is that that information can go out of date almost
immediately, whereas high keys usually last forever. The only reason
that there is a heap TID in the high key is because we'd have to add
special code to remove it; not because it has any real value. I find
it very hard to imagine it being used in a forensic situation. If you
actually wanted to do this, the key itself is probably enough -- you
probably wouldn't need the TID.

I don't know.  When I wrote my own implementation of a B-tree and debugged
it, I found saving hikeys "as is" to be very valuable for debugging.
However, B-trees in PostgreSQL are quite mature, and probably
don't need so much debug information.
 
> * When an index tuple is truncated, pageinspect probably shouldn't show an
> offset for it, because it is meaningless.  Should it rather show the number
> of attributes in a separate column?  Anyway, that should be part of the
> suffix truncation patch, not part of the covering indexes patch, and
> especially not added at the last moment.

Nobody asked you to write a suffix truncation patch. That has
complexity above and beyond what the covering index patch needs. I
just expect it to be compatible with an eventual suffix truncation
patch, which you've now shown is quite possible. It is clearly a
complementary technique.
 
OK, but a change to the on-disk tuple format also changes what people
see in pageinspect.  Right now, they see "1" as the offset for tuples in internal
pages and for hikeys.  After the patch, they would see some large values
(assuming we set some of the high bits) in the offset.  I'm not sure that's OK.
We should probably change the display of index tuples in pageinspect.

> * I don't really see how covering indexes without storing the number of
> index tuple attributes in the tuple itself would block future work on suffix
> truncation.

It makes it harder. Your new version gives amcheck a way of
determining the expected number of attributes. That's the main reason
to have it, more so than the suffix truncation issue.
 
I'm sorry, I don't understand.  The new version of amcheck determines
the expected number of attributes and compares that to the number of
attributes stored in the offset number.  But I can get the *expected* number of
attributes even without storing it in the offset number...

Suffix truncation matters a lot too, though.
 
Sure, that's a great feature.

> So, taking into account the arguments above, I propose to give up on the
> idea of sticking the covering indexes and suffix truncation features together.
> That wouldn't accelerate the appearance of one feature after the other, but
> would rather likely RIP both of them...

I think that the thing that's more likely to kill this patch is the
fact that after the first year, it only ever got discussed in the
final CF. That's not something that happened because of my choices. I
made several offers of my time. I did not create this urgency.

I'm sorry; my comment was only about a particular feature that I'm not the
biggest fan of (that doesn't mean I'm strictly against it).

I'd like to note that I really appreciate your attention to this patch
as well as other patches.  

Anyway, let me know what you think about the attached
0004-Covering-natts-v10.patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN has never stored heap TID pointers in the t_tid of an index tuple.
> But GIN assumes that a heap TID's offset number has at most 11 significant
> bits during posting list encoding.

I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I will admit to being paranoid here. But experience suggests that
paranoia is a good thing, if it isn't too expensive. Look at the
thread on XFS + fsync() for an example of things being wrong for a
very long time without anyone realizing, and despite the best efforts
of many smart people. As far as anyone can tell, PostgreSQL on Linux +
XFS is kinda, sorta broken, and has been forever. XFS was mature
before ext4 was, and is a popular choice, and yet this is the first
we're hearing about it being kind of broken. After many years.

Look at this check that made it into my amcheck patch, that was
committed yesterday:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745

As it says, nbtree is surprisingly tolerant of corrupt lp_len fields.
You may find it an interesting exercise to use pg_hexedit to corrupt
many lp_len fields in an index page. What's really interesting about
this is that it doesn't appear to break anything at all! We don't get
the length from there in most cases, so reads won't break at all. I
see that we use ItemIdGetLength() in a couple of rare cases (though
even those could be avoided) during a page split. You'd be lucky to
notice a problem if lp_len fields were regularly corrupt. When you
notice, it will probably have already caused big problems.

On a similar note, I've noticed that many of my experimental B-Tree
patches (that I never find time to finish) tend to almost work quite
early on, sometimes without my really understanding why. The whole L&Y
approach of recovering from problems that were detected (detecting
concurrent page splits, and moving right) makes the code *very*
forgiving. I hope that I don't sound trite, but everyone should try to
be modest about what they *don't* know when writing complex system
software with concurrency. It is not a platitude, even though it
probably seems that way. A tiny mistake can have big consequences, so
it's very important that we have a way to easily detect them after the
fact.

> I don't think we should use assertions, because they are typically disabled
> in production PostgreSQL builds.  But we can have an explicit check on some
> common path.  In the attached patch I've added such a check to _bt_compare().
> Probably, together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (!_bt_check_natts(rel, page, offnum))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }
> +
>     itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work. Also, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).
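
Concretely, those two suggestions might look like this (a sketch only; the
revised _bt_check_natts() signature taking the already-fetched tuple is
hypothetical):

    itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

    /* "can't happen" check, annotated as the cold path */
    if (unlikely(!_bt_check_natts(rel, itup, page, offnum)))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("tuple has wrong number of attributes in index \"%s\"",
                        RelationGetRelationName(rel))));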

Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.

> OK, thanks for the explanation.  I agree that the check of the offset is
> redundant here.

Cool.

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know.  When I wrote my own implementation of a B-tree and debugged
> it, I found saving hikeys "as is" to be very valuable for debugging.

I would like to see your implementation at some point. That sounds interesting.

> However, B-trees in PostgreSQL are quite mature, and probably
> don't need so much debug information.

Today, the highkey at the leaf level is an exact copy of the right
sibling's first item immediately after the split. The absence of a
usable heap TID offset (due to using it for number of attributes in
high keys) is unlikely to make it harder to locate that right
sibling's first item (to get a full heap TID), which could have moved
a lot further right after the split, or even have been removed
entirely. It could now be ambiguous where it wouldn't have been before
in the event of duplicates, but it's unlikely. And when it does
happen, it's unlikely to matter.

We can still include the heap block number, I suppose. I think of the
highkey as only having one simple job -- separating the keyspace
between siblings. We actually have a very neat choke point to check
that it does that one job -- when a high key is generated for a page
split at the leaf level. If we were doing generic suffix truncation,
we'd add a test that made sure that the high key was strictly greater
than the last item on the left, and strictly less than the first item
on the right. As I said yesterday, I don't like how we allow a highkey
to be equal to both sides of the split, which goes against L&Y, and I
think that we would at least be strict about < and > for suffix
truncation.

The highkey's actual value can be invented, provided it does this one
simple job, which needs to be assessed only once at our "neat choke
point". Everything else falls into place afterwards, since that's
where the downlink actually comes from. You can check it during a leaf
page split while debugging (that's the neat choke point). That's why
the high key doesn't seem very interesting from a debuggability
perspective.

>> Nobody asked you to write a suffix truncation patch. That has
>> complexity above and beyond what the covering index patch needs. I
>> just expect it to be compatible with an eventual suffix truncation
>> patch, which you've now shown is quite possible. It is clearly a
>> complementary technique.
>
>
> OK, but a change to the on-disk tuple format also changes what people
> see in pageinspect.  Right now, they see "1" as the offset for tuples in
> internal pages and for hikeys.  After the patch, they would see some large
> values (assuming we set some of the high bits) in the offset.  I'm not sure
> that's OK.  We should probably change the display of index tuples in
> pageinspect.

This reminds me of a discussion I had with Robert Haas about
pageinspect + t_infomask bits. Robert thought that we should show the
composite bits as single constants, where we do that (with things like
HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
show "the bits that are on the page", while also documenting that this
situation exists in pageinspect directly.

I think something similar applies here. I think it's okay to just show the offset,
provided it is documented. We have a number of odd things within
nbtree that I actually saw to it were documented, such as the "minus
infinity" item on internal pages, which looks odd and out of place. I
remember Tatsuo Ishii asked about it before this happened. It seems
helpful to show what's really there, and offer guidance on how to
interpret it. I actually thought carefully about many things like this
for pg_hexedit, which tries to be very consistent and logical, uses
color to suggest meaning, and so on.

Anyway, that's what I think about it, though I wouldn't really care if
I lost that particular argument and we did something special with
internal page offset in pageinspect. It seems like a matter of
opinion, or aesthetics.

> I'm sorry, I do not understand.  The new version of amcheck determines
> the expected number of attributes and compares that to the number of
> attributes stored in the offset number.  But I can get the *expected* number of
> attributes even without storing them also in the offset number...

Maybe I was confused.

> I'd like to note that I really appreciate your attention to this patch
> as well as other patches.

Thanks. I would like to thank Anastasia and you for your patience and
perseverance, despite what I see as mistakes in how this project was
managed. I really want for it to be possible for there to be more
patches in the nbtree code, because they're really needed. That was a
big part of my motivation for writing amcheck, in fact. It's tedious
to link this patch to a bigger picture about what we need to do with
nbtree in the next 5 years, but I think that that's what it will take
to get this patch in. That's my opinion.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Mon, Apr 2, 2018 at 1:18 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN never stored heap TID pointers in the t_tid of index tuples.  But GIN
> assumes that a heap TID pointer has at most 11 significant bits during
> posting list encoding.

I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I thought about that another time, and I decided that it would be safer
to use the 13th bit in the index tuple flags.  There is already an attempt to
use the whole 6 bytes of the tid for non-heap-pointer information [1].  Thus, it
would be safe to use the 13th bit for indicating an alternative offset meaning
in pivot tuples, because it wouldn't block further work.  The revised patchset
in the attachment implements it.

> I don't think we should use assertions, because they are typically disabled
> on
> production PostgreSQL builds.  But we can have some explicit check in some
> common path.  In the attached patch I've such check to _bt_compare().
> Probably,
> together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (!_bt_check_natts(rel, page, offnum))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }
> +
>     itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work.

I don't know.  We still need an offset number to check the expected number
of attributes.  Passing the index tuple as a separate argument would be
redundant and open the door to extra possible errors.
 
Also, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).
 
OK, marked that check as unlikely().

Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.

Yes, I store zero attributes in the "minus infinity" item.  See this
part of the patch.

@@ -2081,7 +2081,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
  left_item_sz = sizeof(IndexTupleData);
  left_item = (IndexTuple) palloc(left_item_sz);
  left_item->t_info = left_item_sz;
- ItemPointerSet(&(left_item->t_tid), lbkno, P_HIKEY);
+ ItemPointerSetBlockNumber(&(left_item->t_tid), lbkno);
+ BTreeTupSetNAtts(left_item, 0);

However, note that I have to store (number_of_attributes + 1) in the offset
in order to correctly store zero attributes.  Otherwise, an assertion
fails in the ItemPointerIsValid() macro.

/*
 * ItemPointerIsValid
 * True iff the disk item pointer is not NULL.
 */
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))
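
For illustration, the setter under this "+1" scheme might look roughly
like this (a sketch, not the exact macro from the patchset):

/*
 * Store the attribute count as (n + 1) in the offset, so that a pivot
 * tuple with zero attributes still has ip_posid != 0 and passes
 * ItemPointerIsValid().
 */
#define BTreeTupSetNAtts(itup, n) \
	do { \
		(itup)->t_info |= INDEX_ALT_TID_MASK; \
		ItemPointerSetOffsetNumber(&(itup)->t_tid, (OffsetNumber) ((n) + 1)); \
	} while (0)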

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know.  When I wrote my own implementation of a B-tree and debugged
> it, I found saving hikeys "as is" to be very valuable for debugging.

I would like to see your implementation at some point. That sounds interesting.

It was an in-memory B-tree [2].  It will be published one day.
 
> However, B-trees in PostgreSQL are quite mature, and probably
> don't need so much debug information.

Today, the highkey at the leaf level is an exact copy of the right
sibling's first item immediately after the split. The absence of a
usable heap TID offset (due to using it for number of attributes in
high keys) is unlikely to make it harder to locate that right
sibling's first item (to get a full heap TID), which could have moved
a lot further right after the split, or even have been removed
entirely. It could now be ambiguous where it wouldn't have been before
in the event of duplicates, but it's unlikely. And when it does
happen, it's unlikely to matter.

We can still include the heap block number, I suppose. I think of the
highkey as only having one simple job -- separating the keyspace
between siblings. We actually have a very neat choke point to check
that it does that one job -- when a high key is generated for a page
split at the leaf level. If we were doing generic suffix truncation,
we'd add a test that made sure that the high key was strictly greater
than the last item on the left, and strictly less than the first item
on the right. As I said yesterday, I don't like how we allow a highkey
to be equal to both sides of the split, which goes against L&Y, and I
think that we would at least be strict about < and > for suffix
truncation.

The highkey's actual value can be invented, provided it does this one
simple job, which needs to be assessed only once at our "neat choke
point". Everything else falls into place afterwards, since that's
where the downlink actually comes from. You can check it during a leaf
page split while debugging (that's the neat choke point). That's why
the high key doesn't seem very interesting from a debuggability
perspective.

OK.  So, I mentioned that storing hikeys "as is" might be useful
for debugging in some cases.  That doesn't mean it's irreplaceable.  For sure,
there are other ways to debug and obtain debugging information, which
could be even better in certain situations.

>> Nobody asked you to write a suffix truncation patch. That has
>> complexity above and beyond what the covering index patch needs. I
>> just expect it to be compatible with an eventual suffix truncation
>> patch, which you've now shown is quite possible. It is clearly a
>> complementary technique.
>
>
> OK, but a change of the on-disk tuple format also changes what people
> see in pageinspect.  Right now, they see "1" as the offset for tuples in internal
> pages and hikeys.  After the patch, they would see some large values
> (assuming we set some of the high bits) in the offset.  I'm not sure that's OK.
> We should probably change the display of index tuples in pageinspect.

This reminds me of a discussion I had with Robert Haas about
pageinspect + t_infomask bits. Robert thought that we should show the
composite bits as single constants, where we do that (with things like
HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
show "the bits that are on the page", while also documenting that this
situation exists in pageinspect directly.

I think something similar applies here. I think it's okay to just show the offset,
provided it is documented. We have a number of odd things within
nbtree that I actually saw to it were documented, such as the "minus
infinity" item on internal pages, which looks odd and out of place. I
remember Tatsuo Ishii asked about it before this happened. It seems
helpful to show what's really there, and offer guidance on how to
interpret it. I actually thought carefully about many things like this
for pg_hexedit, which tries to be very consistent and logical, uses
color to suggest meaning, and so on.

Anyway, that's what I think about it, though I wouldn't really care if
I lost that particular argument and we did something special with
internal page offset in pageinspect. It seems like a matter of
opinion, or aesthetics.
 
I just thought that users might be confused when the "1" offset in internal
pages becomes "32769" or something.  However, with the attached patchset
it would at least be a more obvious value, which is easier to interpret
in decimal form.

> I'd like to note that I really appreciate your attention to this patch
> as well as other patches.

Thanks. I would like to thank Anastasia and you for your patience and
perseverance, despite what I see as mistakes in how this project was
managed. I really want for it to be possible for there to be more
patches in the nbtree code, because they're really needed. That was a
big part of my motivation for writing amcheck, in fact. It's tedious
to link this patch to a bigger picture about what we need to do with
nbtree in the next 5 years, but I think that that's what it will take
to get this patch in. That's my opinion.

Yes.  But that depends on how the difficulty of adapting the patch to the big
picture compares with the difficulty that a non-adapted patch creates for that
big picture.  My point was that the second difficulty isn't high.  And if we
can be satisfied with the implementation in the attached patchset (probably
some small enhancements are still required), then the first difficulty isn't
high either.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I thought about that another time, and I decided that it would be safer
> to use the 13th bit in the index tuple flags.  There is already an attempt to
> use the whole 6 bytes of the tid for non-heap-pointer information [1].  Thus, it
> would be safe to use the 13th bit for indicating an alternative offset meaning
> in pivot tuples, because it wouldn't block further work.  The revised patchset
> in the attachment implements it.

This is definitely not the only time someone has talked about this
13th bit -- it's quite coveted. It also came up with UPSERT, and with
WARM. That's just the cases that I can personally remember.

I'm glad that you found a way to make this work, that will keep things
flexible for future patches, and make testing easier. I think that we
can find a flexible representation that makes almost everyone happy.

> I don't know.  We still need an offset number to check the expected number
> of attributes.  Passing the index tuple as a separate argument would be
> redundant and open the door to extra possible errors.

You're right. I must have been tired when I wrote that. :-)

>> Do you store an attribute number in the "minus infinity" item (the
>> leftmost one of internal pages)? I guess that that should be zero,
>> because it's totally truncated.
>
>
> Yes, I store zero attributes in the "minus infinity" item.  See this
> part of the patch.

> However, note that I have to store (number_of_attributes + 1) in the offset
> in order to correctly store zero attributes.  Otherwise, an assertion
> fails in the ItemPointerIsValid() macro.

Makes sense.

> Yes.  But that depends on how the difficulty of adapting the patch to the
> big picture compares with the difficulty that a non-adapted patch creates
> for that big picture.  My point was that the second difficulty isn't high.
> And if we can be satisfied with the implementation in the attached patchset
> (probably some small enhancements are still required), then the first
> difficulty isn't high either.

I think it's possible.

I didn't have time to look at this properly today, but I will try to
do so tomorrow.

Thanks
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Tue, Apr 3, 2018 at 7:02 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I thought abut that another time and I decided that it would be safer
> to use 13th bit in index tuple flags.  There are already attempt to
> use whole 6 bytes of tid for not heap pointer information [1].  Thus, it
> would be safe to use 13th bit for indicating alternative offset meaning
> in pivot tuples, because it wouldn't block further work.  Revised patchset
> in the attachment implements it.

This is definitely not the only time someone has talked about this
13th bit -- it's quite coveted. It also came up with UPSERT, and with
WARM. That's just the cases that I can personally remember.

I'm glad that you found a way to make this work, that will keep things
flexible for future patches, and make testing easier. I think that we
can find a flexible representation that makes almost everyone happy.

OK, good. 

I didn't have time to look at this properly today, but I will try to
do so tomorrow.

Great, I'm looking forward to your feedback.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great, I'm looking forward to your feedback.

I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.

* What's a pivot tuple?

This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.

Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.

* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
                                             ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
   Form_pg_opclass opclassTup;
                   ^

* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that that would catch
(probably not during regression tests, but let's at least do that
much).

* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.

* I see that we use "- 1" with attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index)  \
> +   ( \
> +       (itup)->t_info & INDEX_ALT_TID_MASK ? \
> +       ( \
> +           AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
> +           ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK - 1 \
> +       ) \
> +       : \
> +       IndexRelationGetNumberOfAttributes(index) \
> +   )

Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression tests failure because of
this, that would probably be due to a real bug. No?
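
As a sketch of what I mean, reusing your macro's names (not meant as
final code):

/*
 * With ItemPointerGetOffsetNumberNoCheck() -- which, like GIN's
 * accessor, skips the ip_posid != 0 assertion -- the attribute count
 * can be stored in the offset directly, with no "+1"/"-1" adjustment.
 */
#define BtreeTupGetNAtts(itup, index)  \
	( \
		((itup)->t_info & INDEX_ALT_TID_MASK) ? \
		(ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK) \
		: \
		IndexRelationGetNumberOfAttributes(index) \
	)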

* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not make the low bits of the offset work in a
simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.

* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?

* Ditto for _bt_pgaddtup().
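
As a sketch of the kind of fix I mean (INDEX_SIZE_MASK is the existing
mask from itup.h; the patch doesn't have to do it exactly this way):

/*
 * Reset only the size bits of t_info, so that flag bits such as
 * INDEX_ALT_TID_MASK survive the truncation.
 */
trunctuple = *itup;
trunctuple.t_info = (trunctuple.t_info & ~INDEX_SIZE_MASK) |
	sizeof(IndexTupleData);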

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.

It's clear that you can just use _bt_sortaddtup() (and leave
_bt_pgaddtup() private) because _bt_sortaddtup() is only different to
_bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when
P_ISLEAF(). Or have I missed something?

* For inserts, this patch performs an extra truncation step on the
same high key that we'd use with a plain (non-covering/include) index.
That's pretty clean. But it seems more complicated for
nbtsort.c/_bt_buildadd(). I think that a comment should say that we
cannot just rearrange item pointers for high key on the old page when
we also truncate, because overwriting the P_HIKEY position ItemId with
the old page's former final ItemId (whose tuple ended up becoming the
first tuple on new/right page) fails to actually save any space. We
need to truly shift around IndexTuples on the page in order to save
space (both PageIndexTupleDelete() and PageAddItem() end up shifting
both the ItemId array and some IndexTuple space).

Also, maybe say that the performance here really isn't so bad, because
we reclaim IndexTuple space close to the middle of the hole in the
page with our PageIndexTupleDelete(), and then use almost the *same*
space within PageAddItem(). There is not actually that much physical
shifting around for IndexTuples. It turns out that it's not that
different. (You can probably find a better, more succinct way of
putting this -- I'm tired now.)

* I suggest that you teach _bt_check_natts() to expect zero attributes
for "minus infinity" items. It looks like amcheck contrib regression
tests don't pass because you don't look for that (P_FIRSTDATAKEY() is
the "minus infinity" item on internal pages).

* bt_target_page_check() should also have a !P_ISLEAF() check, since
with a covering index every tuple will have INDEX_ALT_TID_MASK. This
should call _bt_check_natts() for each item, including the "minus
infinity" items.

* "minus infinity" items don't have the right number of attributes
set, in at least some cases that I saw. The number matched other
internal items, and wasn't 0 or whatever. Maybe the
ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it
actually could be 0 safely, rather than natts + 1 as you said, which
would be nice.

* I would reorder the comment to match the order of the code:

> +   /*
> +    * Pivot tuples stored in non-leaf pages and hikeys of leaf pages should
> +    * have nkeyatts number of attributes.  While regular tuples of leaf pages
> +    * should have natts number of attributes.
> +    */
> +   if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
> +       return (BtreeTupGetNAtts(itup, index) == natts);
> +   else
> +       return (BtreeTupGetNAtts(itup, index) == nkeyatts);

* Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion.
Maybe add it to _bt_check_natts().
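
Something like this, presumably (StaticAssertStmt() is the existing
helper from c.h):

/*
 * Compile-time check: the offset mask must be wide enough to
 * represent any legal number of index attributes.
 */
StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
				 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");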

* README-SSI says:

    * The effects of page splits, overflows, consolidations, and
removals must be carefully reviewed to ensure that predicate locks
aren't "lost" during those operations, or kept with pages which could
get re-used for different parts of the index.

Do we need to worry about that here? I guess not, because this is just
like having many duplicates. But a note just above the _bt_doinsert()
call to CheckForSerializableConflictIn() might be a good idea.

That's all I have for today.
--
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
Hi!

Thank you for review!  Revised patchset is attached.

On Wed, Apr 4, 2018 at 6:08 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great, I'm looking forward your feedback.

I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.

* What's a pivot tuple?

This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.
 
Good, let's use the "pivot tuple" term.

Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.

I've added some explanation to nbtree README, nbtree.h and
offset_is_negative_infinity().
 
* Compiler warning:

/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
                                             ^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
   Form_pg_opclass opclassTup;
                   ^

Thank you for pointing that out; fixed.
 
* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that that would catch
(probably not during regression tests, but let's at least do that
much).

Good point.  Tests with "heapallindexed" were added.  I also find that it's useful to
check both an index built by sorting and an index built by insertions, because
there are different ways of forming tuples.

* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:

10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR:  wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL:  Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
 
Right.  Sorry, it appears that I've posted a patch with non-working regression tests.
Now they seem to pass.

* I see that we use "- 1" with attribute number, like this:

> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index)  \
> +   ( \
> +       (itup)->t_info & INDEX_ALT_TID_MASK ? \
> +       ( \
> +           AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
> +           ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK - 1 \
> +       ) \
> +       : \
> +       IndexRelationGetNumberOfAttributes(index) \
> +   )

Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?

Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression tests failure because of
this, that would probably be due to a real bug. No?
 
Ok.  I've tried to remove both the assertions and the "+1" hack.  That works
for me.  However, I had to touch a lot of places; not sure if that's a problem.

* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not make the low bits of the offset work in a
simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.

Right, since the error is related to the number of attributes, we should report
the observed number of attributes explicitly here.

* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.

* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?

* Ditto for _bt_pgaddtup().
 
Yes, "minus infinity" items hadn't new status bit INDEX_ALT_TID_MASK set.
And that's because errors in _bt_sortaddtup() and _bt_pgaddtup() that you
pointed.  Fixed. thanks.

* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.

It's clear that you can just use _bt_sortaddtup() (and leave
_bt_pgaddtup() private) because _bt_sortaddtup() is only different to
_bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when
P_ISLEAF(). Or have I missed something?
 
Agreed.  I also see no point in exposing _bt_pgaddtup().  I've replaced it
with _bt_sortaddtup(), and it appears to work.

* For inserts, this patch performs an extra truncation step on the
same high key that we'd use with a plain (non-covering/include) index.
That's pretty clean. But it seems more complicated for
nbtsort.c/_bt_buildadd(). I think that a comment should say that we
cannot just rearrange item pointers for high key on the old page when
we also truncate, because overwriting the P_HIKEY position ItemId with
the old page's former final ItemId (whose tuple ended up becoming the
first tuple on new/right page) fails to actually save any space. We
need to truly shift around IndexTuples on the page in order to save
space (both PageIndexTupleDelete() and PageAddItem() end up shifting
both the ItemId array and some IndexTuple space).

Also, maybe say that the performance here really isn't so bad, because
we reclaim IndexTuple space close to the middle of the hole in the
page with our PageIndexTupleDelete(), and then use almost the *same*
space within PageAddItem(). There is not actually that much physical
shifting around for IndexTuples. It turns out that it's not that
different. (You can probably find a better, more succinct way of
putting this -- I'm tired now.)

I wrote a comment there.  Please check it.
 
* I suggest that you teach _bt_check_natts() to expect zero attributes
for "minus infinity" items. It looks like amcheck contrib regression
tests don't pass because you don't look for that (P_FIRSTDATAKEY() is
the "minus infinity" item on internal pages).

Sure, thank you for catching that.
 
* bt_target_page_check() should also have a !P_ISLEAF() check, since
with a covering index every tuple will have INDEX_ALT_TID_MASK. This
should call _bt_check_natts() for each item, including the "minus
infinity" items.

Yes, every item, including the "minus infinity" item, should be checked for the
number of attributes.  However, I didn't get how that relates to !P_ISLEAF().
In order to check the "minus infinity" item, I've just pulled _bt_check_natts() up
before the offset_is_negative_infinity() check.

Regarding !P_ISLEAF(), I think we should check every item on both
leaf and non-leaf pages.  I think that is how the code now works, unless I'm
missing something.

* "minus infinity" items don't have the right number of attributes
set, in at least some cases that I saw. The number matched other
internal items, and wasn't 0 or whatever. Maybe the
ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it
actually could be 0 safely, rather than natts + 1 as you said, which
would be nice.

Yes, "minus infinity" items didn't have number of attributes set, because
_bt_sortaddtup() and _bt_pgaddtup() didn't handle it as you pointed
above.
 
* I would reorder the comment to match the order of the code:

> +   /*
> +    * Pivot tuples stored in non-leaf pages and hikeys of leaf pages should
> +    * have nkeyatts number of attributes.  While regular tuples of leaf pages
> +    * should have natts number of attributes.
> +    */
> +   if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
> +       return (BtreeTupGetNAtts(itup, index) == natts);
> +   else
> +       return (BtreeTupGetNAtts(itup, index) == nkeyatts);
 
Thanks for pointing that out.  Since there are now three cases, including the
handling of "minus infinity" items, the comment is now split into three.

* Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion.
Maybe add it to _bt_check_natts().
 
Done.

* README-SSI says:

    * The effects of page splits, overflows, consolidations, and
removals must be carefully reviewed to ensure that predicate locks
aren't "lost" during those operations, or kept with pages which could
get re-used for different parts of the index.

Do we need to worry about that here? I guess not, because this is just
like having many duplicates. But a note just above the _bt_doinsert()
call to CheckForSerializableConflictIn() might be a good idea.

I don't see relations between this patchset and SSI.  We just
change the representation of some index tuples in pages.  However,
we didn't change the order of page modification, the order
of page lookup, and so on.  Yes, we change the size of some tuples,
but the B-tree already worked with tuples of variable sizes.  So the fact
that tuples now have a different size shouldn't affect SSI.  Right now,
I'm not sure that CheckForSerializableConflictIn() just above the
_bt_doinsert() call is a good idea.  But even if so, I think that should be
the subject of a separate patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for review!  Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

> -       /* Log left page */
> -       if (!isleaf)
> -       {
> -           /*
> -            * We must also log the left page's high key, because the right
> -            * page's leftmost key is suppressed on non-leaf levels.  Show it
> -            * as belonging to the left page buffer, so that it is not stored
> -            * if XLogInsert decides it needs a full-page image of the left
> -            * page.
> -            */
> -           itemid = PageGetItemId(origpage, P_HIKEY);
> -           item = (IndexTuple) PageGetItem(origpage, itemid);
> -           XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -       }
> +       /*
> +        * We must also log the left page's high key.  There are two reasons
> +        * for that: right page's leftmost key is suppressed on non-leaf levels,
> +        * in covering indexes, included columns are truncated from high keys.
> +        * For simplicity, we don't distinguish these cases, but log the high
> +        * key every time.  Show it as belonging to the left page buffer, so
> +        * that it is not stored if XLogInsert decides it needs a full-page
> +        * image of the left page.
> +        */
> +       itemid = PageGetItemId(origpage, P_HIKEY);
> +       item = (IndexTuple) PageGetItem(origpage, itemid);
> +       XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

So should we remove the first block of code? Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?

* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

* _bt_isequal() should assert that the IndexTuple was not truncated.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parenthesis here.

* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.
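
At a minimum, I'd expect something like this at the top of
_bt_truncate_tuple(), assuming that INDEX_ALT_TID_MASK marks every
tuple that was already truncated (a sketch only):

/* "Truncate only once": the tuple must not carry the alternative
 * offset meaning yet, i.e. it was never truncated before. */
Assert(((itup)->t_info & INDEX_ALT_TID_MASK) == 0);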

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.
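
That is, roughly:

/* itup.h */
#define INDEX_AM_RESERVED_BIT	0x2000	/* reserved for index-AM specific usage */

#define INDEX_ALT_TID_MASK		INDEX_AM_RESERVED_BIT	/* nbtree */

/* hash.h */
#define INDEX_MOVED_BY_SPLIT_MASK	INDEX_AM_RESERVED_BIT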

* We should "use" one of the 4 new status bit that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.

I guess you could say that I want us to reserve one of our 4 reserve bits.

* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
>  scanned to decide whether to return the entry and whether the scan can
>  stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation.  Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences: "Note that pivot index tuples are only used to
represent which part of the key space belongs on each page, and can
have attribute values copied from non-pivot tuples that were deleted
and killed by VACUUM some time ago. In principle, we could truncate
away attributes that are not needed for a page high key during a leaf
page split, provided that the remaining attributes distinguish the
last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as
belonging on the right page. This optimization is sometimes called
suffix truncation, and may appear in a future release. Since the high
key is subsequently reused as the downlink in the parent page for the
new right page, suffix truncation can increase index fan-out
considerably by keeping pivot tuples short. INCLUDE indexes similarly
truncate away non-key attributes at the time of a leaf page split,
increasing fan-out."

> Good point.  Tests with "heapallindexed" were added.  I also find that it's
> useful to check both an index built by sorting and an index built by
> insertions, because there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.

> Ok.  I've tried to remove both the assertions and the "+1" hack.  That works
> for me.  However, I had to touch a lot of places; not sure if that's a
> problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

* You missed this comment, which is now not accurate:

> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page).  And we have assertion that offset number is greater or equal
> + * to 1.  This is why we store (number_of_attributes + 1) in offset number.
> + */

I can see that it is actually 0 for a minus infinity item, which is good.

> I wrote a comment there.  Please check it.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages.  I think that is how the code now works, unless I'm
> missing something.

It does, and should. Thanks.

> Thanks for pointing that out.  Since there are now three cases, including the
> handling of "minus infinity" items, the comment is now split into three.

That looks good. Thanks.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

> I don't see relations between this patchset and SSI.  We just
> change the representation of some index tuples in pages.  However,
> we didn't change the order of page modification, the order
> of page lookup, and so on.  Yes, we change the size of some tuples,
> but the B-tree already worked with tuples of variable sizes.  So the fact
> that tuples now have a different size shouldn't affect SSI.  Right now,
> I'm not sure that CheckForSerializableConflictIn() just above the
> _bt_doinsert() call is a good idea.  But even if so, I think that should be
> the subject of a separate patch.

My point was that that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

That's it for today.
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Erik Rijkers
Date:
On 2018-04-05 00:09, Alexander Korotkov wrote:
> Hi!
> 
> Thank you for review!  Revised patchset is attached.
> [0001-Covering-core-v12.patch]
> [0002-Covering-btree-v12.patch]
> [0003-Covering-amcheck-v12.patch]
> [0004-Covering-natts-v12.patch]

Really nice performance gains.

I read through the docs and made some changes.  I hope it can count as
an improvement.

It would probably also be a good idea to add the term "covering index"
somewhere, at least in the documentation's index; the term does not
currently occur anywhere.  (This doc-patch does not add it.)

thanks,

Erik Rijkers

Attachment

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Thu, Apr 5, 2018 at 5:02 PM, Erik Rijkers <er@xs4all.nl> wrote:
On 2018-04-05 00:09, Alexander Korotkov wrote:
Thank you for review!  Revised patchset is attached.
[0001-Covering-core-v12.patch]
[0002-Covering-btree-v12.patch]
[0003-Covering-amcheck-v12.patch]
[0004-Covering-natts-v12.patch]

Really nice performance gains.

I read through the docs and made some changes.  I hope it can count as an improvement.

Thank you for your improvements to the docs.  Your changes will be
incorporated into the new revision of the patchset, which I'm going to post today.
 
It would probably also be a good idea to add the term "covering index" somewhere, at least in the documentation's index; the term does not currently occur anywhere.  (This doc-patch does not add it.)

I'll think about it.  Maybe we'll define "covering index" in the docs.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
Hi!

Thank you for review.  Revised patchset is attached.

On Thu, Apr 5, 2018 at 5:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for review!  Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

So should we remove the first block of code?

Right, I think there is absolutely no need for this code.  It's removed in
the attached patchset.
 
Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?

Agreed.  We don't need to hold the right buffer to get the hikey from it.
The only remaining concern is concurrency on the standby.  But the right page
is unreferenced at this point, and nobody should try to read it before we
finish the split.
 
* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().
 
OK, I've added an assertion that the number of tuple attributes should be
either natts or nkeyatts, because we call _bt_mkscankey() for
pivot index tuples too.
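
In other words, roughly this (a sketch of the added assertion):

/*
 * A tuple passed to _bt_mkscankey() is either a regular tuple (all
 * attributes) or a pivot tuple (key attributes only).
 */
Assert(BTreeTupGetNAtts(itup, rel) == IndexRelationGetNumberOfAttributes(rel) ||
	   BTreeTupGetNAtts(itup, rel) == IndexRelationGetNumberOfKeyAttributes(rel));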

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

If you're talking about these assertions
 
Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
Assert(indnkeyatts != 0);

then I would rather leave both of them.  Knowing that the index tuple
length is either natts or nkeyatts doesn't make you sure that
both natts and nkeyatts are non-zero.

* _bt_isequal() should assert that the IndexTuple was not truncated.

Agreed.  An assertion is added.  I had to change the signature of _bt_isequal()
to do that.  However, that shouldn't cause any problems: _bt_isequal()
is static anyway.

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parenthesis here.

Right.  The order is switched.  We now check the number of attributes
before checking for the "minus infinity" item.  The extra parentheses are removed.
 
* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.

I've done so.  Tests are passing for me.

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.
 
Good point, done.

* We should "use" one of the 4 new status bit that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.
 
Hmm, we have four bits reserved.  But I'm not sure whether we would use
*all* of them for non-pivot tuples.  Probably we would use some of them for
pivot tuples.  I don't know that in advance.  Thus, I propose not to
rename this.  But I've added a comment that non-pivot tuples might also
use those bits.

I guess you could say that I want us to reserve one of our 4 reserve bits.

Sorry, I didn't get which particular further use of the reserved bits you mean.
Did you mean key normalization?
 
* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
>  scanned to decide whether to return the entry and whether the scan can
>  stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation.  Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences: "Note that pivot index tuples are only used to
represent which part of the key space belongs on each page, and can
have attribute values copied from non-pivot tuples that were deleted
and killed by VACUUM some time ago. In principle, we could truncate
away attributes that are not needed for a page high key during a leaf
page split, provided that the remaining attributes distinguish the
last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as
belonging on the right page. This optimization is sometimes called
suffix truncation, and may appear in a future release. Since the high
key is subsequently reused as the downlink in the parent page for the
new right page, suffix truncation can increase index fan-out
considerably by keeping pivot tuples short. INCLUDE indexes similarly
truncate away non-key attributes at the time of a leaf page split,
increasing fan-out."

Thank you for writing that explanation.  Looks good.

> Good point.  Tests with "heapallindexed" were added.  I also find that it's
> useful to check both an index built by sorting and an index built by
> insertions, because there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

Ok. 

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.

Ok.  I've added a relevant comment.

> Ok.  I've tried to remove both the assertions and the "+1" hack.  That works
> for me.  However, I had to touch a lot of places; not sure if that's a
> problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

Ok.
 
* You missed this comment, which is now not accurate:

> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page).  And we have assertion that offset number is greater or equal
> + * to 1.  This is why we store (number_of_attributes + 1) in offset number.
> + */
 
Right.  This comment is no longer needed, removed.

I can see that it is actually 0 for a minus infinity item, which is good.

Ok.
 
> I wrote a comment there.  Please check it.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages.  I think that is how the code now works, unless I'm
> missing something.

It does, and should. Thanks.

> Thanks for pointing that out.  Since there are now three cases, including the
> handling of "minus infinity" items, the comment is now split into three.

That looks good. Thanks.

Ok.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

I've added a comment about that.

> I don't see relations between this patchset and SSI.  We just
> change the representation of some index tuples in pages.  However,
> we didn't change the order of page modification, the order
> of page lookup, and so on.  Yes, we change the size of some tuples,
> but the B-tree already worked with tuples of variable sizes.  So the fact
> that tuples now have a different size shouldn't affect SSI.  Right now,
> I'm not sure that CheckForSerializableConflictIn() just above the
> _bt_doinsert() call is a good idea.  But even if so, I think that should be
> the subject of a separate patch.

My point was that that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

Ok.  I've added this comment. 

This patchset also incorporates the docs enhancements by Erik Rijkers and
a sentence which states that indexes with included columns are also called
"covering indexes".

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

 
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * btree_xlog_split() still has this code:

> Right, I think there is absolutely no need in this code.  It's removed in
> the attached patchset.

I'm now a bit nervous about always logging the high key, since that
could impact performance. I think that there is a good way to only do
it when needed. New plan:

1. Add these new codes to the split record's set of xl_info codes (they
should be placed directly after XLOG_BTREE_SPLIT_R):

#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */

2. Within _bt_split(), restore the old "leaf vs. internal" logic, so
that the high key is only logged for internal (!isleaf) pages.
However, only log it when needed for leaf pages -- only when the new
highkey was *actually* truncated (or when it's an internal page), since
only then will it actually be different from the first item on the right
page. Also, set XLOG_BTREE_SPLIT_L_HIGHKEY instead of
XLOG_BTREE_SPLIT_L when we must log (or set XLOG_BTREE_SPLIT_R_HIGHKEY
instead of XLOG_BTREE_SPLIT_R), so that recovery actually knows that
it should restore the truncated highkey.

(Sometimes I think it would be nice to be able to do more during
recovery, but that's a much bigger issue.)

3. Restore all the master code within btree_xlog_split(), except
instead of restoring the high key when !isleaf, do so when the record
is XLOG_BTREE_SPLIT_L_HIGHKEY|XLOG_BTREE_SPLIT_R_HIGHKEY.

4. Add an assertion within btree_xlog_split(), that ensures that
internal pages never fail to have their high key logged, since there
is no reason why that should ever not happen with internal pages.

5. Fix this struct xl_btree_split comment, which commit 0c504a80 from
2017 missed when it reclaimed two xl_info status bits:

 * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
 * The _L and _R variants indicate whether the inserted tuple went into the
 * left or right split page (and thus, whether newitemoff and the new item
 * are stored or not).  The _ROOT variants indicate that we are splitting
 * the root page, and thus that a newroot record rather than an insert or
 * split record should follow.  Note that a split record never carries a
 * metapage update --- we'll do that in the parent-level update.

6. Add your own xl_btree_split comment in its place, noting the new
usage. Basically, the _ROOT sentence with a similar _HIGHKEY sentence.

7. Don't forget about btree_desc().

I'd say that there is a good chance that Anastasia is correct to think
that it isn't worth worrying about the extra WAL that her approach
implied, and that it is in fact good enough to simply always log the
left page's high key. However, it seems easier and lower risk all
around to do it this way. It doesn't leave us with ambiguity. In my
experience, *ambiguity* on design questions makes a patch miss a
release much more frequently than bugs or regressions make that
happen.

Sorry that I didn't just say this the first time I brought up
btree_xlog_split(). I didn't see the opportunity to avoid creating
more WAL until now.
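
To make step 2 concrete, the xl_info decision in _bt_split() might look
roughly like this (just a sketch; "newitemonleft" is an existing
_bt_split() argument, while "isleaf" and "hikeytruncated" stand in for
whatever local tests we settle on):

    uint8       xlinfo;
    bool        loghighkey;

    /*
     * Log the high key only when it can differ from the first item on the
     * right page: always for internal pages, and for leaf pages only when
     * the new high key was actually truncated.
     */
    loghighkey = !isleaf || hikeytruncated;

    if (newitemonleft)
        xlinfo = loghighkey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L;
    else
        xlinfo = loghighkey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R;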

> OK, I've added an assertion that the number of tuple attributes should be
> either natts or nkeyatts, because we call _bt_mkscankey() for
> pivot index tuples too.

Makes sense.

> If you're talking about these assertions
>
> Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
> indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> Assert(indnkeyatts != 0);

Actually, I was just talking about the first one,
"Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear.
Maybe it isn't worth getting rid of even the first one.

> then I would rather leave both of them.  Knowing that the index tuple
> length is either natts or nkeyatts doesn't make you sure that
> both natts and nkeyatts are non-zero.

I suppose.

> I've done so.  Tests are passing for me.

Great. I'm glad that worked out. One simple, broad rule.

> Hmm, we have four bits reserved.  But I'm not sure whether we would use
> *all* of them for non-pivot tuples.  Probably we would use some of them for
> pivot tuples.  I don't know that in advance.  Thus, I propose not to
> rename this.  But I've added a comment that non-pivot tuples might also
> use those bits.

Okay. Good enough.

> Sorry, I didn't get which particular further use of the reserved bits you
> mean.
> Did you mean key normalization?

I was being unclear. I was just reiterating my point about having a
non-pivot bit. It doesn't matter, though.

> Thank you for writing that explanation.  Looks good.

I think that once you realize how INCLUDE indexes don't change pivot
tuples, and actually understand what pivot tuples are, the patch seems
a lot less scary.

> This patchset also incorporates docs enhancements by Erik Rijkers and a
> sentence which states that indexes with included columns are also called
> "covering indexes".

Cool.

* Use <quote></quote> here:

> +       <para>
> +        Indexes with columns listed in the <literal>INCLUDE</literal> clause
> +        are also called "covering indexes".
> +       </para>

* Use <literal></literal> here:

> +       <para>
> +        In <literal>UNIQUE</literal> indexes, uniqueness is only enforced
> +        for key columns.  Columns listed in the <literal>INCLUDE</literal>
> +        clause have no effect on uniqueness enforcement.  Other constraints
> +        (PRIMARY KEY and EXCLUDE) work the same way.
> +       </para>

* Do the regression tests pass with COPY_PARSE_PLAN_TREES?

* Running pgindent would be nice. I see a bit of trailing whitespace,
and things like that.

* Please tweak the indentation here (perhaps a new line):

> @@ -927,6 +963,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>         last_off = P_FIRSTKEY;
>     }
>
> +   pageop = (BTPageOpaque) PageGetSpecialPointer(npage);
>     /*

* Does the optimizer README's PathKeys section need a sentence or two
on this patch?

I'm nervous about problems within the optimizer in general, since that
is an area that I am not particularly qualified to review. I hope that
someone with more experience in that area can take a look at it
specifically. I see that there are very few changes in the optimizer,
but in my experience that's often the problem when it comes to the
optimizer -- it lacks subtle things that it actually needs, rather
than having the wrong things.

* Does this existing build_index_pathkeys() comment need to be updated?

 * The result is canonical, meaning that redundant pathkeys are removed;
 * it may therefore have fewer entries than there are index columns.
 *
 * Another reason for stopping early is that we may be able to tell that
 * an index column's sort order is uninteresting for this query.  However,
 * that test is just based on the existence of an EquivalenceClass and not
 * on position in pathkey lists, so it's not complete.  Caller should call
 * truncate_useless_pathkeys() to possibly remove more pathkeys.

* I don't think that there is much point in having separate 0003 +
0004 patches. For the next revision, please squash those down into
0002. Actually, maybe there should be only one patch for the next
revision. Up to you.

* Please write commit messages for your patches. I like to make these
part of the review process.

That's all for now.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Fri, Apr 6, 2018 at 5:00 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * btree_xlog_split() still has this code:

> Right, I think there is absolutely no need in this code.  It's removed in
> the attached patchset.

I'm now a bit nervous about always logging the high key, since that
could impact performance. I think that there is a good way to only do
it when needed. New plan:

1. Add these new codes to the split record's set of xl_info codes (they
should be placed directly after XLOG_BTREE_SPLIT_R):

#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */

2. Within _bt_split(), restore the old "leaf vs. internal" logic, so
that the high key is only logged for internal (!isleaf) pages.
However, only log it when needed for leaf pages -- only when the new
highkey was *actually* truncated (or when it's an internal page), since
only then will it actually be different from the first item on the right
page. Also, set XLOG_BTREE_SPLIT_L_HIGHKEY instead of
XLOG_BTREE_SPLIT_L when we must log (or set XLOG_BTREE_SPLIT_R_HIGHKEY
instead of XLOG_BTREE_SPLIT_R), so that recovery actually knows that
it should restore the truncated highkey.

(Sometimes I think it would be nice to be able to do more during
recovery, but that's a much bigger issue.)

3. Restore all the master code within btree_xlog_split(), except
instead of restoring the high key when !isleaf, do so when the record
is XLOG_BTREE_SPLIT_L_HIGHKEY|XLOG_BTREE_SPLIT_R_HIGHKEY.

4. Add an assertion within btree_xlog_split(), that ensures that
internal pages never fail to have their high key logged, since there
is no reason why that should ever not happen with internal pages.

5. Fix this struct xl_btree_split comment, which commit 0c504a80 from
2017 missed when it reclaimed two xl_info status bits:

 * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
 * The _L and _R variants indicate whether the inserted tuple went into the
 * left or right split page (and thus, whether newitemoff and the new item
 * are stored or not).  The _ROOT variants indicate that we are splitting
 * the root page, and thus that a newroot record rather than an insert or
 * split record should follow.  Note that a split record never carries a
 * metapage update --- we'll do that in the parent-level update.

6. Add your own xl_btree_split comment in its place, noting the new
usage. Basically, the _ROOT sentence with a similar _HIGHKEY sentence.

7. Don't forget about btree_desc().

I'd say that there is a good chance that Anastasia is correct to think
that it isn't worth worrying about the extra WAL that her approach
implied, and that it is in fact good enough to simply always log the
left page's high key. However, it seems easier and lower risk all
around to do it this way. It doesn't leave us with ambiguity. In my
experience, *ambiguity* on design questions makes a patch miss a
release much more frequently than bugs or regressions make that
happen.

Sorry that I didn't just say this the first time I brought up
btree_xlog_split(). I didn't see the opportunity to avoid creating
more WAL until now.
 
Done.  I would note that this aspect also caught my eye, but since I
didn't read the whole thread very carefully, I thought it had already
been decided that a small amount of extra WAL was harmless.

> If you're talking about these assertions
>
> Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
> indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> Assert(indnkeyatts != 0);

Actually, I was just talking about the first one,
"Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear.
Maybe it isn't worth getting rid of even the first one.
 
OK, "Assert(IndexRelationGetNumberOfAttributes(rel) != 0)" has been
removed.
 
* Use <quote></quote> here:

> +       <para>
> +        Indexes with columns listed in the <literal>INCLUDE</literal> clause
> +        are also called "covering indexes".
> +       </para>

* Use <literal></literal> here:

> +       <para>
> +        In <literal>UNIQUE</literal> indexes, uniqueness is only enforced
> +        for key columns.  Columns listed in the <literal>INCLUDE</literal>
> +        clause have no effect on uniqueness enforcement.  Other constraints
> +        (PRIMARY KEY and EXCLUDE) work the same way.
> +       </para>
 
Fixed, thanks.

* Do the regression tests pass with COPY_PARSE_PLAN_TREES?

I've checked.  "make check-world" does pass with COPY_PARSE_PLAN_TREES.

* Running pgindent would be nice. I see a bit of trailing whitespace,
and things like that.

I've run pgindent and added the relevant changes to the patch.
 
* Please tweak the indentation here (perhaps a new line):

> @@ -927,6 +963,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>         last_off = P_FIRSTKEY;
>     }
>
> +   pageop = (BTPageOpaque) PageGetSpecialPointer(npage);
>     /*

Done.
 
* Does the optimizer README's PathKeys section need a sentence or two
on this patch?
 
It definitely needs one.  I've added a sentence explaining that non-key
attributes don't have corresponding PathKeys.

I'm nervous about problems within the optimizer in general, since that
is an area that I am not particularly qualified to review. I hope that
someone with more experience in that area can take a look at it
specifically. I see that there are very few changes in the optimizer,
but in my experience that's often the problem when it comes to the
optimizer -- it lacks subtle things that it actually needs, rather
than having the wrong things.

I would just note that Postgres Pro has shipped a version of covering
indexes to customers.  The optimizer part there is pretty much the same
as in this patchset.  Given that we didn't run into optimizer issues
there, I expect them to be at least very rare.  I would also like to ask
my Postgres Pro colleague Alexander Kuzmenkov to take a look at the
optimizer part of this patch.  Even if we still miss something in the
optimizer after that, I expect it could be fixed after feature freeze.

* Does this existing build_index_pathkeys() comment need to be updated?

 * The result is canonical, meaning that redundant pathkeys are removed;
 * it may therefore have fewer entries than there are index columns.
 *
 * Another reason for stopping early is that we may be able to tell that
 * an index column's sort order is uninteresting for this query.  However,
 * that test is just based on the existence of an EquivalenceClass and not
 * on position in pathkey lists, so it's not complete.  Caller should call
 * truncate_useless_pathkeys() to possibly remove more pathkeys.

Yes, I've updated it.
 
* I don't think that there is much point in having separate 0003 +
0004 patches. For the next revision, please squash those down into
0002. Actually, maybe there should be only one patch for the next
revision. Up to you.

Agreed.  I've merged all the patches into one.
 
* Please write commit messages for your patches. I like to make these
part of the review process.

A commit message is included in the patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
As far as I can see, there is no on-disk representation difference for
*existing* indexes. So pg_upgrade support is not needed here, and there isn't
any new code for "on-the-fly" modification. Am I right?



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> As far as I can see, there is no on-disk representation difference for
> *existing* indexes. So pg_upgrade support is not needed here, and there
> isn't any new code for "on-the-fly" modification. Am I right?

Yes.

I'm going to look at this again today, and will post something within
12 hours. Please hold off on committing until then.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Fri, Apr 6, 2018 at 8:22 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> As far as I can see, there is no on-disk representation difference for
> *existing* indexes. So pg_upgrade support is not needed here, and there
> isn't any new code for "on-the-fly" modification. Am I right?

Yes.

I'm going to look at this again today, and will post something within
12 hours. Please hold off on committing until then.

Thank you.

Thinking about that again, I found that we should relax our requirements
for "minus infinity" items, because pg_upgraded indexes don't have any
special bits set for those items.

What do you think about applying the following patch on top of v14?

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 44605fb5a4..53dc47ff82 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -2000,8 +2000,12 @@ _bt_check_natts(Relation index, Page page, OffsetNumber offnum)
    }
    else if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
    {
-       /* Leftmost tuples on non-leaf pages have no attributes */
-       return (BTreeTupGetNAtts(itup, index) == 0);
+       /*
+        * Leftmost tuples on non-leaf pages have no attributes, or don't have
+        * INDEX_ALT_TID_MASK set in pg_upgraded indexes.
+        */
+       return (BTreeTupGetNAtts(itup, index) == 0 ||
+               ((itup->t_info & INDEX_ALT_TID_MASK) == 0));
    }
    else
    {

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Apr 6, 2018 at 10:33 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thinking about that again, I found that we should relax our requirements
> for "minus infinity" items, because pg_upgraded indexes don't have any
> special bits set for those items.
>
> What do you think about applying following patch on the top of v14?

It's clearly necessary. Looks fine to me.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Fri, Apr 6, 2018 at 8:42 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Apr 6, 2018 at 10:33 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thinking about that again, I found that we should relax our requirements
> for "minus infinity" items, because pg_upgraded indexes don't have any
> special bits set for those items.
>
> What do you think about applying following patch on the top of v14?

It's clearly necessary. Looks fine to me.

OK, incorporated into v15.  I've also added a sentence about pg_upgrade
to the commit message.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Fri, Apr 6, 2018 at 11:08 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> OK, incorporated into v15.  I've also added a sentence about pg_upgrade
> to the commit message.

I will summarize my feelings on this patch. I endorse committing the
patch, because I think that the benefits of committing it now
noticeably outweigh the costs. I have various caveats about pushing
the patch, but these are manageable.

Costs
=====

First, there is the question of risks, or costs. I think that this
patch has a negligible chance of being problematic in a way that will
become memorable. That seems improbable because the patch only really
changes the representation of what we're calling "pivot keys" (high
keys and internal page downlinks), which is something that VACUUM
doesn't care about. I see this patch as a special case of suffix
truncation, a technique that has been around since the 1970s. Although
you have to look carefully to see it, the amount of extra complexity
is pretty small, and the only place where a critical change is made is
during leaf page splits. As long as we get that right, everything else
should fall into place. There are no risks that I can see that are
related to concurrency, or that crop up when doing an anti-wraparound
VACUUM. There may be problems, but at least they won't be *pernicious*
problems that unravel over a long period of time.

The latest amcheck enhancement, and Alexander's recent changes to the
patch to make the on-disk representation explicit (not implicit)
should change things. We now have the tools to detect any corruption
problem that I can think of. For example, if there was some subtle
reason why assessing HOT safety broke, then we'd have a way of
mechanically detecting that without having to devise a custom test
(like the test Pavan happened to be using when the bug fixed by
2aaec654 was originally discovered). The lessons that I applied to
designing amcheck were in a few cases from actual experience with real
world bugs, including that 2aaec654 bug.

I hope that it goes without saying that I've also taken reasonable
steps to address all of these risks directly, by auditing code. And,
that this remains the first line of defense.

Here are the other specific issues that I see with the patch:

* It's possible that something was missed in the optimizer. I'm not sure.

I share the intuition that very little code is actually needed there,
but I'm far from the best person to judge whether or not some subtle
detail was missed.

* This seems out of date:

> +            * NOTE: It is not crucial for reliability in present, but maybe
> +            * it will be that in the future. Now the purpose is just to save
> +            * more space on inner pages of btree.

* CheckIndexCompatible() didn't seem to get the memo about this patch.
Maybe just a comment?

* It's possible that there are some more bugs in places like
relcache.c, or deparsing, or pg_dump, or indexcmds.c; perhaps simple
omissions, like the one I just mentioned. If there are, I don't expect
them to be particularly serious, or to make me reassess my basic
position. But there could be.

* I was wrong to suggest _bt_isequal() has an assertion against
truncation. It is called for the highkey. Suggest you weaken the
assertion, so it only applies when the offset isn't P_HIKEY on
non-rightmost page.

* Suggest adding a comment above BTStackData, about bts_btentry + offset.

* Suggest breaking BTEntrySame() into 3 lines, not 2.

* This comment needs to be updated:

/* get high key from left page == lowest key on new right page */

Suggest "get high key from left page == lower bound for new right page".

* This comment needs to be updated:

13th bit: unused

Suggest "13th bit: AM-defined meaning"

* Suggest adding a note that the use of P_HIKEY here is historical,
since it isn't used to match downlinks:

        /*
         * Find the parent buffer and get the parent page.
         *
         * Oops - if we were moved right then we need to change stack item! We
         * want to find parent pointing to where we are, right ?    - vadim
         * 05/27/97
         */
        ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
        pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

* I'm slightly concerned that this patch subtly breaks an optimization
within _bt_preprocess_keys(), or _bt_checkkeys(). I cannot find any
evidence of that, though, and I consider it unlikely, based on the
intuition that the simple Pathkey changes in the optimizer don't
provide the executor with a truly new set of constraints for index
scans. Also, if there was a problem here, it would be in the less
serious category of problems -- those that can't really affect anyone
not using the user-visible feature.

* The docs need some more polishing. Didn't spend very much time on this at all.

Benefits
========

There is also the matter of the benefits of this patch, that I think
are considerable, and far greater than they appear. This feature is a
great way to begin to add a broad variety of enhancements to nbtree
that we really need.

* The patch makes index-only scans a lot more compelling.

There are a couple of reasons why it's better to create indexes that
index perhaps as many as 4 or 7 columns to target index-only scans in
other database systems. I think that fan-out may be the main one. The
disadvantage that we have around HOT safety compared to other systems
seems less likely to be the problem when that many columns are
involved, and yet this is something that Oracle/SQL Server people do
frequently, and Postgres people don't really do at all. This is one thing
that suffix truncation improves automatically, but INCLUDE indexes can
make that general situation a lot better than truncation alone ever
could.

If you have an index where most columns are INCLUDE columns, and
compare that to an index with the same attributes that are indexed in
the conventional way, then I believe that you will have far fewer
problems with index bloat in some important cases. Apart from
everything else, this provides us with the opportunity to learn how to
mitigate index bloat problems in real world conditions, even without
INCLUDE indexes. We need to get smarter about problems with index
bloat.

* Suffix truncation works on the same principle, and is enabled by
this work. It's a prerequisite to making nbtree use the classic L&Y
approach, which assumes that all items in the index are unique.

We could just add heap TID to pivot tuples today, as an "extra"
column, while sorting on TID at the leaf level. This would make TID a
first class part of the key space -- a "unique-ifier", as L&Y
intended. But doing so naively would add enormous overhead, which
would simply be unacceptable. However, once we have suffix truncation,
the overhead is eliminated in virtually all cases. We get to move to
the classic L&Y invariant, simplifying the code, and we have a solid
basis for adding "retail index tuple deletion", which I believe is
almost essential for zheap. There is a good chance that Postgres
B-Trees are the only implementation in the world that doesn't have
truly unique keys. The design of nbtree would become a whole lot more
elegant if we could restore the classic "Ki < v <= Ki+1" invariant, as
Vadim intended over 20 years ago.

Somebody has to bite the bullet and start changing the representation
of pivot tuples to get these benefits (and many more). This seems like
an ideal place to start that process. I think that what we have here
addresses concerns from Tom [1], in particular.

The patch has been marked "Ready for Committer". While this patch is
primarily the responsibility of the committer, presumably Teodor in
this case, I will take some of the responsibility for the patch after
commit. Certainly, because I see the patch as strategically important,
I am willing to spend quite a lot of time after feature freeze, to
make sure that it is in good shape. I have a general interest in
making sure that amcheck gains acceptance as a way of validating a
complicated patch like this one after commit.

[1] https://www.postgresql.org/message-id/15195.1490988897%40sss.pgh.pa.us

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes. (the good and the bad)

From
Erik Rijkers
Date:
On 2018-04-06 20:08, Alexander Korotkov wrote:
> 
> [0001-Covering-v15.patch]
> 

After some more testing I notice there is also a down-side/slow-down to 
this patch that is not so bad but more than negligible, and I don't 
think it has been mentioned (but I may have missed something in this 
thread that's now been running for 1.5 years, not to mention the 
tangential btree-thread(s)).

I attach my test-program, which compares master (this morning) with 
covered_indexes (warning: it takes a while to generate the used tables).

The test tables are created as:
   create table $t (c1 int, c2 int, c3 int, c4 int);
   insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
   create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);

or for HEAD, just:
   create unique index ${t}unique_idx on $t using btree (c1, c2);


Here is typical output (edited a bit to prevent email-mangling):

test1:
-- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000   -- 250x
unpatched 6511: 100M rows Execution Time:  (normal/normal)  98 %  exec avg: 2.44
  patched 6976: 100M rows Execution Time: (covered/normal) 108 %  exec avg: 2.67
                                        test1 patched / unpatched: 109.49 %

test4:
-- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 and c3 < 20
unpatched 6511: 100M rows Execution Time:  (normal/normal)  95 %  exec avg: 1.56
  patched 6976: 100M rows Execution Time: (covered/normal)  60 %  exec avg: 0.95
                                        test4 patched / unpatched:  60.83 %


So the main good thing is that 60%, a good improvement -- but that ~109% 
(a slow-down) is also quite repeatable.

(there are more goodies from the patch (like improved insert-speed) 
but I just wanted to draw attention to this particular slow-down too)

I took all timings from explain analyze versions of the statements, on 
the assumption that that would be quite comparable to 'normal' querying. 
(please let me know if that introduces error).


# \dti+ nt0___1*
                                   List of relations
 Schema |               Name               | Type  |  Owner   |      Table      |  Size
--------+----------------------------------+-------+----------+-----------------+---------
 public | nt0___100000000                  | table | aardvark |                 | 4224 MB
 public | nt0___100000000uniqueinclude_idx | index | aardvark | nt0___100000000 | 3004 MB


(for what it's worth, I'm in favor of getting this patch into v11 
although I can't say I followed the technical details too much)


thanks,


Erik Rijkers




Attachment

Re: WIP: Covering + unique indexes. (the good and the bad)

From
Teodor Sigaev
Date:
Thank you!

>    create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) 
> include (c3, c4);
> or for HEAD, just:
>    create unique index ${t}unique_idx on $t using btree (c1, c2);



> -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 
> -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 
> and c3 < 20

Not a fair comparison; the INCLUDE index is twice as big because of the
included columns. Try comparing with a covering-emulated index:
create unique index ${t}unique_idx on $t using btree (c1, c2, c3, c4)

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes. (the good and the bad)

From
Alexander Korotkov
Date:
On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers <er@xs4all.nl> wrote:
On 2018-04-06 20:08, Alexander Korotkov wrote:

[0001-Covering-v15.patch]


After some more testing I notice there is also a down-side/slow-down to this patch that is not so bad but more than negligible, and I don't think it has been mentioned (but I may have missed something in this thread that's now been running for 1.5 years, not to mention the tangential btree-thread(s)).

I attach my test-program, which compares master (this morning) with covered_indexes (warning: it takes a while to generate the used tables).

The test tables are created as:
  create table $t (c1 int, c2 int, c3 int, c4 int);
  insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
  create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);

or for HEAD, just:
  create unique index ${t}unique_idx on $t using btree (c1, c2);

Do I understand correctly that you compare a unique index on (c1, c2) on master to a unique index on (c1, c2) include (c3, c4) on the patched version?
If so, then I think it's wrong to speak of a down-side/slow-down of this patch based on this comparison.
The patch *does not* cause a slowdown in this case.  The patch gives the user a *new option* which has its advantages and disadvantages.  And what you are comparing are the advantages and disadvantages of this option, not a slow-down of the patch.
Only if you compare *the same* index on master and the patched version can you speak of a slow-down of the patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> First, there is the question of risks, or costs. I think that this
I hope that's an acceptable risk.

> * It's possible that something was missed in the optimizer. I'm not sure.
> 
> I share the intuition that very little code is actually needed there,
> but I'm far from the best person to judge whether or not some subtle
> detail was missed.
Of course it's possible, but some variant of this patch is already used 
in a production environment and we haven't faced planner issues. Of 
course there could be some, but if so, they are so deep that I doubt 
they can be found easily.


> 
> * This seems out of date:
> 
>> +            * NOTE: It is not crucial for reliability in present, but maybe
>> +            * it will be that in the future. Now the purpose is just to save
>> +            * more space on inner pages of btree.
removed

> 
> * CheckIndexCompatible() didn't seem to get the memo about this patch.
> Maybe just a comment?
improved

> * I was wrong to suggest _bt_isequal() has an assertion against
> truncation. It is called for the highkey. Suggest you weaken the
> assertion, so it only applies when the offset isn't P_HIKEY on
> non-rightmost page.
Fixed
> 
> * Suggest adding a comment above BTStackData, about bts_btentry + offset.
see below

> 
> * Suggest breaking BTEntrySame() into 3 lines, not 2.
see below

> 
> * This comment needs to be updated:
> /* get high key from left page == lowest key on new right page */
> Suggest "get high key from left page == lower bound for new right page".
fixed

> 
> * This comment needs to be updated:
> 13th bit: unused
> 
> Suggest "13th bit: AM-defined meaning"
done

> * Suggest adding a note that the use of P_HIKEY here is historical,
> since it isn't used to match downlinks:
> 
>          /*
>           * Find the parent buffer and get the parent page.
>           *
>           * Oops - if we were moved right then we need to change stack item! We
>           * want to find parent pointing to where we are, right ?    - vadim
>           * 05/27/97
>           */
>          ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
>          pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
On closer look, bts_btentry.ip_posid is not used anymore, so I changed the 
bts_btentry type to BlockNumber. As a result, BTEntrySame() is removed.


> * The docs need some more polishing. Didn't spend very much time on this at all.
I suppose it should be a native English speaker, definitely not me.


I'm not very happy with the massive usage of 
ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); I suggest wrapping it 
in a macro, something like this:
#define BTreeInnerTupleGetDownLink(itup) \
    ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))

It would be nice to add an assertion in this macro that the tuple is an 
inner tuple, but as far as I can see that's impossible: inner and leaf 
tuples are indistinguishable. So I've added the 
BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink pair everywhere 
except a few places.
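
The setter counterpart would be something like this (a sketch, assuming
ItemPointerSetBlockNumber() as the obvious inverse):

#define BTreeInnerTupleSetDownLink(itup, blkno) \
	ItemPointerSetBlockNumber(&((itup)->t_tid), (blkno))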


If there's no strong objection, I intend to push it this evening.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Attachment

Re: WIP: Covering + unique indexes. (the good and the bad)

From
Erik Rijkers
Date:
On 2018-04-07 14:27, Alexander Korotkov wrote:
> On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers <er@xs4all.nl> wrote:
> 
>> On 2018-04-06 20:08, Alexander Korotkov wrote:
>> 
>>> [0001-Covering-v15.patch]
>>> 
>> After some more testing I notice there is also a down-side/slow-down 
>> to
>> this patch that is not so bad but more than negligible, and I don't 
>> think
>> it has been mentioned (but I may have missed something in this thread
>> that's now been running for 1.5 year, not to mention the tangential
>> btree-thread(s)).
>> 
>> I attach my test-program, which compares master (this morning) with
>> covered_indexes (warning: it takes a while to generate the used 
>> tables).
>> 
>> The test tables are created as:
>>   create table $t (c1 int, c2 int, c3 int, c4 int);
>>   insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, 
>> $rowcount)
>> as x);
>>   create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2)
>> include (c3, c4);
>> 
>> or for HEAD, just:
>>   create unique index ${t}unique_idx on $t using btree (c1, c2);
>> 
> 
> Do I understand correctly that you compare a unique index on (c1, c2)
> on master to a unique index on (c1, c2) include (c3, c4) on the
> patched version?
> If so, then I think it's wrong to speak of a down-side/slow-down of this
> patch based on this comparison.
> The patch *does not* cause a slowdown in this case.  The patch gives the
> user a *new option* which has its advantages and disadvantages.  And
> what you are comparing are the advantages and disadvantages of this
> option, not a slow-down of the patch.
> Only if you compare *the same* index on master and the patched version
> can you speak of a slow-down of the patch.

OK, I take your point -- you are right.  Although my measurement was (I 
think) correct, my comparison was not (as Teodor wrote, not quite 
'fair').

Sorry, I should have thought that message through better.  The somewhat 
longer time is indeed just a disadvantage of this new option, to be 
balanced against the advantages that are pretty clear too.


Erik Rijkers


Re: WIP: Covering + unique indexes.

From
Alvaro Herrera
Date:
I didn't like rel.h being included in itup.h.  Do you really need a
Relation as argument to index_truncate_tuple?  It looks to me like you
could pass the tupledesc instead; indnatts could be passed as a separate
argument instead of IndexRelationGetNumberOfAttributes.
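
Concretely, something like this (just a sketch of the suggested
signature; the parameter names are mine):

extern IndexTuple index_truncate_tuple(TupleDesc tupleDescriptor,
                                       IndexTuple olditup, int indnatts);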

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> I didn't like rel.h being included in itup.h.  Do you really need a
> Relation as argument to index_truncate_tuple?  It looks to me like you
> could pass the tupledesc instead; indnatts could be passed as a separate
> argument instead of IndexRelationGetNumberOfAttributes.
> 

Hm, okay, I understand why; I will fix it as you suggest

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> I didn't like rel.h being included in itup.h.  Do you really need a
> Relation as argument to index_truncate_tuple?  It looks to me like you
fixed


-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> On close look, bts_btentry.ip_posid is not used anymore, I change
> bts_btentry type to BlockNumber. As result, BTEntrySame() is removed.

That seems like a good idea.

> I'm not very happy with massive usage of
> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), suggest to  wrap it to
> macro something like this:
> #define BTreeInnerTupleGetDownLink(itup) \
>         ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))

Agreed. We do that with GIN.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thanks to everyone, pushed.




Peter Geoghegan wrote:
> On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> On close look, bts_btentry.ip_posid is not used anymore, I change
>> bts_btentry type to BlockNumber. As result, BTEntrySame() is removed.
> 
> That seems like a good idea.
> 
>> I'm not very happy with massive usage of
>> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)), suggest to  wrap it to
>> macro something like this:
>> #define BTreeInnerTupleGetDownLink(itup) \
>>          ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))
> 
> Agreed. We do that with GIN.
> 

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Sv: Re: WIP: Covering + unique indexes.

From
Andreas Joseph Krogh
Date:
On Saturday, 7 April 2018 at 22:02:08, Teodor Sigaev <teodor@sigaev.ru> wrote:
Thanks to everyone, pushed.
 
Rock!
 
--
Andreas Joseph Krogh
 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sat, Apr 7, 2018 at 1:02 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thanks to everyone, pushed.

I'll keep an eye on the buildfarm, since it's late in Russia.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> I'll keep an eye on the buildfarm, since it's late in Russia.

Thank you very much! It's now 23:10 MSK and I'll be able to follow for 
approximately an hour.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Andres Freund
Date:
On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:
> Thanks to everyone, pushed.

Marked CF entry as committed.

Greetings,

Andres Freund


Re: WIP: Covering + unique indexes.

From
Andrew Gierth
Date:
>>>>> "Teodor" == Teodor Sigaev <teodor@sigaev.ru> writes:

 >> I'll keep an eye on the buildfarm, since it's late in Russia.
 
 Teodor> Thank you very much! Now 23:10 MSK and I'll be able to follow
 Teodor> during approximately hour.

Support for testing amcaninclude via
pg_indexam_has_property(oid,'can_include') seems to be missing?

Also the return values of pg_index_column_has_property for included
columns seem a bit dubious... should probably be returning NULL for most
properties except 'returnable'.

I can look at fixing these for you if you like?

-- 
Andrew (irc:RhodiumToad)


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thank you. I looked at the buildfarm and completely forgot about the commitfest site

Andres Freund wrote:
> On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:
>> Thanks to everyone, pushed.
> 
> Marked CF entry as committed.
> 
> Greetings,
> 
> Andres Freund
> 

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sat, Apr 7, 2018 at 1:52 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Support for testing amcaninclude via
> pg_indexam_has_property(oid,'can_include') seems to be missing?
>
> Also the return values of pg_index_column_has_property for included
> columns seem a bit dubious... should probably be returning NULL for most
> properties except 'returnable'.
>
> I can look at fixing these for you if you like?

I'm happy to accept your help with it, for one.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> Support for testing amcaninclude via
> pg_indexam_has_property(oid,'can_include') seems to be missing?
> 
> Also the return values of pg_index_column_has_property for included
> columns seem a bit dubious... should probably be returning NULL for most
> properties except 'returnable'.
Damn, you're right, it's missing.

> I can look at fixing these for you if you like?

If you do that, I will be very grateful

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Andrew Gierth
Date:
>>>>> "Teodor" == Teodor Sigaev <teodor@sigaev.ru> writes:

 >> Support for testing amcaninclude via
 >> pg_indexam_has_property(oid,'can_include') seems to be missing?
 >> 
 >> Also the return values of pg_index_column_has_property for included
 >> columns seem a bit dubious... should probably be returning NULL for most
 >> properties except 'returnable'.
 
 Teodor> Damn, you're right, it's missing.

 >> I can look at fixing these for you if you like?

 Teodor> If you do that, I will be very grateful

OK, I will deal with it.

-- 
Andrew (irc:RhodiumToad)


Re: WIP: Covering + unique indexes.

From
Jeff Janes
Date:
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Thanks to everyone, pushed.


Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int   indnatts = tupleDescriptor->natts;
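
If indnatts is only used in an Assert there, one way to silence this
would be to mark it accordingly, e.g.:

  int   indnatts PG_USED_FOR_ASSERTS_ONLY = tupleDescriptor->natts;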

Cheers,

Jeff

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thank you, fixed

Jeff Janes wrote:
> On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev <teodor@sigaev.ru 
> <mailto:teodor@sigaev.ru>> wrote:
> 
>     Thanks to everyone, pushed.
> 
> 
> Indeed thanks, this will be a nice feature.
> 
> It is giving me a compiler warning on non-cassert builds using gcc 
> (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:
> 
> indextuple.c: In function 'index_truncate_tuple':
> indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
>    int   indnatts = tupleDescriptor->natts;
> 
> Cheers,
> 
> Jeff

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
>     Thanks to everyone, pushed.
> 
> 
> Indeed thanks, this will be a nice feature.
> 
> It is giving me a compiler warning on non-cassert builds using gcc 
> (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:
> 
> indextuple.c: In function 'index_truncate_tuple':
> indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
>    int   indnatts = tupleDescriptor->natts;

Thank you, fixed

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, fixed

I suggest that we remove some unneeded amcheck tests, as in the
attached patch. They don't seem to add anything.

-- 
Peter Geoghegan

Attachment

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thank you, pushed.


Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thank you, fixed
> 
> I suggest that we remove some unneeded amcheck tests, as in the
> attached patch. They don't seem to add anything.
> 

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


RE: WIP: Covering + unique indexes.

From
"Shinoda, Noriyoshi"
Date:
Hi,

I tested this feature and found that documentation is missing for the column added to the pg_constraint catalog.
The attached patch adds a description of the 'conincluding' column to the pg_constraint catalog documentation.

Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Teodor Sigaev [mailto:teodor@sigaev.ru]
Sent: Monday, April 9, 2018 3:20 PM
To: Peter Geoghegan <pg@bowt.ie>
Cc: Jeff Janes <jeff.janes@gmail.com>; Alexander Korotkov <a.korotkov@postgrespro.ru>; Anastasia Lubennikova
<a.lubennikova@postgrespro.ru>;PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org> 
Subject: Re: WIP: Covering + unique indexes.

Thank you, pushed.


Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thank you, fixed
>
> I suggest that we remove some unneeded amcheck tests, as in the
> attached patch. They don't seem to add anything.
>

--
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Attachment

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
Hi!

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com> wrote:
I tested this feature and found that documentation is missing for the column added to the pg_constraint catalog.
The attached patch adds a description of the 'conincluding' column to the pg_constraint catalog documentation.

Thank you for pointing this!
I think we need a more detailed explanation here.  My proposal is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Attachment

RE: WIP: Covering + unique indexes.

From
"Shinoda, Noriyoshi"
Date:

Hi!

 

Thank you for your response.

Your proposal looks good to me.

 

Regards,

Noriyoshi Shinoda

 

From: Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teodor@sigaev.ru>; Peter Geoghegan <pg@bowt.ie>; Jeff Janes <jeff.janes@gmail.com>; Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Subject: Re: WIP: Covering + unique indexes.

 

Hi!

 

On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com> wrote:

I tested this feature and found that documentation is missing for the column added to the pg_constraint catalog.
The attached patch adds a description of the 'conincluding' column to the pg_constraint catalog documentation.

 

Thank you for pointing this!

I think we need a more detailed explanation here.  My proposal is attached.

 

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thanks to both of you, pushed

Shinoda, Noriyoshi wrote:
> Hi!
> 
> Thank you for your response.
> 
> Your proposal looks good to me.
> 
> Regards,
> 
> Noriyoshi Shinoda
> 
> *From:*Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
> *Sent:* Monday, April 9, 2018 11:22 PM
> *To:* Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
> *Cc:* PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev 
> <teodor@sigaev.ru>; Peter Geoghegan <pg@bowt.ie>; Jeff Janes 
> <jeff.janes@gmail.com>; Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
> *Subject:* Re: WIP: Covering + unique indexes.
> 
> Hi!
> 
> On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com 
> <mailto:noriyoshi.shinoda@hpe.com>> wrote:
> 
>     I tested this feature and found that documentation is missing for the
>     column added to the pg_constraint catalog.
>     The attached patch adds a description of the 'conincluding' column to
>     the pg_constraint catalog documentation.
> 
> Thank you for pointing this!
> 
> I think we need a more detailed explanation here.  My proposal is attached.
> 
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
> The Russian Postgres Company
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Sun, Apr 8, 2018 at 11:19 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, pushed.

I noticed a few more issues following another pass-through of the patch:

* There is no pfree() within _bt_buildadd() for truncated tuples, even
though that's a context where it's clearly not okay.

* It might be a good idea to also pfree() the truncated tuple for most
other _bt_buildadd() callers. Even though it's arguably okay in other
cases, it seems worth being consistent about it (consistent with old
nbtree code).

* There should probably be some documentation around why it's okay
that we call index_truncate_tuple() with an exclusive buffer lock held
(during a page split). For example, there should probably be a comment
on the VARATT_IS_EXTERNAL() situation.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
to inner tuples, which might be worth doing something about (perhaps
just renaming the macro).

I do not have the time to write a patch right away, but I should be
able to post one in a few days. I want to avoid sending several small
patches.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> * There is no pfree() within _bt_buildadd() for truncated tuples, even
> though that's a context where it's clearly not okay.
Agreed.

> 
> * It might be a good idea to also pfree() the truncated tuple for most
> other _bt_buildadd() callers. Even though it's arguably okay in other
> cases, it seems worth being consistent about it (consistent with old
> nbtree code).
Hmm, I don't see any other calls to pfree() after that.

> * There should probably be some documentation around why it's okay
> that we call index_truncate_tuple() with an exclusive buffer lock held
> (during a page split). For example, there should probably be a comment
> on the VARATT_IS_EXTERNAL() situation.
I have no objection to improving the docs/comments.

> 
> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
> to inner tuples, which might be worth doing something about (perhaps
> just renaming the macro).
Which places look suspicious in your opinion?

> 
> I do not have the time to write a patch right away, but I should be
> able to post one in a few days. I want to avoid sending several small
> patches.
No problem, we can wait.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 9:03 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
>> to inner tuples, which might be worth doing something about (perhaps
>> just renaming the macro).
>
> What is suspicious place for you opinion?

_bt_mark_page_halfdead() looked like it had a problem, but it now
looks like I was wrong. I also verified every other
BTreeInnerTupleGetDownLink() caller. It now looks like everything is
good here.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong.

I did find another problem, though. Looks like the idea to represent
the number of attributes explicitly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() *
100000 from generate_series(0,100000) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"

Note that amcheck can detect the issue with the index after the fact, too:

pg@~[3711]=# select bt_index_check('cov_idx');
ERROR:  wrong number of index tuple attributes for index "cov_idx"
DETAIL:  Index tid=(3,2) natts=2 points to index tid=(2,92) page lsn=0/170DC88.

I don't think that the issue is complicated. Looks like we missed a
place that we have to truncate within _bt_split(), located directly
after this comment block:

    /*
     * If the page we're splitting is not the rightmost page at its level in
     * the tree, then the first entry on the page is the high key for the
     * page.  We need to copy that to the right half.  Otherwise (meaning the
     * rightmost page case), all the items on the right half will be user
     * data.
     */

I believe that the reason that we didn't find this bug prior to commit
is that we only have a single index tuple with the wrong number of
attributes after an initial root page split through insertions, but
the next root page split masks the problems. Not 100% sure that that's
why we missed it just yet, though.

This bug shouldn't be hard to fix. I'll take care of it as part of
that post-commit review patch I'm working on.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong. I also verified every other
> BTreeInnerTupleGetDownLink() caller. It now looks like everything is
> good here.

Right - it tries to find the right page by consulting the parent page, 
taking the next key.



-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I did find another problem, though. Looks like the idea to explicitly
> represent the number of attributes directly has paid off already:
>
> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
> create unique index cov_idx on covering_bug (f1) include(f2);
> insert into covering_bug select i, i * random() * 1000, i * random() *
> 100000 from generate_series(0,100000) i;
> DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
> CREATE TABLE
> DEBUG:  building index "cov_idx" on table "covering_bug" serially
> CREATE INDEX
> ERROR:  tuple has wrong number of attributes in index "cov_idx"

Actually, this was an error on my part (though I'd still maintain that
the check paid off here!). I'll still add defensive assertions inside
_bt_newroot(), and anywhere else that they're needed. There is no
reason to not add defensive assertions in all code that handles page
splits, and needs to fetch a highkey from some other page. We missed a
few of those.
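
To make that concrete, here is a minimal sketch of the kind of
defensive assertion I mean, for any split path that fetches a high key
from another page (names follow the thread's macros; this is
illustrative, not the actual patch):

    /* a high key fetched from a sibling must look sane before reuse */
    Assert(BTreeTupGetNAtts(hikey, rel) <=
           IndexRelationGetNumberOfAttributes(rel));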

I'll add an item to "Decisions to Recheck Mid-Beta" section of the
open items page for this patch. We should review the decision to make
a call to _bt_check_natts() within _bt_compare(). It might work just
as well as an assertion, and it would be unfortunate if workloads that
don't use covering indexes had to pay a price for the
_bt_check_natts() call, even if it was a small price. I've seen
_bt_compare() appear prominently in profiles quite a few times.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:

Peter Geoghegan wrote:
> On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> I did find another problem, though. Looks like the idea to explicitly
>> represent the number of attributes directly has paid off already:
>>
>> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
>> create unique index cov_idx on covering_bug (f1) include(f2);
>> insert into covering_bug select i, i * random() * 1000, i * random() *
>> 100000 from generate_series(0,100000) i;
>> DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
>> CREATE TABLE
>> DEBUG:  building index "cov_idx" on table "covering_bug" serially
>> CREATE INDEX
>> ERROR:  tuple has wrong number of attributes in index "cov_idx"
> 
> Actually, this was an error on my part (though I'd still maintain that
> the check paid off here!). I'll still add defensive assertions inside
> _bt_newroot(), and anywhere else that they're needed. There is no
> reason to not add defensive assertions in all code that handles page
> splits, and needs to fetch a highkey from some other page. We missed a
> few of those.
Agreed. I prefer to add more Asserts, maybe even more than actually 
needed. Assert-documented code :)

> 
> I'll add an item to "Decisions to Recheck Mid-Beta" section of the
> open items page for this patch. We should review the decision to make
> a call to _bt_check_natts() within _bt_compare(). It might work just
> as well as an assertion, and it would be unfortunate if workloads that
> don't use covering indexes had to pay a price for the
> _bt_check_natts() call, even if it was a small price. I've seen
> _bt_compare() appear prominently in profiles quite a few times.
> 

Could you show a patch?

I think we need to move _bt_check_natts() and its call under 
USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't 
pay for an unused feature.
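
A minimal sketch of what I mean, assuming a simplified call site in
_bt_compare() (illustrative only):

#ifdef USE_ASSERT_CHECKING
    if (!_bt_check_natts(rel, page, offnum))
        elog(ERROR, "tuple has wrong number of attributes in index \"%s\"",
             RelationGetRelationName(rel));
#endif

Or simply Assert(_bt_check_natts(rel, page, offnum)), which costs
nothing in release builds.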
-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Andrey Borodin
Date:
Hi!

> On 12 Apr 2018, at 21:21, Teodor Sigaev <teodor@sigaev.ru> wrote:

I was adapting the tests for a GiST covering index and found out that the REINDEX test is not really a REINDEX test...
I propose the following micropatch.
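
The substance of it, as a hypothetical sketch (the table and index
names here are invented, not the ones from the regression test):

CREATE TABLE reindex_tbl (c1 int, c2 int);
CREATE UNIQUE INDEX reindex_idx ON reindex_tbl (c1) INCLUDE (c2);
INSERT INTO reindex_tbl SELECT i, i FROM generate_series(1, 1000) i;
REINDEX INDEX reindex_idx;  -- the point: actually rebuild the index
SELECT c1, c2 FROM reindex_tbl WHERE c1 = 42;  -- still usable afterwards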

Best regards, Andrey Borodin.

Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Thu, Apr 12, 2018 at 9:21 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Agreed. I prefer to add more Asserts, maybe even more than actually
> needed. Assert-documented code :)

Absolutely. The danger with a feature like this is that we'll miss one
place. I suppose that you could say that I am in the Poul-Henning Kamp
camp on assertions [1].

>> I'll add an item to "Decisions to Recheck Mid-Beta" section of the
>> open items page for this patch. We should review the decision to make
>> a call to _bt_check_natts() within _bt_compare(). It might work just
>> as well as an assertion, and it would be unfortunate if workloads that
>> don't use covering indexes had to pay a price for the
>> _bt_check_natts() call, even if it was a small price. I've seen
>> _bt_compare() appear prominently in profiles quite a few times.
>>
>
> Could you show a patch?

Attached patch makes the changes that I talked about, and a few
others. The commit message has full details. The general direction of
the patch is that it documents our assumptions, and verifies them in
more cases. Most of the changes I've made are clear improvements,
though in a few cases I've made changes that are perhaps more
debatable. These other, more debatable cases are:

* The comments added to _bt_isequal() about suffix truncation may not
be to your taste. The same is true of the way that I restored the
previous _bt_isequal() function signature. (Yes, I want to change it
back despite the fact that I was the person that originally suggested
we change _bt_isequal().)

* I added BTreeTupSetNAtts() calls to a few places that don't truly
need them, such as the point where we generate a dummy 0-attribute
high key within _bt_mark_page_halfdead(). I think that we should try
to be as consistent as possible about using BTreeTupSetNAtts(), to set
a good example. I don't think it's necessary to use BTreeTupSetNAtts()
for pivot tuples when the number of key attributes matches indnatts
(it seems inconvenient to have to palloc() our own scratch buffer to
do this when we don't have to), but that doesn't apply to these
now-covered cases.
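
For context, a rough sketch of the representation these macros manage
(paraphrased from the patch; the details here are approximate):

/*
 * A pivot tuple's attribute count is stashed in the otherwise-unused
 * offset field of its t_tid, and a bit in t_info (INDEX_ALT_TID_MASK)
 * flags that this alternative interpretation is in use.
 */
#define BTreeTupSetNAtts(itup, n) \
    do { \
        (itup)->t_info |= INDEX_ALT_TID_MASK; \
        ItemPointerSetOffsetNumber(&(itup)->t_tid, (n)); \
    } while (0)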

I imagine that you'll have no problem with the other changes in the
patch, which is why I haven't mentioned them here. Let me know what
you think.

> I think we need to move _bt_check_natts() and its call under
> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
> pay for an unused feature.

I eventually decided that you were right about this, and made the
_bt_compare() call to _bt_check_natts() a simple assertion without
waiting to hear more opinions on the matter. Concurrency isn't a
factor here, so adding a check to standard release builds isn't
particularly likely to detect bugs. Besides, there is really only a
small number of places that need to do truncation for themselves. And,
if you want to be sure that the structure is consistent in the field,
there is always amcheck, which can check _bt_check_natts() (while also
checking other things that we care about just as much).

Note that I removed some dead code from _bt_insertonpg() that wasn't
added by the INCLUDE patch. It confused matters for this patch, since
we don't want to consider what's supposed to happen when there is a
retail insertion of a new, second negative infinity item -- clearly,
that should simply never happen (I thought about adding a
BTreeTupSetNAtts() call, but then decided to just remove the dead code
and add a new "can't happen" elog error). Finally, I made sure that we
don't drop all tables in the regression tests, so that we have some
pg_dump coverage for INCLUDE indexes, per a request from Tom.
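
In other words, something along these lines at the end of the test file
(hypothetical names, not the actual regression test):

-- left in place deliberately, so the pg_dump tests that run against the
-- regression database exercise dumping of INCLUDE indexes
CREATE TABLE index_includes_tbl (c1 int, c2 int, c3 int);
CREATE UNIQUE INDEX index_includes_idx
    ON index_includes_tbl (c1) INCLUDE (c2, c3);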

[1] https://queue.acm.org/detail.cfm?id=2220317
-- 
Peter Geoghegan

Attachment

Re: WIP: Covering + unique indexes.

From
Alexander Korotkov
Date:
On Mon, Apr 16, 2018 at 1:05 AM, Peter Geoghegan <pg@bowt.ie> wrote:
Attached patch makes the changes that I talked about, and a few
others. The commit message has full details. The general direction of
the patch is that it documents our assumptions, and verifies them in
more cases. Most of the changes I've made are clear improvements,
though in a few cases I've made changes that are perhaps more
debatable.

Great, thank you very much!
 
These other, more debatable cases are:

* The comments added to _bt_isequal() about suffix truncation may not
be to your taste. The same is true of the way that I restored the
previous _bt_isequal() function signature. (Yes, I want to change it
back despite the fact that I was the person that originally suggested
we change _bt_isequal().)
 
Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
argument instead of a relation?  It doesn't need the number of key
attributes anyway, only the total number of attributes.  Then
_bt_isequal() would be able to use BTreeTupGetNAtts().
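
A sketch of what I'm suggesting (hypothetical, not a patch):

/* take the descriptor directly; only the total attribute count matters */
#define BTreeTupGetNAtts(itup, tupdesc) \
    ( \
        ((itup)->t_info & INDEX_ALT_TID_MASK) ? \
        (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & \
         BT_N_KEYS_OFFSET_MASK) : \
        (tupdesc)->natts \
    )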

* I added BTreeTupSetNAtts() calls to a few places that don't truly
need them, such as the point where we generate a dummy 0-attribute
high key within _bt_mark_page_halfdead(). I think that we should try
to be as consistent as possible about using BTreeTupSetNAtts(), to set
a good example. I don't think it's necessary to use BTreeTupSetNAtts()
for pivot tuples when the number of key attributes matches indnatts
(it seems inconvenient to have to palloc() our own scratch buffer to
do this when we don't have to), but that doesn't apply to these
now-covered cases.

+1 

> I think we need to move _bt_check_natts() and its call under
> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
> pay for an unused feature.

I eventually decided that you were right about this, and made the
_bt_compare() call to _bt_check_natts() a simple assertion without
waiting to hear more opinions on the matter. Concurrency isn't a
factor here, so adding a check to standard release builds isn't
particularly likely to detect bugs. Besides, there is really only a
small number of places that need to do truncation for themselves. And,
if you want to be sure that the structure is consistent in the field,
there is always amcheck, which can check _bt_check_natts() (while also
checking other things that we care about just as much).

Good point; the risk of performance degradation caused by _bt_check_natts()
in _bt_compare() is high.  So, let's turn it into an assertion.

Note that I removed some dead code from _bt_insertonpg() that wasn't
added by the INCLUDE patch. It confused matters for this patch, since
we don't want to consider what's supposed to happen when there is a
retail insertion of a new, second negative infinity item -- clearly,
that should simply never happen (I thought about adding a
BTreeTupSetNAtts() call, but then decided to just remove the dead code
and add a new "can't happen" elog error).

I think it's completely OK to fix broken things when you have to touch
them.  Perhaps Teodor will decide to do that in a separate commit; it's
up to him.
 
Finally, I made sure that we
don't drop all tables in the regression tests, so that we have some
pg_dump coverage for INCLUDE indexes, per a request from Tom.

Makes sense, since that has already turned out to be broken.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
> argument instead of a relation?  It doesn't need the number of key
> attributes anyway, only the total number of attributes.  Then
> _bt_isequal() would be able to use BTreeTupGetNAtts().

That would make the BTreeTupGetNAtts() assertions quite a bit more
verbose, since there is usually no existing tuple descriptor variable,
but there is almost always a "rel" variable. The coverage within
_bt_isequal() does not seem important, because we only use it with the
page high key in rare cases, where _bt_moveright() will already have
tested the highkey.

> I think it's completely OK to fix broken things when you have to touch
> them.  Perhaps Teodor will decide to do that in a separate commit; it's
> up to him.

You're right to say that this old negative infinity tuple code within
_bt_insertonpg() is broken code, and not just dead code. The code
doesn't just confuse things (e.g. see recent commit 2a67d644). It also
seems like it could actually be harmful. This is code that could only
ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
I mostly agree with your patch, nice work, but I have some notes on it:

1)
bt_target_page_check():
     if (!P_RIGHTMOST(topaque) &&
         !_bt_check_natts(state->rel, state->target, P_HIKEY))

This is not very obvious: it looks like we don't need to check nattrs on the 
rightmost page. Okay, I remember that the rightmost page has no hikey at all, 
but at least a comment should be added. bt_target_page_check() already 
implicitly takes 'is the page rightmost or not?' into account by using 
P_FIRSTDATAKEY, so it may be better to move the rightmost check into 
bt_target_page_check(), with some refactoring of the if-logic:

if (offnum > maxoff)
    return true; /* nothing to check; also covers an empty rightmost page */

if (P_ISLEAF)
{
    if (offnum >= P_FIRSTDATAKEY)
        ...
    else /* if (offnum == P_HIKEY) */
        ...
}
else /* !P_ISLEAF */
{
    if (offnum == P_FIRSTDATAKEY)
        ...
    else if (offnum > P_FIRSTDATAKEY)
        ...
    else /* if (offnum == P_HIKEY) */
        ...
}
    
I see that only three nattrs values are possible: 0, nkeys, and 
nkeys+nincluded, but collapsing the if-clause into three branches makes life 
difficult for code readers. Let the compiler optimize that. Sorry for the late 
notice, but it only caught my attention when I saw the 
(!P_RIGHTMOST && !_bt_check_natts) condition.

2)
Style notice:
         ItemPointerSetInvalid(&trunctuple.t_tid);
+   BTreeTupSetNAtts(&trunctuple, 0);
     if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
It's better to have a blank line between the BTreeTupSetNAtts() call and the if clause.

3) Naming of BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the 
full word Tuple in the downlink macros, but here we use just Tup. It seems 
better to have Tuple in both cases. Or Tup, but still the same in both cases.

4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits into the 
BT_N_KEYS_OFFSET_MASK mask, and it should not touch the BT_RESERVED_OFFSET_MASK 
bits; right now it overwrites those bits.
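
A sketch of the check I have in mind (illustrative; the exact form is
up to you):

#define BTreeTupSetNAtts(itup, n) \
    do { \
        Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
        (itup)->t_info |= INDEX_ALT_TID_MASK; \
        ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
    } while (0)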

The attached patch is rebased onto current HEAD and contains a comment 
improvement in index_truncate_tuple() - you save some memory with the 
TupleDescCopy() call but didn't explain why pfree() is enough to free all 
allocated memory.




Peter Geoghegan wrote:
> On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
>> argument instead of a relation?  It doesn't need the number of key
>> attributes anyway, only the total number of attributes.  Then
>> _bt_isequal() would be able to use BTreeTupGetNAtts().
> 
> That would make the BTreeTupGetNAtts() assertions quite a bit more
> verbose, since there is usually no existing tuple descriptor variable,
> but there is almost always a "rel" variable. The coverage within
> _bt_isequal() does not seem important, because we only use it with the
> page high key in rare cases, where _bt_moveright() will already have
> tested the highkey.
> 
>> I think it's completely OK to fix broken things when you have to touch
>> them.  Perhaps Teodor will decide to do that in a separate commit; it's
>> up to him.
> 
> You're right to say that this old negative infinity tuple code within
> _bt_insertonpg() is broken code, and not just dead code. The code
> doesn't just confuse things (e.g. see recent commit 2a67d644). It also
> seems like it could actually be harmful. This is code that could only
> ever corrupt your database.
> 
> I'm fine if Teodor wants to commit that change separately, of course.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 18, 2018 at 10:10 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I mostly agree with your patch, nice work, but I have some notes on
> it:

Thanks.

> 1)
> bt_target_page_check():
>     if (!P_RIGHTMOST(topaque) &&
>         !_bt_check_natts(state->rel, state->target, P_HIKEY))
>
> This is not very obvious: it looks like we don't need to check nattrs on
> the rightmost page. Okay, I remember that the rightmost page has no hikey
> at all, but at least a comment should be added. bt_target_page_check()
> already implicitly takes 'is the page rightmost or not?' into account by
> using P_FIRSTDATAKEY, so it may be better to move the rightmost check into
> bt_target_page_check(), with some refactoring of the if-logic:

I don't understand. We do check the number of attributes on rightmost
pages, but we do so separately, in the main loop. For every item that
isn't the high key.

This code appears before the main bt_target_page_check() loop because
we're checking the high key itself, on its own, which is a new thing.
The high key is also involved in the loop (on non-rightmost pages),
but that's only because we check real items *against* the high key (we
assume the high key is good and that the item might be bad). The high
key is involved in every iteration of the main loop (on non-rightmost
pages), rather than getting its own loop.

That said, I am quite happy if you want to put a comment about this
being the rightmost page at the beginning of the check.

> 2)
> Style notice:
>         ItemPointerSetInvalid(&trunctuple.t_tid);
> +   BTreeTupSetNAtts(&trunctuple, 0);
>     if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData),
> P_HIKEY,
> It's better to have a blank line between the BTreeTupSetNAtts() call and
> the if clause.

Sure.

> 3) Naming of BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use
> the full word Tuple in the downlink macros, but here we use just Tup. It
> seems better to have Tuple in both cases. Or Tup, but still the same in
> both cases.

+1

> 4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits into
> the BT_N_KEYS_OFFSET_MASK mask, and it should not touch the
> BT_RESERVED_OFFSET_MASK bits; right now it overwrites those bits.

An assertion sounds like it would be an improvement, though I don't
see that in the patch you posted.

> The attached patch is rebased onto current HEAD and contains a comment
> improvement in index_truncate_tuple() - you save some memory with the
> TupleDescCopy() call but didn't explain why pfree() is enough to free all
> allocated memory.

Makes sense.

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
> I don't understand. We do check the number of attributes on rightmost
> pages, but we do so separately, in the main loop. For every item that
> isn't the high key.
Comment added, please verify. I also refactored _bt_check_natts(); I hope 
it's a bit more readable now.

>> 4) BTreeTupSetNAtts - seems, it's better to add check  of nattrs to fits  to
>> BT_N_KEYS_OFFSET_MASK  mask, and it should not touch BT_RESERVED_OFFSET_MASK
>> bits, now it will overwrite that bits.
> 
> An assertion sounds like it would be an improvement, though I don't
> see that in the patch you posted.
I didn't do that in v1; sorry, I was unclear. The attached patch contains 
all the changes suggested in my previous email.

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/

Attachment

Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 18, 2018 at 1:32 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> I don't understand. We do check the number of attributes on rightmost
>> pages, but we do so separately, in the main loop. For every item that
>> isn't the high key.
>
> Comment added, please verify. I also refactored _bt_check_natts(); I hope
> it's a bit more readable now.

The new comment looks good.

Now I understand what you meant about _bt_check_natts(). And, I agree
that this is an improvement -- the extra verbosity is worth it.

> I didn't do that in v1; sorry, I was unclear. The attached patch contains
> all the changes suggested in my previous email.

The new BTreeTupSetNAtts() assertion looks good to me.

I suggest committing this patch as-is.

Thank you
-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 18, 2018 at 1:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I suggest committing this patch as-is.

Actually, I see one tiny issue with extra '*' characters here:

> +            * The number of attributes won't be explicitly represented if the
> +            * negative infinity tuple was generated during a page split that
> +            * occurred with a version of Postgres before v11.  There must be a
> +            * problem when there is an explicit representation that is
> +            * non-zero, * or when there is no explicit representation and the
> +            * tuple is * evidently not a pre-pg_upgrade tuple.

I also suggest fixing this indentation before commit:

> +   /*
> +    *Cannot leak memory here, TupleDescCopy() doesn't allocate any
> +    * inner structure, so, plain pfree() should clean all allocated memory
> +    */

-- 
Peter Geoghegan


Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thank you, pushed.

> Actually, I see one tiny issue with extra '*' characters here:
> 
>> +            * The number of attributes won't be explicitly represented if the
>> +            * negative infinity tuple was generated during a page split that
>> +            * occurred with a version of Postgres before v11.  There must be a
>> +            * problem when there is an explicit representation that is
>> +            * non-zero, * or when there is no explicit representation and the
>> +            * tuple is * evidently not a pre-pg_upgrade tuple.
> 
> I also suggest fixing this indentation before commit:
> 
>> +   /*
>> +    *Cannot leak memory here, TupleDescCopy() doesn't allocate any
>> +    * inner structure, so, plain pfree() should clean all allocated memory
>> +    */

fixed

-- 
Teodor Sigaev                      E-mail: teodor@sigaev.ru
                                       WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Peter Geoghegan
Date:
On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, pushed.

Thanks.

I saw another preexisting issue, this time one that has been around
since 2007. Commit bc292937 forgot to remove a comment above
_bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
bottom of _bt_findinsertloc(), where it remains today). The attached
patch fixes this, and in passing mentions the fact that
_bt_insertonpg() only performs retail insertions, and specifically
never inserts high key items.

I don't think it's necessary to add something about negative infinity
items to the same comment block. While it's true that _bt_insertonpg()
cannot truncate downlinks to make new minus infinity items, I see that
as a narrower issue.

-- 
Peter Geoghegan

Attachment

Re: WIP: Covering + unique indexes.

From
Teodor Sigaev
Date:
Thank you, pushed

Peter Geoghegan wrote:
> On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thank you, pushed.
> 
> Thanks.
> 
> I saw another preexisting issue, this time one that has been around
> since 2007. Commit bc292937 forgot to remove a comment above
> _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
> bottom of _bt_findinsertloc(), where it remains today). The attached
> patch fixes this, and in passing mentions the fact that
> _bt_insertonpg() only performs retail insertions, and specifically
> never inserts high key items.
> 
> I don't think it's necessary to add something about negative infinity
> items to the same comment block. While it's true that _bt_insertonpg()
> cannot truncate downlinks to make new minus infinity items, I see that
> as a narrower issue.
> 

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: WIP: Covering + unique indexes.

From
Alvaro Herrera
Date:
I'm wondering what the genesis of this coninclude column actually is.
As far as I can tell, the only reason this column is there is to be
able to print the INCLUDE clause in a UNIQUE/PK constraint in ruleutils
... but surely the same list can be obtained from pg_index.indkey
instead?
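
For instance, something along these lines (an untested sketch; it
assumes an index named newidx) seems like it could recover the included
columns without any coninclude:

SELECT a.attname
FROM pg_index i,
     LATERAL unnest(i.indkey) WITH ORDINALITY AS k(attnum, ord),
     pg_attribute a
WHERE i.indexrelid = 'newidx'::regclass
  AND a.attrelid = i.indrelid
  AND a.attnum = k.attnum
  AND k.ord > i.indnkeyatts
ORDER BY k.ord;

The included columns are just the trailing indkey entries past
indnkeyatts.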

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services