Thread: WIP: Covering + unique indexes.
I'm working on a patch that allows combining covering and unique functionality for btree indexes.
Previous discussion was here:
1) Proposal thread
2) Message with proposal clarification
In a nutshell, the feature allows creating an index with "key" columns and "included" columns.
"key" columns can be used as scan keys. The unique constraint applies only to "key" columns.
"included" columns may be used as scan keys if they have a suitable opclass.
Both "key" and "included" columns can be returned from the index by an IndexOnlyScan.
Btree is the default index type and is used everywhere, so the patch requires proper testing. Volunteers are welcome)
Use case:
- We have a table (c1, c2, c3, c4);
- We need to have a unique index on (c1, c2).
- We would like to have a covering index on all columns to avoid reading heap pages.
Old way:
CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2);
CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4);
What's wrong?
The two indexes contain repeated data, which adds overhead to data manipulation operations and increases database size.
New way:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);
The patch is attached.
In 'test.sql' you can find a test with detailed comments on each step, and comparison of old and new indexes.
The new feature has the following syntax:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4);
The INCLUDING keyword defines the "included" columns of the index. These columns are not subject to the unique constraint.
They are also not stored in the index's inner pages, which allows the index size to be decreased.
Results:
1) Additional covering index is not required anymore.
2) The new index can use an IndexOnlyScan for queries where the old index can't.
For example,
explain analyze select c1, c2 from newt where c1<10000 and c3<20;
*more examples in 'test.sql'
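As a rough way to see the space savings, one can compare the combined size of the two old indexes with the single new one (a sketch using the example names above; exact numbers will of course vary):

select pg_size_pretty(pg_relation_size('olduniqueidx') + pg_relation_size('oldcoveringidx')) as old_total,
       pg_size_pretty(pg_relation_size('newidx')) as new_size;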
Future work:
Make opclasses for "included" columns optional.
CREATE TABLE tbl (c1 int, c4 box);
CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4);
If we don't need c4 as an index scankey, we don't need any btree opclass on it.
But we still want to have it in the covering index for queries like
SELECT c4 FROM tbl WHERE c1=1000;
SELECT * FROM tbl WHERE c1=1000;
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 8 October 2015 at 16:18, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> [full proposal quoted above, snipped]
The definition output needs a space after "INCLUDING":

# SELECT pg_get_indexdef('people_first_name_last_name_email_idx'::regclass::oid);
 pg_get_indexdef
--------------------------------------------------------------------------------
 CREATE UNIQUE INDEX people_first_name_last_name_email_idx ON people USING btree (first_name, last_name) INCLUDING(email)
(1 row)

There is also no collation output:

# CREATE UNIQUE INDEX test_idx ON people (first_name COLLATE "en_GB", last_name) INCLUDING (email COLLATE "pl_PL");
CREATE INDEX

# SELECT pg_get_indexdef('test_idx'::regclass::oid);
 pg_get_indexdef
--------------------------------------------------------------------------------
 CREATE UNIQUE INDEX test_idx ON people USING btree (first_name COLLATE "en_GB", last_name) INCLUDING(email)
(1 row)

As for functioning, it works as described:

# EXPLAIN SELECT email FROM people WHERE (first_name, last_name) = ('Paul', 'Freeman');
 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Only Scan using people_first_name_last_name_email_idx on people (cost=0.28..1.40 rows=1 width=21)
   Index Cond: ((first_name = 'Paul'::text) AND (last_name = 'Freeman'::text))
(2 rows)

Typo:

"included columns must not intersects with key columns"

should be:

"included columns must not intersect with key columns"

One thing I've noticed you can do with your patch, which you haven't mentioned, is have a non-unique covering index:

# CREATE INDEX covering_idx ON people (first_name) INCLUDING (last_name);
CREATE INDEX

# EXPLAIN SELECT first_name, last_name FROM people WHERE first_name = 'Paul';
 QUERY PLAN
---------------------------------------------------------------------------------
 Index Only Scan using covering_idx on people (cost=0.28..1.44 rows=4 width=13)
   Index Cond: (first_name = 'Paul'::text)
(2 rows)

But this appears to behave as if it were a regular multi-column index, in that it will use the index for ordering rather than sort after fetching from the index. So is this really stored the same as a multi-column index? The index sizes aren't identical, so something is different.

Thom
08.10.2015 19:31, Thom Brown wrote:
> [quoted proposal snipped]
> The definition output needs a space after "INCLUDING": [...]
> Typo: "included columns must not intersects with key columns" should be: "included columns must not intersect with key columns"

Thank you for testing. The mentioned issues are fixed.

> One thing I've noticed you can do with your patch, which you haven't mentioned, is have a non-unique covering index. [...] But this appears to behave as if it were a regular multi-column index, in that it will use the index for ordering rather than sort after fetching from the index. So is this really stored the same as a multi-column index? The index sizes aren't identical, so something is different.

Yes, it behaves as a regular multi-column index. The index sizes differ because included attributes are not stored in the index's inner pages, which reduces the index size. I'm not sure whether this hurts search speed, but I assume we never search on included columns without a clause on the key columns, so rechecking included attributes on leaf pages should not be too costly.

Furthermore, this is the first step of the work on "optional opclasses for included columns": if an attribute has no opclass, we certainly don't need to store it in inner index pages.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Finally, the completed patch "covering_unique_3.0.patch" is here.
It includes the functionality discussed above in the thread, regression tests and a docs update.
I think it's quite ready for review.
Future work:
Besides that, I'd like to get feedback on the attached patch "optional_opclass_3.0.patch".
It should be applied on top of "covering_unique_3.0.patch".
Actually, this patch is the first step toward making opclasses for "included" columns optional
and implementing real covering indexing.
Example:
CREATE TABLE tbl (c1 int, c4 box);
CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4);
If we don't need c4 as an index scankey, we don't need any btree opclass on it.
But we still want to have it in covering index for queries like
SELECT c4 FROM tbl WHERE c1=1000; -- uses an IndexOnlyScan
SELECT * FROM tbl WHERE c1=1000; -- uses an IndexOnlyScan
The patch "optional_opclass" completely ignores opclasses of included attributes.
To see the difference, look at the explain analyze output:
explain analyze select * from tbl where c1=2 and c4 && box '(0,0,1,1)';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Index Only Scan using idx on tbl (cost=0.13..4.15 rows=1 width=36) (actual time=0.010..0.013 rows=1 loops=1)
Index Cond: (c1 = 2)
Filter: (c4 && '(1,1),(0,0)'::box)
"Index Cond" shows the index ScanKey conditions and "Filter" is for conditions which are used after index scan. Anyway it is faster than SeqScan that we had before, because IndexOnlyScan avoids extra heap fetches.
As I already said, this patch is just WIP, so the included columns' opclasses are not yet "optional" but actually "ignored".
The following example therefore works worse than without the patch; please don't worry about that for now.
CREATE TABLE tbl2 (c1 int, c2 int);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2);
explain analyze select * from tbl2 where c1<20 and c2<5;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using idx2 on tbl2 (cost=0.28..4.68 rows=10 width=8) (actual time=0.055..0.066 rows=9 loops=1)
Index Cond: (c1 < 20)
Filter: (c2 < 5)
The question is more about suitable syntax.
We have two different optimizations here:
1. INCLUDED columns
2. Optional opclasses
It's logical to provide optional opclasses only for included columns.
Is it OK to handle both with the same syntax and resolve all opclass conflicts at CREATE INDEX time?
CREATE TABLE tbl2 (c1 int, c2 int, c4 box);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2, c4);
CREATE UNIQUE INDEX idx3 ON tbl2 USING btree (c1) INCLUDING (c4, c2);
Of course, the order of attributes is important.
Attributes which have an opclass and should use it in a ScanKey must be placed before the others.
idx2 will use c2 in the Index Cond, while idx3 will not (see the sketch below). But I think that's the DBA's job.
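A sketch of the expected difference (illustrative, not actual output from the patch):

explain select * from tbl2 where c1 < 100 and c2 < 10;
-- with idx2 on (c1) INCLUDING (c2, c4), one would expect:
--   Index Cond: ((c1 < 100) AND (c2 < 10))
-- with idx3 on (c1) INCLUDING (c4, c2), c2 comes after a column without a
-- usable opclass, so one would expect instead:
--   Index Cond: (c1 < 100)
--   Filter: (c2 < 10)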
If you see any related changes needed in the planner, please mention them. I haven't explored that part of the code yet and could have missed something.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Dec 1, 2015 at 7:53 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> If we don't need c4 as an index scankey, we don't need any btree opclass on it.
> But we still want to have it in the covering index for queries like
>
> SELECT c4 FROM tbl WHERE c1=1000; -- uses an IndexOnlyScan
> SELECT * FROM tbl WHERE c1=1000; -- uses an IndexOnlyScan
>
> The patch "optional_opclass" completely ignores opclasses of included attributes.

OK, I don't get it. Why have an opclass here at all, even optionally?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
03.12.2015 04:03, Robert Haas wrote:
> OK, I don't get it. Why have an opclass here at all, even optionally?

We have no opclass on c4, and there's no need for one. But currently (without the patch) it's impossible to create a covering index containing columns that have no btree opclass:

test=# create index on tbl using btree (c1, c4);
ERROR: data type box has no default operator class for access method "btree"

ComputeIndexAttrs() processes the list of index attributes and tries to get an opclass for each of them via GetIndexOpClass(). The patch drops this check for included attributes, which makes it possible to store any datatype in a btree index and use the advantages of IndexOnlyScan.

I hope that this helps to clarify.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
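A rough sketch of the idea (simplified and hypothetical, not the actual patch hunk; variable names only approximate the real ComputeIndexAttrs() code): resolve an opclass for key columns only, and leave included columns without one.

    /* Illustrative sketch: skip the opclass lookup for included columns. */
    for (attn = 0; attn < numattrs; attn++)
    {
        IndexElem  *attribute = (IndexElem *) list_nth(allIndexParams, attn);

        if (attn < numkeyattrs)
            classOidP[attn] = GetIndexOpClass(attribute->opclass, atttype,
                                              accessMethodName, accessMethodId);
        else
            classOidP[attn] = InvalidOid;   /* included column: no opclass needed */
    }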
On Tue, Dec 1, 2015 at 4:53 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Finally, completed patch "covering_unique_3.0.patch" is here.
> [...]
> I think it's quite ready for review.

Thanks for the patch. I get a compiler warning when building it on gcc (SUSE Linux) 4.8.1 20130909 [gcc-4_8-branch revision 202388]:

nbtinsert.c: In function '_bt_check_unique':
nbtinsert.c:256:2: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
  SnapshotData SnapshotDirty;
  ^

And the dblink contrib module fails its make check.

I'm trying to find a good test case for it. Unfortunately, in most of my natural use cases the inclusion of the extra column causes the updates to become non-HOT, which causes more problems than it solves.

Cheers, Jeff
On Sat, Dec 26, 2015 at 5:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > > And the dblink contrib module fails its make check. Ignore the dblink complaint. It seems to have been some wonky build issue that is not reproducible.
On 2 December 2015 at 01:53, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Finally, completed patch "covering_unique_3.0.patch" is here. [...]
> I think it's quite ready for review.
"An optional <literal>INCLUDING</> clause allows a list of columns to be specified which will be included in the index, in the non-key portion of the index. Columns which are part of this clause cannot also exist in the indexed columns portion of the index, and vice versa. The <literal>INCLUDING</> columns exist solely to allow more queries to benefit from <firstterm>index only scans</> by including certain columns in the index, the value of which would otherwise have to be obtained by reading the table's heap. Having these columns in the <literal>INCLUDING</> clause in some cases allows <productname>PostgreSQL</> to skip the heap read completely. This also allows <literal>UNIQUE</> indexes to be defined on one set of columns, which can include another set of column in the <literal>INCLUDING</> clause, on which the uniqueness is not enforced upon. This can also be useful for non-unique indexes as any columns which are not required for the searching or ordering of records can defined in the <literal>INCLUDING</> clause, which can often reduce the size of the index."
Maybe not perfect, but maybe it's an improvement?
I've not tested the patch yet. I will send another email soon with the results of that.
On Tue, Jan 5, 2016 at 11:55 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 4 January 2016 at 21:49, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I've not tested the patch yet. I will send another email soon with the results of that.
>
> As promised I've done some testing on this, and I've found something which is not quite right:
>
> create table ab (a int,b int);
> insert into ab select x,y from generate_series(1,20) x(x), generate_series(10,1,-1) y(y);
> create index on ab (a) including (b);
> explain select * from ab order by a,b;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Sort  (cost=10.64..11.14 rows=200 width=8)
>    Sort Key: a, b
>    ->  Seq Scan on ab  (cost=0.00..3.00 rows=200 width=8)
> (3 rows)

If you set enable_sort=off, then you get the index-only scan with no sort. So it believes the index can be used for ordering (correctly, I think), just sometimes it thinks it is not faster to do it that way.

I'm not sure why this would be a correctness problem. The covered column does not participate in uniqueness checks, but it still usually participates in index ordering. (That is why dummy op-classes are needed if you want to include non-sortable-type columns as being covered.)

> This is what I'd expect
>
> truncate table ab;
> insert into ab select x,y from generate_series(1,20) x(x), generate_series(10,1,-1) y(y);
> explain select * from ab order by a,b;
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Index Only Scan using ab_a_b_idx on ab  (cost=0.15..66.87 rows=2260 width=8)
> (1 row)
>
> This index, as we've defined it, should not be able to satisfy the query's order by, although it does give correct results. That's because the index seems to be built wrongly in cases where the rows are added after the index exists.

I think this just causes differences in planner statistics leading to different plans. ANALYZE the table and it goes back to doing the sort, because it thinks the sort is faster.

Cheers, Jeff
> On 2 December 2015 at 01:53, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> Finally, completed patch "covering_unique_3.0.patch" is here. [...]
>
> Hi Anastasia,
>
> I've maybe mentioned before that I think this is a great feature and I think it will be very useful to have, so I've signed up to review the patch, and below are the results of my first pass from reading the code. Apologies if some of the things seem like nitpicks; I've basically just listed everything I've noticed, no matter how small.
First of all, I would like to thank you for writing such a detailed review.
All the style problems, comments and typos you mentioned are fixed in patch v4.0.
> + An access method that supports this feature sets <structname>pg_am</>.<structfield>amcanincluding</> true.
>
> I don't think this belongs under the "Index Uniqueness Checks" title. I think the "Columns included with clause INCLUDING aren't used to enforce uniqueness." that you've added before it is a good idea, but perhaps the details of amcanincluding are best explained elsewhere.

Agree.
> + This clause specifies additional columns to be appended to the set of index columns.
> + Included columns don't support any constraints <literal>(UNIQUE, PRMARY KEY, EXCLUSION CONSTRAINT)</>.
> + These columns can improve the performance of some queries through using advantages of index-only scan
> + (Or so called <firstterm>covering</firstterm> indexes. Covering index is the index that
> + covers all columns required in the query and prevents a table access).
> + Besides that, included attributes are not stored in index inner pages.
> + It allows to decrease index size and furthermore it provides a way to extend included
> + columns to store atttributes without suitable opclass (not implemented yet).
> + This clause could be applied to both unique and nonunique indexes.
> + It's possible to have non-unique covering index, which behaves as a regular
> + multi-column index with a bit smaller index-size.
> + Currently, only the B-tree access method supports this feature.
>
> "PRMARY KEY" should be "PRIMARY KEY". I ended up rewriting this paragraph as follows.
>
> [the rewritten paragraph is quoted in full earlier in the thread]
>
> Maybe not perfect, but maybe it's an improvement?

Yes, this explanation is much better. I've just added a couple of notes.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
> On 7 January 2016 at 06:36, Jeff Janes <jeff.janes@gmail.com> wrote:
>> If you set enable_sort=off, then you get the index-only scan with no sort. So it believes the index can be used for ordering (correctly, I think), just sometimes it thinks it is not faster to do it that way.
>>
>> I'm not sure why this would be a correctness problem. The covered column does not participate in uniqueness checks, but it still usually participates in index ordering. (That is why dummy op-classes are needed if you want to include non-sortable-type columns as being covered.)
>
> If that's the case, then it appears that I've misunderstood INCLUDING. From reading _bt_doinsert() it appeared that it'll ignore the INCLUDING columns and just find the insert position based on the key columns. Yet that's not the way that it appears to work. I was also a bit confused, as from working with another database which has very similar syntax to this, that one only includes the columns to allow index only scans, and the included columns are not indexed, therefore can't be part of index quals and the index only provides a sorted path for the indexed columns, and not the included columns.
Thank you for testing this properly. The ORDER BY clause in this case definitely doesn't work as expected.
The problem is fixed by patching the planner function build_index_pathkeys(): it disables use of the index when sorting by included columns is required (see the sketch below).
The test example works correctly now - it always performs a seq scan and sort.
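Sketched, the shape of that fix looks roughly like this (the field and helper names here are illustrative, not the actual diff): build_index_pathkeys() must stop generating pathkeys once it runs past the key columns, since included columns don't order the index.

    /* Illustrative sketch: only key columns may contribute pathkeys. */
    for (i = 0; i < index->nkeycolumns; i++)    /* previously looped over all columns */
    {
        PathKey *cpathkey = make_pathkey_for_index_column(root, index, i, scandir);

        if (cpathkey == NULL)
            break;      /* this and later columns provide no useful ordering */
        retval = lappend(retval, cpathkey);
    }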
> Saying that, I'm now a bit confused to why the following does not produce 2 indexes which are the same size:
>
> create table t1 (a int, b text);
> insert into t1 select x,md5(random()::text) from generate_series(1,1000000) x(x);
> create index t1_a_inc_b_idx on t1 (a) including (b);
> create index t1_a_b_idx on t1 (a,b);
> select pg_relation_size('t1_a_b_idx'), pg_relation_size('t1_a_inc_b_idx');
>  pg_relation_size | pg_relation_size
> ------------------+------------------
>          59064320 |         58744832
> (1 row)
I suppose you've already found the answer in the discussion above: included columns are stored only in the leaf pages of the index. The difference is the size of the 'b' attributes that are stored in the inner pages of index "t1_a_b_idx".
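One way to observe this directly is the contrib/pageinspect extension (illustrative; block numbers vary, and block 1 is usually, though not always, a leaf page):

create extension if not exists pageinspect;
select * from bt_metap('t1_a_inc_b_idx');  -- note the "root" block number
select avg(itemlen) from bt_page_items('t1_a_inc_b_idx', 1);  -- tuples on a leaf page
-- Inspecting the root block reported by bt_metap should show narrower tuples,
-- since the included attribute 'b' is absent from inner pages.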
> Also, if we want INCLUDING() to mean "uniqueness is not enforced on these columns, but they're still in the index", then I don't really think allowing types without a btree opclass is a good idea. It's likely too surprise-filled and might not be what the user actually wants. I'd suggest that these non-indexed columns would be better defined by further expanding the syntax, the first (perhaps not very good) thing that comes to mind is:
>
> create unique index idx_name on table (unique_col) also index (other,idx,cols) including (leaf,onlycols);
>
> Looking up thread, I don't think I was the first to be confused by this.
Included columns are still physically present in the index - they are stored in the index relation. But they are not indexed in the true sense of the word: it's impossible to use them for an index scan or for ordering. At the beginning, my idea was that included columns would serve to combine a unique index on some columns with a covering index on others. In very rare instances one might prefer a non-unique index with included columns, like "t1_a_inc_b_idx", to a regular multicolumn index like "t1_a_b_idx". Frankly, I didn't see such use cases at all: the index size reduction is not considerable, while we lose some useful index functionality on the included column. I think this should be mentioned as a note in the documentation, but I need help phrasing it clearly.
But now I see a reason to create a non-unique index with included columns: the lack of a suitable opclass on column "b".
It's impossible to add such a column to the index as a key column, but that's not a problem with the INCLUDING clause.
Look at this example:
create table tbl (a int, b box);
create index on tbl (a,b);
ERROR: data type box has no default operator class for access method "btree"
HINT: You must specify an operator class for the index or define a default operator class for the data type.
create index on tbl (a) including (b);
CREATE INDEX
This functionality is provided by the attached patch "omit_opclass_4.0", which must be applied over covering_unique_4.0.patch.
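A hedged illustration of what this enables (assuming the tbl(a int, b box) definition above and the default index name): the box column can be returned by an index-only scan even though it has no btree opclass.

explain select b from tbl where a = 1;
-- expected plan shape (illustrative, not actual output):
--   Index Only Scan using tbl_a_b_idx on tbl
--     Index Cond: (a = 1)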
I see what you were confused about; I had the same question at the very beginning of the discussion of this patch.
Now it seems a bit clearer to me. INCLUDING columns are not used for searching or ordering of records, so there is no need to check whether they have an opclass. With that, INCLUDING columns perform as expected, which agrees with other databases' behavior, and this patch is complete.
But it definitely isn't perfect... I found a test case that demonstrates the problem. See below.
That's why we need the optional_opclass functionality, which will use an opclass where possible and omit it in other cases.
This idea has already been described in the message "Re: [PROPOSAL] Covering + unique indexes" as a "partially unique index".
I suggest separating out the optional_opclass task to ease the syntax discussion and the following review. I'll implement it in the next patch a bit later.
Test case:
1) patch covering_unique_4.0 + test_covering_unique_4.0
If the included columns' opclasses are used, the new query plan is the same as the old one
and has nearly the same execution time:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using oldcoveringidx on oldt (cost=0.43..301.72 rows=1 width=8) (actual time=0.021..0.676 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.101 ms
Execution time: 0.697 ms
(5 rows)
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Only Scan using newidx on newt (cost=0.43..276.51 rows=1 width=8) (actual time=0.020..0.665 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.082 ms
Execution time: 0.687 ms
(5 rows)
2) patch covering_unique_4.0 + patch omit_opclass_4.0 + test_covering_unique_4.0
With omit_opclass applied, the new query cannot use the included column in the Index Cond and uses a Filter instead. This slows down the query significantly.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using oldcoveringidx on oldt (cost=0.43..230.39 rows=1 width=8) (actual time=0.021..0.722 rows=6 loops=1)
Index Cond: ((c1 < 10000) AND (c3 < 20))
Heap Fetches: 0
Planning time: 0.091 ms
Execution time: 0.744 ms
(5 rows)
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Only Scan using newidx on newt (cost=0.43..374.68 rows=1 width=8) (actual time=0.018..2.595 rows=6 loops=1)
Index Cond: (c1 < 10000)
Filter: (c3 < 20)
Rows Removed by Filter: 9993
Heap Fetches: 0
Planning time: 0.078 ms
Execution time: 2.612 ms
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Jan 12, 2016 at 8:59 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> But now I see a reason to create a non-unique index with included columns: the lack of a suitable opclass on column "b". [...]
> This functionality is provided by the attached patch "omit_opclass_4.0", which must be applied over covering_unique_4.0.patch.

Thanks for the updates.

Why is omit_opclass a separate patch? If the included columns now never participate in the index ordering, shouldn't it be an inherent property of the main patch that you can "cover" things without btree opclasses? Are you keeping them separate just to make review easier? Or do you think there might be a reason to commit one but not the other?

I think that if we decide not to use the omit_opclass patch, then we should also not allow covering columns to be specified on non-unique indexes.

It looks like the "covering" patch, with or without the "omit_opclass" patch, does not support expressions as included columns:

create table foobar (x text, y xml);
create index on foobar (x) including (md5(x));
ERROR: unrecognized node type: 904
create index on foobar (x) including ((y::text));
ERROR: unrecognized node type: 911

I think we would probably want it to work with those (or at least to throw a better error message).

Thanks, Jeff
> On 13 January 2016 at 06:47, Jeff Janes <jeff.janes@gmail.com> wrote:
>> Why is omit_opclass a separate patch? If the included columns now
>> never participate in the index ordering, shouldn't it be an inherent
>> property of the main patch that you can "cover" things without btree
>> opclasses?
>
> I don't personally think the covering_unique_4.0.patch is that close to being too big to review; I think things would make more sense if the omit_opclass_4.0.patch was included together with it.
I agree that these patches should be merged. It'll be fixed in the next update.
I kept them separate only for historical reasons: it was more convenient for me to debug them that way. Furthermore, I wanted to show the performance degradation caused by "omit_opclass" and give a way to reproduce it by running the test with and without that patch.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
13.01.2016 04:27, David Rowley wrote:
I've also done some testing:
create table ab (a int, b int);
insert into ab select a,b from generate_Series(1,10) a(a), generate_series(1,10000) b(b);
set enable_bitmapscan=off;
set enable_indexscan=off;
select * from ab where a = 1 and b=1;
a | b
---+---
1 | 1
(1 row)
set enable_indexscan = on;
select * from ab where a = 1 and b=1;
a | b
---+---
(0 rows)
This is broken. I've not looked into why yet, but from looking at the EXPLAIN output I was a bit surprised to see b=1 as an index condition. I'd have expected a Filter maybe, but I've not looked at the EXPLAIN code to see how those are determined yet.
Hmm... Did you apply both patches?
And could you provide the index definition? I can't reproduce the problem, assuming the index is created by the statement
CREATE INDEX idx ON ab (a) INCLUDING (b);
> On 14 January 2016 at 08:24, David Rowley <david.rowley@2ndquadrant.com> wrote:
>> I will try to review the omit_opclass_4.0.patch soon.
>
> Hi, as promised, here's my review of the omit_opclass_4.0.patch patch.

Thank you again. All mentioned points are fixed and the patches are merged.
I hope it's all right now. Please check the comments one more time; I'm not sure I worded everything correctly.
> Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, as it contains the including columns too by the looks of it. Maybe it just needs to drop the "Key" and become "ii_AttrNumbers". It would be interesting to hear what others think of that.
>
> I'm also wondering if indexkeys is still a good name for the IndexOptInfo struct member. Including columns are not really keys, but I feel renaming that might cause a fair bit of code churn, so I'd be interested to hear what others have to say.
I agree that KeyAttrNumbers and indexkeys are somewhat confusing names, but I'd like to keep them, at least in this patch.
It may be worth doing an "index structures refactoring" as a separate patch.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
12.01.2016 20:47, Jeff Janes wrote:
> It looks like the "covering" patch, with or without the "omit_opclass" patch, does not support expressions as included columns: [...]
> I think we would probably want it to work with those (or at least to throw a better error message).

Thank you for the notice. I couldn't fix it quickly, so I added a stub in the latest patch. I'll try to fix it properly and add expression support a bit later.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Tue, Jan 19, 2016 at 9:08 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time.

Unfortunately there are several merge conflicts between your patch and this commit:

commit 65c5fcd353a859da9e61bfb2b92a99f12937de3b
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sun Jan 17 19:36:59 2016 -0500

    Restructure index access method API to hide most of it at the C level.

Can you rebase past that commit?

Thanks, Jeff
On 20 January 2016 at 06:08, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Thank you again. All mentioned points are fixed and patches are merged.
> I hope it's all right now. Please check comments one more time.

Thanks for updating.

+ for the searching or ordering of records can defined in the

should be:

+ for the searching or ordering of records can be defined in the

but perhaps "defined" should be "included".

The following is still quite wasteful. CopyIndexTuple() does a palloc() and memcpy(), and then you throw that away if rel->rd_index->indnatts != rel->rd_index->indnkeyatts. I think you just need to add an "else" and move the CopyIndexTuple() below the if.

item = (IndexTuple) PageGetItem(lpage, itemid);
right_item = CopyIndexTuple(item);
+ if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
+     right_item = index_reform_tuple(rel, right_item, rel->rd_index->indnatts, rel->rd_index->indnkeyatts);

Tom also committed http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=65c5fcd353a859da9e61bfb2b92a99f12937de3b so it looks like you'll need to update your pg_am.h changes. Looks like you'll need a new struct member in IndexAmRoutine, and to populate that new member in each of the *handler functions listed in pg_am.h.

-#define Natts_pg_am 30
+#define Natts_pg_am 31

Can the following be changed:

- (At present, only b-tree supports it.)
+ (At present, only b-tree supports it.) Columns included with clause
+ INCLUDING aren't used to enforce uniqueness.

to:

- (At present, only b-tree supports it.)
+ (At present, only b-tree supports it.) Columns which are present in the
+ <literal>INCLUDING</> clause are not used to enforce uniqueness.

> I agree that KeyAttrNumbers and indexkeys are a bit confusing names, but I'd like to keep them at least in this patch.
> It may be worth doing "index structures refactoring" as a separate patch.

I agree. A separate patch sounds like the best course of action, but authoring that can wait until after this is committed (I think).

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
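For clarity, the suggested restructure would look roughly like this (a sketch based only on the snippet quoted above; it assumes index_reform_tuple() builds a fresh tuple from its input rather than modifying it in place):

item = (IndexTuple) PageGetItem(lpage, itemid);
if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts)
    right_item = index_reform_tuple(rel, item,
                                    rel->rd_index->indnatts,
                                    rel->rd_index->indnkeyatts);
else
    right_item = CopyIndexTuple(item);  /* copy only when no trimming is needed */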
22.01.2016 01:47, David Rowley wrote:
> The following is still quite wasteful. CopyIndexTuple() does a palloc() and memcpy(), and then you throw that away if rel->rd_index->indnatts != rel->rd_index->indnkeyatts. I think you just need to add an "else" and move the CopyIndexTuple() below the if.

Fixed. Thank you for reminding me.

> Tom also committed http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=65c5fcd353a859da9e61bfb2b92a99f12937de3b
> So it looks like you'll need to update your pg_am.h changes. Looks like you'll need a new struct member in IndexAmRoutine, and to populate that new member in each of the *handler functions listed in pg_am.h.
>
> -#define Natts_pg_am 30
> +#define Natts_pg_am 31

Done. I hope that my patch is close to commit too. Thank you again for the review.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Fri, Jan 22, 2016 at 7:19 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Done. I hope that my patch is close to commit too.

Thanks for the update. I've run into this problem:

create table foobar (x text, w text);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index foobar_pkey;
ERROR: index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
DETAIL: Cannot create a primary key or unique constraint using such an index.
Time: 1.577 ms

If I instead define the table as

create table foobar (x int, w xml);

then I can create the index and then the primary key the first time I do this in a session. But then if I drop the table and repeat the process, I get the "does not have default sorting behavior" error even for this index that previously succeeded, so I think there is some kind of problem with the backend syscache or catcache.

create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index foobar_pkey;
drop table foobar;
create table foobar (x int, w xml);
create unique index foobar_pkey on foobar (x) including (w);
alter table foobar add constraint foobar_pkey primary key using index foobar_pkey;
ERROR: index "foobar_pkey" does not have default sorting behavior
LINE 1: alter table foobar add constraint foobar_pkey primary key us...
DETAIL: Cannot create a primary key or unique constraint using such an index.

Cheers, Jeff
> Thanks for the update. I've run into this problem:
> [...]
> so I think there is some kind of problem with the backend syscache or catcache.
Great, I've fixed that. Thank you for the tip about the cache.
I've also found and fixed a related bug in copying tables with indexes:
create table tbl2 (like tbl including all);
There's also one more tiny fix, in get_pkey_attnames in the dblink module.
"including_columns_3.0" is the latest version of the patch.
Changes relative to the previous version are attached as a separate patch, just to ease review and debugging.
I've changed the size of the pg_index.indclass array: it now contains indnkeyatts elements,
while pg_index.indkey still contains all attributes. As a result, the well-known "Retrieve primary key columns" query (one common formulation is sketched below) gives a pretty non-obvious result. Is that normal behavior here, or are changes required? Do you know of any similar queries?
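For reference, that query (as commonly given on the PostgreSQL wiki; the table name is a placeholder) joins pg_attribute against pg_index.indkey, so with indkey still listing all attributes it now reports the INCLUDING columns as if they were part of the primary key:

select a.attname, format_type(a.atttypid, a.atttypmod) as data_type
from pg_index i
join pg_attribute a on a.attrelid = i.indrelid
                   and a.attnum = any(i.indkey)
where i.indrelid = 'tablename'::regclass
  and i.indisprimary;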
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 27 January 2016 at 03:35, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> including_columns_3.0 is the latest version of the patch.
> Changes relative to the previous version are attached as a separate patch, just to ease review and debugging.

Hi, I've made another pass over the patch. There's still a couple of things that I think need to be looked at.

Do we need the "b (included)" here? The key is (a) = (1). Having irrelevant details might be confusing.

postgres=# create table a (a int not null, b int not null);
CREATE TABLE
postgres=# create unique index on a (a) including(b);
CREATE INDEX
postgres=# insert into a values(1,1);
INSERT 0 1
postgres=# insert into a values(1,1);
ERROR: duplicate key value violates unique constraint "a_a_b_idx"
DETAIL: Key (a, b (included))=(1, 1) already exists.

Extra tabs:

/* Truncate nonkey attributes when inserting on nonleaf pages. */
if (rel->rd_index->indnatts != rel->rd_index->indnkeyatts && !P_ISLEAF(lpageop))
{
    itup = index_reform_tuple(rel, itup, rel->rd_index->indnatts, rel->rd_index->indnkeyatts);
}

In index_reform_tuple() I find it a bit scary that you change the TupleDesc's number of attributes then set it back again once you're finished reforming the shortened tuple. Maybe it would be better to modify index_form_tuple() to accept a new argument with a number of attributes; then you can just Assert that this number is never higher than the number of attributes in the TupleDesc.

I'm also not that keen on index_reform_tuple() in general. I wonder if there's a way we can just keep the Datum/isnull arrays a bit longer, and only form the tuple when needed. I've not looked into this in detail, but it does look like reforming the tuple is not going to be cheap. If we do need to keep this function, I think a better name might be index_trim_tuple(), and I don't think you need to pass the original length. It might make sense to Assert() that the trim length is smaller than the tuple size.

What statement will cause this?

numberOfKeyAttributes = list_length(stmt->indexParams);
if (numberOfKeyAttributes <= 0)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
             errmsg("must specify at least one key column")));

I seem to just get errors from the parser when trying.

Much of this goes over 80 chars:

/*
 * We append any INCLUDING columns onto the indexParams list so that
 * we have one list with all columns. Later we can determine which of these
 * are key columns, and which are just part of the INCLUDING list by check the list
 * position. A list item in a position less than ii_NumIndexKeyAttrs is part of
 * the key columns, and anything equal to and over is part of the
 * INCLUDING columns.
 */
stmt->indexParams = list_concat(stmt->indexParams, stmt->indexIncludingParams);

in gistrescan() there is some code:

for (attno = 1; attno <= natts; attno++)
{
    TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL,
                       scan->indexRelation->rd_opcintype[attno - 1],
                       -1, 0);
}

Going by RelationInitIndexAccessInfo(), rd_opcintype[] is allocated to be sized by the number of key columns, but this loop goes over the number of attribute columns. Perhaps this is not a big problem since GIST does not support INCLUDING columns, but it does seem wrong still.

Which brings me to the fact that I've spent a bit of time trying to look for places where you've forgotten to change natts to nkeyatts. I did find this one, but I don't have much confidence that there's not lots more places that have been forgotten.
Apart from this one, how confident are you that you've found all the places? I'm getting towards being happy with the code that I see that's been changed, but I'm hesitant to mark as "Ready for committer" due to not being all that comfortable that all the code that needs to be updated has been updated. I'm not quite sure of a good way to find all these places.

I'm wondering if we could hack the code so that each btree index which is created with > 1 column puts all but the first column into the INCLUDING columns, then run the regression tests to see if there are any crashes. I'm really not that sure of how else to increase the confidence levels on this. Do you have ideas?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
31.01.2016 11:04, David Rowley:
> On 27 January 2016 at 03:35, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> including_columns_3.0 is the latest version of patch.
>> And changes regarding the previous version are attached in a separate patch.
>> Just to ease the review and debug.
> Hi,
>
> I've made another pass over the patch. There are still a couple of
> things that I think need to be looked at.

Thank you again. I'm just writing to say that I haven't disappeared and I do remember the issue. But I'm very very busy this week. I'll send an updated patch next week as soon as possible.

> Do we need the "b (included)" here? The key is (a) = (1). Having
> irrelevant details might be confusing.
>
> postgres=# create table a (a int not null, b int not null);
> CREATE TABLE
> postgres=# create unique index on a (a) including(b);
> CREATE INDEX
> postgres=# insert into a values(1,1);
> INSERT 0 1
> postgres=# insert into a values(1,1);
> ERROR:  duplicate key value violates unique constraint "a_a_b_idx"
> DETAIL:  Key (a, b (included))=(1, 1) already exists.

I thought that it could be strange if a user inserts two values and then sees only one of them in the error message. But now I see that you're right. I'll also look at the same functionality in other DBs and fix it.

> In index_reform_tuple() I find it a bit scary that you change the
> TupleDesc's number of attributes then set it back again once you're
> finished reforming the shortened tuple.
> Maybe it would be better to modify index_form_tuple() to accept a new
> argument with a number of attributes, then you can just Assert that
> this number is never higher than the number of attributes in the
> TupleDesc.

Good point. I agree that this function is a bit strange. I have to set tupdesc->nattrs to support compatibility with index_form_tuple(). I didn't want to add either a new field to the tupledesc or a new parameter to index_form_tuple(), because they are used widely.

> I'm also not that keen on index_reform_tuple() in general. I wonder if
> there's a way we can just keep the Datum/isnull arrays a bit longer,
> and only form the tuple when needed. I've not looked into this in
> detail, but it does look like reforming the tuple is not going to be
> cheap.

It is used in splits, for example. There is no datum array, we just move the tuple key from a child page to a parent page or something like that. And according to the INCLUDING algorithm we need to truncate nonkey attributes.

> If we do need to keep this function, I think a better name might be
> index_trim_tuple(), and I don't think you need to pass the original
> length. It might make sense to Assert() that the trim length is
> smaller than the tuple size.

As regards the performance, I don't think that it's a big problem here. Do you suggest doing it in the following way: memcpy(oldtup, newtup, newtuplength)? I will

> In gistrescan() there is some code:
>
> for (attno = 1; attno <= natts; attno++)
> {
>     TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL,
>                        scan->indexRelation->rd_opcintype[attno - 1],
>                        -1, 0);
> }
>
> Going by RelationInitIndexAccessInfo(), rd_opcintype[] is allocated to
> be sized by the number of key columns, but this loop goes over the
> number of attribute columns.
> Perhaps this is not a big problem since GIST does not support
> INCLUDING columns, but it does seem wrong still.

GiST doesn't support the INCLUDING clause, so natts and nkeyatts are always equal. I don't see any problem here. And I think that it's extra work beyond this patch.
Maybe I or someone else will add this feature to other access methods later.

> Which brings me to the fact that I've spent a bit of time trying to
> look for places where you've forgotten to change natts to nkeyatts. I
> did find this one, but I don't have much confidence that there aren't
> lots more places that have been forgotten. Apart from this one, how
> confident are you that you've found all the places? I'm getting
> towards being happy with the code that I see that's been changed, but
> I'm hesitant to mark as "Ready for committer" due to not being all
> that comfortable that all the code that needs to be updated has been
> updated. I'm not quite sure of a good way to find all these places.

I found all mentions of natts and other related variables with grep, and replaced (or expanded) them with nkeyatts where it was necessary. As mentioned before, I didn't change other AMs. I strongly agree that any changes related to btree require thorough inspection, so I'll recheck it again. But I'm almost sure that it's okay.

> I'm wondering if we could hack the code so that each btree index which
> is created with > 1 column puts all but the first column into the
> INCLUDING columns, then run the regression tests to see if there are
> any crashes. I'm really not that sure of how else to increase the
> confidence levels on this. Do you have ideas?

Do I understand correctly that you suggest replacing all multicolumn indexes with (1 key column) + included ones? I don't think it's a good idea. The INCLUDING clause brings some disadvantages. For example, included columns must be filtered after the search, while key columns can be used in the scan key directly. I already mentioned this in the test example:

explain analyze select c1, c2 from tbl where c1<10000 and c3<20;

If the columns' opclasses are used, the new query plan uses them in the Index Cond: ((c1 < 10000) AND (c3 < 20))

Otherwise, the new query cannot use the included column in the Index Cond and uses a filter instead:
Index Cond: (c1 < 10000)
Filter: (c3 < 20)
Rows Removed by Filter: 9993

It slows down the query significantly.

And besides that, we still want to have multicolumn unique indexes.

CREATE UNIQUE INDEX on tbl (a, b, c) INCLUDING (d);

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
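The plan difference described above can be reproduced with a short script. The following is a minimal sketch assuming the patch's INCLUDING syntax; the table, the data, and the row counts are illustrative rather than taken from test.sql:

create table tbl (c1 int, c2 int, c3 int);
insert into tbl select i, i, i % 100 from generate_series(1, 100000) i;

-- c3 as a key column: the predicate can appear in the Index Cond
create index tbl_key_idx on tbl (c1, c2, c3);
explain analyze select c1, c2 from tbl where c1<10000 and c3<20;

-- c3 as an included column: depending on whether the included column's
-- opclass can be used, the predicate may drop down to a Filter
drop index tbl_key_idx;
create unique index tbl_incl_idx on tbl (c1, c2) including (c3);
explain analyze select c1, c2 from tbl where c1<10000 and c3<20;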
Anastasia Lubennikova wrote:
> I'm just writing to say that I haven't disappeared and I do remember
> the issue.
> But I'm very very busy this week. I'll send an updated patch next week
> as soon as possible.

That's great to know, thanks. I moved your patch to the next commitfest. Please do submit a new version before it starts!

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
02.02.2016 15:50, Anastasia Lubennikova:
> 31.01.2016 11:04, David Rowley:
>> On 27 January 2016 at 03:35, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>> including_columns_3.0 is the latest version of patch.
>>> And changes regarding the previous version are attached in a
>>> separate patch.
>>> Just to ease the review and debug.
>> Hi,
>>
>> I've made another pass over the patch. There are still a couple of
>> things that I think need to be looked at.
> Thank you again.
> I'm just writing to say that I haven't disappeared and I do remember
> the issue.
> But I'm very very busy this week. I'll send an updated patch next week
> as soon as possible.

As promised, here's the new version of the patch, "including_columns_4.0". I fixed all issues except some points mentioned below. Besides, I did some refactoring:
- use the macros IndexRelationGetNumberOfAttributes and IndexRelationGetNumberOfKeyAttributes where possible, as well as RelationGetNumberOfAttributes. Maybe these are slightly unrelated changes, but they'll make development much easier in the future.
- rename related variables to indnatts and indnkeyatts.

>> I'm also not that keen on index_reform_tuple() in general. I wonder if
>> there's a way we can just keep the Datum/isnull arrays a bit longer,
>> and only form the tuple when needed. I've not looked into this in
>> detail, but it does look like reforming the tuple is not going to be
>> cheap.
> It is used in splits, for example. There is no datum array, we just
> move the tuple key from a child page to a parent page or something
> like that.
> And according to the INCLUDING algorithm we need to truncate nonkey
> attributes.
>> If we do need to keep this function, I think a better name might be
>> index_trim_tuple(), and I don't think you need to pass the original
>> length. It might make sense to Assert() that the trim length is
>> smaller than the tuple size.
>
> As regards the performance, I don't think that it's a big problem here.
> Do you suggest doing it in the following way: memcpy(oldtup, newtup,
> newtuplength)?

I've tested it some more, and still didn't find any performance issues.

>> In gistrescan() there is some code:
>>
>> for (attno = 1; attno <= natts; attno++)
>> {
>>     TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL,
>>                        scan->indexRelation->rd_opcintype[attno - 1],
>>                        -1, 0);
>> }
>>
>> Going by RelationInitIndexAccessInfo(), rd_opcintype[] is allocated to
>> be sized by the number of key columns, but this loop goes over the
>> number of attribute columns.
>> Perhaps this is not a big problem since GIST does not support
>> INCLUDING columns, but it does seem wrong still.
>
> GiST doesn't support the INCLUDING clause, so natts and nkeyatts are
> always equal. I don't see any problem here.
> And I think that it's extra work beyond this patch. Maybe I or someone
> else will add this feature to other access methods later.

Still the same.

>> Which brings me to the fact that I've spent a bit of time trying to
>> look for places where you've forgotten to change natts to nkeyatts. I
>> did find this one, but I don't have much confidence that there aren't
>> lots more places that have been forgotten. Apart from this one, how
>> confident are you that you've found all the places? I'm getting
>> towards being happy with the code that I see that's been changed, but
>> I'm hesitant to mark as "Ready for committer" due to not being all
>> that comfortable that all the code that needs to be updated has been
>> updated. I'm not quite sure of a good way to find all these places.
> I found all mentions of natts and other related variables with grep,
> and replaced (or expanded) them with nkeyatts where it was necessary.
> As mentioned before, I didn't change other AMs.
> I strongly agree that any changes related to btree require thorough
> inspection, so I'll recheck it again. But I'm almost sure that it's okay.

I rechecked everything again and fixed a couple of omissions. Thank you for being an exacting reviewer) I don't know how to ensure that everything is ok, and I have no idea what else I can do.

>> I'm wondering if we could hack the code so that each btree index which
>> is created with > 1 column puts all but the first column into the
>> INCLUDING columns, then run the regression tests to see if there are
>> any crashes. I'm really not that sure of how else to increase the
>> confidence levels on this. Do you have ideas?
>
> Do I understand correctly that you suggest replacing all multicolumn
> indexes with (1 key column) + included ones?
> I don't think it's a good idea. The INCLUDING clause brings some
> disadvantages. For example, included columns must be filtered after
> the search, while key columns can be used in the scan key directly. I
> already mentioned this in the test example:
>
> explain analyze select c1, c2 from tbl where c1<10000 and c3<20;
>
> If the columns' opclasses are used, the new query plan uses them in
> the Index Cond: ((c1 < 10000) AND (c3 < 20))
> Otherwise, the new query cannot use the included column in the Index
> Cond and uses a filter instead:
> Index Cond: (c1 < 10000)
> Filter: (c3 < 20)
> Rows Removed by Filter: 9993
> It slows down the query significantly.
>
> And besides that, we still want to have multicolumn unique indexes.
> CREATE UNIQUE INDEX on tbl (a, b, c) INCLUDING (d);

I started a new thread about the related refactoring, because I think that it should be a separate patch. http://www.postgresql.org/message-id/56BB7788.30808@postgrespro.ru

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Thu, Feb 11, 2016 at 8:46 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 02.02.2016 15:50, Anastasia Lubennikova:
>
> As promised, here's the new version of the patch, "including_columns_4.0".
> I fixed all issues except some points mentioned below.

Thanks for the updated patch. I get a compiler warning:

genam.c: In function 'BuildIndexValueDescription':
genam.c:259: warning: unused variable 'tupdesc'

Also, I can't create a primary key INCLUDING columns directly:

jjanes=# create table foobar (a int, b int, c int);
jjanes=# alter table foobar add constraint foobar_pkey primary key (a,b) including (c);
ERROR:  syntax error at or near "including"

But I can get there using a circuitous route:

jjanes=# create unique index on foobar (a,b) including (c);
jjanes=# alter table foobar add constraint foobar_pkey primary key using index foobar_a_b_c_idx;

The description of the table's index knows to include the including column:

jjanes=# \d foobar
     Table "public.foobar"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
 c      | integer |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)

Since the machinery appears to all be in place to have primary keys with INCLUDING columns, it would be nice if the syntax for adding primary keys allowed one to implement them directly.

Is this something for future expansion, or could it be added at the same time as the main patch? I think it would be pretty frustrating for the user to be unable to do this right from the start.

Cheers,

Jeff
25.02.2016 21:39, Jeff Janes:
>> As promised, here's the new version of the patch, "including_columns_4.0".
>> I fixed all issues except some points mentioned below.
> Thanks for the updated patch. I get a compiler warning:
>
> genam.c: In function 'BuildIndexValueDescription':
> genam.c:259: warning: unused variable 'tupdesc'

Thank you for the notice, I'll fix it in the next update.

> Also, I can't create a primary key INCLUDING columns directly:
>
> jjanes=# create table foobar (a int, b int, c int);
> jjanes=# alter table foobar add constraint foobar_pkey primary key
> (a,b) including (c);
> ERROR:  syntax error at or near "including"
>
> But I can get there using a circuitous route:
>
> jjanes=# create unique index on foobar (a,b) including (c);
> jjanes=# alter table foobar add constraint foobar_pkey primary key
> using index foobar_a_b_c_idx;
>
> The description of the table's index knows to include the including column:
>
> jjanes=# \d foobar
>      Table "public.foobar"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  a      | integer | not null
>  b      | integer | not null
>  c      | integer |
> Indexes:
>     "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>
>
> Since the machinery appears to all be in place to have primary keys
> with INCLUDING columns, it would be nice if the syntax for adding
> primary keys allowed one to implement them directly.
>
> Is this something for future expansion, or could it be added at the
> same time as the main patch?

Good point. At a quick glance, this looks easy to implement. The only problem is that there are too many places in the code which must be updated. I'll try to do it, and if there are difficulties, it's fine with me to delay this feature for future work.

I found one more thing to do. pg_dump does not handle included columns now. I will fix it in the next version of the patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
29.02.2016 18:17, Anastasia Lubennikova:
> 25.02.2016 21:39, Jeff Janes:
>>> As promised, here's the new version of the patch,
>>> "including_columns_4.0".
>>> I fixed all issues except some points mentioned below.
>> Thanks for the updated patch. I get a compiler warning:
>>
>> genam.c: In function 'BuildIndexValueDescription':
>> genam.c:259: warning: unused variable 'tupdesc'
>
> Thank you for the notice, I'll fix it in the next update.
>> Also, I can't create a primary key INCLUDING columns directly:
>>
>> jjanes=# create table foobar (a int, b int, c int);
>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>> (a,b) including (c);
>> ERROR:  syntax error at or near "including"
>>
>> But I can get there using a circuitous route:
>>
>> jjanes=# create unique index on foobar (a,b) including (c);
>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>> using index foobar_a_b_c_idx;
>>
>> The description of the table's index knows to include the including
>> column:
>>
>> jjanes=# \d foobar
>>      Table "public.foobar"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  a      | integer | not null
>>  b      | integer | not null
>>  c      | integer |
>> Indexes:
>>     "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>>
>>
>> Since the machinery appears to all be in place to have primary keys
>> with INCLUDING columns, it would be nice if the syntax for adding
>> primary keys allowed one to implement them directly.
>>
>> Is this something for future expansion, or could it be added at the
>> same time as the main patch?
>
> Good point.
> At a quick glance, this looks easy to implement. The only problem is
> that there are too many places in the code which must be updated.
> I'll try to do it, and if there are difficulties, it's fine with me to
> delay this feature for future work.
>
> I found one more thing to do. pg_dump does not handle included columns
> now. I will fix it in the next version of the patch.

As promised, the fixed patch is attached. It allows the following statements to be performed:

create table utbl (a int, b box);
alter table utbl add unique (a) including(b);
create table ptbl (a int, b box);
alter table ptbl add primary key (a) including(b);

And now they can be dumped/restored successfully. I used the following settings:

pg_dump --verbose -Fc postgres -f pg.dump
pg_restore -d newdb pg.dump

It is not the final version, because it breaks pg_dump for previous versions. I need some help from hackers here. pg_dump, line 5466:

if (fout->remoteVersion >= 90400)

What does 'remoteVersion' mean? And what is the right way to change it? Or does it change between releases? I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really so? That is totally new to me. BTW, while we are on the subject, maybe it's worth replacing these magic numbers with some set of macros?

P.S. I'll update the documentation for ALTER TABLE in the next patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
01.03.2016 19:55, Anastasia Lubennikova:
>
> 29.02.2016 18:17, Anastasia Lubennikova:
>> 25.02.2016 21:39, Jeff Janes:
>>>> As promised, here's the new version of the patch,
>>>> "including_columns_4.0".
>>>> I fixed all issues except some points mentioned below.
>>> Thanks for the updated patch. I get a compiler warning:
>>>
>>> genam.c: In function 'BuildIndexValueDescription':
>>> genam.c:259: warning: unused variable 'tupdesc'
>>
>> Thank you for the notice, I'll fix it in the next update.
>>> Also, I can't create a primary key INCLUDING columns directly:
>>>
>>> jjanes=# create table foobar (a int, b int, c int);
>>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>>> (a,b) including (c);
>>> ERROR:  syntax error at or near "including"
>>>
>>> But I can get there using a circuitous route:
>>>
>>> jjanes=# create unique index on foobar (a,b) including (c);
>>> jjanes=# alter table foobar add constraint foobar_pkey primary key
>>> using index foobar_a_b_c_idx;
>>>
>>> The description of the table's index knows to include the including
>>> column:
>>>
>>> jjanes=# \d foobar
>>>      Table "public.foobar"
>>>  Column |  Type   | Modifiers
>>> --------+---------+-----------
>>>  a      | integer | not null
>>>  b      | integer | not null
>>>  c      | integer |
>>> Indexes:
>>>     "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)
>>>
>>>
>>> Since the machinery appears to all be in place to have primary keys
>>> with INCLUDING columns, it would be nice if the syntax for adding
>>> primary keys allowed one to implement them directly.
>>>
>>> Is this something for future expansion, or could it be added at the
>>> same time as the main patch?
>>
>> Good point.
>> At a quick glance, this looks easy to implement. The only problem is
>> that there are too many places in the code which must be updated.
>> I'll try to do it, and if there are difficulties, it's fine with me
>> to delay this feature for future work.
>>
>> I found one more thing to do. pg_dump does not handle included
>> columns now. I will fix it in the next version of the patch.
>
> As promised, the fixed patch is attached. It allows the following
> statements to be performed:
>
> create table utbl (a int, b box);
> alter table utbl add unique (a) including(b);
> create table ptbl (a int, b box);
> alter table ptbl add primary key (a) including(b);
>
> And now they can be dumped/restored successfully.
> I used the following settings:
> pg_dump --verbose -Fc postgres -f pg.dump
> pg_restore -d newdb pg.dump
>
> It is not the final version, because it breaks pg_dump for previous
> versions. I need some help from hackers here. pg_dump, line 5466:
>
> if (fout->remoteVersion >= 90400)
>
> What does 'remoteVersion' mean? And what is the right way to change
> it? Or does it change between releases?
> I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really
> so? That is totally new to me.
> BTW, while we are on the subject, maybe it's worth replacing these
> magic numbers with some set of macros?
>
> P.S. I'll update the documentation for ALTER TABLE in the next patch.

Sorry for the missed attachment. Now it's here.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Wed, Mar 2, 2016 at 2:10 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 01.03.2016 19:55, Anastasia Lubennikova:
>> It is not the final version, because it breaks pg_dump for previous
>> versions. I need some help from hackers here. pg_dump, line 5466:
>>
>> if (fout->remoteVersion >= 90400)
>>
>> What does 'remoteVersion' mean? And what is the right way to change
>> it? Or does it change between releases?
>> I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really
>> so? That is totally new to me.

Yes, you got it. That's basically PG_VERSION_NUM as compiled on the server that has been queried, in this case the server from which a dump is taken. If you are changing the system catalog layer, you would need to provide a query at least equivalent to what has been done until now for your patch, then modify pg_dump as follows:

if (fout->remoteVersion >= 90600)
{
    query = my_new_query;
}
else if (fout->remoteVersion >= 90400)
{
    query = the existing 9.4 query
}
etc.

In short, you just need to add a new block so that remote servers on 9.6 and newer will be able to dump objects correctly. pg_upgrade is actually a good way to check the validity of pg_dump; this explains why some objects are not dropped in the regression tests. Perhaps you'd want to do the same with your patch if the current test coverage of pg_dump is not enough. I have not looked at your patch so I cannot say for sure.

--
Michael
02.03.2016 08:50, Michael Paquier:
> On Wed, Mar 2, 2016 at 2:10 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> 01.03.2016 19:55, Anastasia Lubennikova:
>>> It is not the final version, because it breaks pg_dump for previous
>>> versions. I need some help from hackers here. pg_dump, line 5466:
>>>
>>> if (fout->remoteVersion >= 90400)
>>>
>>> What does 'remoteVersion' mean? And what is the right way to change
>>> it? Or does it change between releases?
>>> I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really
>>> so? That is totally new to me.
> Yes, you got it. That's basically PG_VERSION_NUM as compiled on the
> server that has been queried, in this case the server from which a
> dump is taken. If you are changing the system catalog layer, you would
> need to provide a query at least equivalent to what has been done
> until now for your patch, then modify pg_dump as follows:
>
> if (fout->remoteVersion >= 90600)
> {
>     query = my_new_query;
> }
> else if (fout->remoteVersion >= 90400)
> {
>     query = the existing 9.4 query
> }
> etc.
>
> In short, you just need to add a new block so that remote servers on
> 9.6 and newer will be able to dump objects correctly. pg_upgrade is
> actually a good way to check the validity of pg_dump; this explains
> why some objects are not dropped in the regression tests. Perhaps
> you'd want to do the same with your patch if the current test coverage
> of pg_dump is not enough. I have not looked at your patch so I cannot
> say for sure.

Thank you for the explanation. The new version of the patch implements pg_dump support well. Documentation related to constraints is updated. I hope that the patch is in good shape now.

Brief overview for reviewers:

This patch allows a unique index to be defined on one set of columns and to include another set of columns in the INCLUDING clause, on which uniqueness is not enforced. It allows more queries to benefit from using index-only scans. Currently, only the B-tree access method supports this feature.

Syntax example:

CREATE TABLE tbl (c1 int, c2 int, c3 box);
CREATE INDEX idx ON tbl (c1) INCLUDING (c2, c3);

In contrast to the key column (c1), the included columns (c2, c3) are not used in index scankeys, neither in "search" scankeys nor in "insertion" scankeys. Included columns are stored only in leaf pages, which helps to slightly reduce index size. Hence, included columns do not require any opclass for the btree access method. As you can see from the example above, it's possible to include columns of "box" type in the index.

The most common use case for this feature is the combination of a UNIQUE or PRIMARY KEY constraint on columns (a,b) and a covering index on columns (a,b,c). So there is new syntax for constraints.

CREATE TABLE tblu (c1 int, c2 int, c3 box, UNIQUE (c1,c2) INCLUDING (c3));

The index created for this constraint contains three columns.

"tblu_c1_c2_c3_key" UNIQUE CONSTRAINT, btree (c1, c2) INCLUDING (c3)

CREATE TABLE tblpk (c1 int, c2 int, c3 box, PRIMARY KEY (c1) INCLUDING (c3));

The index created for this constraint contains two columns. Note that the NOT NULL constraint, like the unique constraint, is applied only to the key column(s).
postgres=# \d tblpk
      Table "public.tblpk"
 Column |  Type   | Modifiers
--------+---------+-----------
 c1     | integer | not null
 c2     | integer |
 c3     | box     |
Indexes:
    "tblpk_pkey" PRIMARY KEY, btree (c1) INCLUDING (c3)

Same for ALTER TABLE statements:

CREATE TABLE tblpka (c1 int, c2 int, c3 box);
ALTER TABLE tblpka ADD PRIMARY KEY (c1) INCLUDING (c3);

pg_dump is updated and seems to work fine with this kind of index.

I see only one problem left (maybe I've mentioned it before). Queries like this [1] must be rewritten, because after the catalog changes, i.indkey contains both key and included attrs. One more thing to do is some refactoring of names, since "indkey" looks really confusing to me. But that could be done as a separate patch [2].

[1] https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns
[2] http://www.postgresql.org/message-id/56BB7788.30808@postgrespro.ru

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
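To make the problem with [1] concrete: that query joins on a.attnum = ANY(i.indkey), so once i.indkey also carries included attributes it would report them as if they were key columns. A possible rewrite is sketched below, under the assumption that the patch records the number of key columns in a pg_index column (called indnkeyatts here; the name is an assumption), so that indkey can be sliced down to the key attributes:

SELECT a.attname, format_type(a.atttypid, a.atttypmod) AS data_type
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
                   AND a.attnum = ANY(i.indkey[0:i.indnkeyatts - 1])
WHERE i.indrelid = 'tblpk'::regclass
  AND i.indisprimary;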
On 3/14/16 9:57 AM, Anastasia Lubennikova wrote:
> The new version of the patch implements pg_dump support well.
> Documentation related to constraints is updated.
>
> I hope that the patch is in good shape now.

It looks like this patch should be marked "needs review" and I have done so.

--
-David
david@pgmasters.net
On Fri, Mar 18, 2016 at 5:15 AM, David Steele <david@pgmasters.net> wrote: > It looks like this patch should be marked "needs review" and I have done so. Uh, no it shouldn't. I've posted an extensive review on the original design thread. See CF entry: https://commitfest.postgresql.org/9/433/ Marked "Waiting on Author". -- Peter Geoghegan
19.03.2016 08:00, Peter Geoghegan:
> On Fri, Mar 18, 2016 at 5:15 AM, David Steele <david@pgmasters.net> wrote:
>> It looks like this patch should be marked "needs review" and I have done so.
> Uh, no it shouldn't. I've posted an extensive review on the original
> design thread. See CF entry:
>
> https://commitfest.postgresql.org/9/433/
>
> Marked "Waiting on Author".

Thanks to David; I'd missed these letters at first. I'll answer here.

> * You truncate (remove suffix attributes -- the "included" attributes)
> within _bt_insertonpg():
>
> - right_item = CopyIndexTuple(item);
> + indnatts = IndexRelationGetNumberOfAttributes(rel);
> + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> +
> + if (indnatts != indnkeyatts)
> + {
> + right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts);
> + right_item_sz = IndexTupleDSize(*right_item);
> + right_item_sz = MAXALIGN(right_item_sz);
> + }
> + else
> + right_item = CopyIndexTuple(item);
> ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY);
>
> I suggest that you do this within _bt_insert_parent(), instead, iff
> the original target page is known to be a leaf page. That's where it
> needs to happen for conventional suffix truncation, which has special
> considerations when determining which attributes are safe to truncate
> (or even which byte in the first distinguishing attribute it is okay
> to truncate past)

I agree that _bt_insertonpg() is not the right place for truncation. Furthermore, I've noticed that all internal keys are simply copies of the "high keys" from the leaf pages, which is pretty logical. Therefore, if we have already truncated the tuple when it became a high key, we do not need the same truncation within _bt_insert_parent() or any other function. So the only thing to worry about is the high key truncation. I rewrote the code. Now only _bt_split cares about truncation.

It's a bit more complicated to add it into the index creation algorithm. There's a trick with a "high key":

/*
 * We copy the last item on the page into the new page, and then
 * rearrange the old page so that the 'last item' becomes its high key
 * rather than a true data item. There had better be at least two
 * items on the page already, else the page would be empty of useful
 * data.
 */
/*
 * Move 'last' into the high key position on opage
 */

To be consistent with other steps of the algorithm (all high keys must be truncated tuples), I had to update this high key in place: delete the old one, and insert a truncated high key. The very same logic is used to truncate the posting list of a compressed tuple in the "btree_compression" patch. [1] I hope both patches will be accepted, and then I'll thoroughly merge them.

> * I think the comparison logic may have a bug.
>
> Does this work with amcheck? Maybe it works with bt_index_check(), but
> not bt_index_parent_check()? I think that you need to make sure that
> _bt_compare() knows about this, too. That's because it isn't good
> enough to let a truncated internal IndexTuple compare equal to a
> scankey when non-truncated attributes are equal.

It is a very important issue, but I don't think there's a bug here.
I've read the amcheck sources thoroughly and found that the problem appears in invariant_key_less_than_equal_nontarget_offset():

static bool
invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
                                               Page nontarget, ScanKey key,
                                               OffsetNumber upperbound)
{
    int16    natts = state->rel->rd_rel->relnatts;
    int32    cmp;

    cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);

    return cmp <= 0;
}

It uses a scankey made with _bt_mkscankey(), which uses only key attributes, but calls _bt_compare() with the wrong keysz. If we use nkeyatts = state->rel->rd_index->relnatts; instead of natts, all the checks pass successfully.

Same for invariant_key_greater_than_equal_offset() and invariant_key_less_than_equal_nontarget_offset().

In my view, it's the correct way to fix this problem, because the caller is responsible for passing proper arguments to the function. Of course I will add a check into _bt_compare(), but I'd rather make it an assertion (see the patch attached).

I'll add a flag to distinguish regular and truncated tuples, but it will not be used in this patch. Please comment if I've missed something. As you've already mentioned, neither high keys nor tuples on internal pages use "itup->t_tid.ip_posid", so I'll take one bit of it.

It will definitely require changes in future work on suffix truncation or something like that, but IMHO for now it's enough.

Do you have any objections or comments?

[1] https://commitfest.postgresql.org/9/494/

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
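One way to exercise exactly these invariants is to run amcheck's two check functions against a covering index after it has grown a few levels. A minimal sketch, assuming the external amcheck module discussed above is installed, and using illustrative object names:

create extension amcheck;

create table tbl (id int, data int);
create unique index idx on tbl (id) including (data);
insert into tbl select i, i from generate_series(1, 1000000) i;

-- bt_index_check verifies invariants level by level;
-- bt_index_parent_check also verifies parent/child relationships,
-- which is where a wrong keysz passed to _bt_compare() would show up
select bt_index_check('idx'::regclass);
select bt_index_parent_check('idx'::regclass);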
21.03.2016 19:53, Anastasia Lubennikova:
> 19.03.2016 08:00, Peter Geoghegan:
>> On Fri, Mar 18, 2016 at 5:15 AM, David Steele <david@pgmasters.net>
>> wrote:
>>> It looks like this patch should be marked "needs review" and I have
>>> done so.
>> Uh, no it shouldn't. I've posted an extensive review on the original
>> design thread. See CF entry:
>>
>> https://commitfest.postgresql.org/9/433/
>>
>> Marked "Waiting on Author".
> Thanks to David; I'd missed these letters at first.
> I'll answer here.
>
>> * You truncate (remove suffix attributes -- the "included" attributes)
>> within _bt_insertonpg():
>>
>> - right_item = CopyIndexTuple(item);
>> + indnatts = IndexRelationGetNumberOfAttributes(rel);
>> + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
>> +
>> + if (indnatts != indnkeyatts)
>> + {
>> + right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts);
>> + right_item_sz = IndexTupleDSize(*right_item);
>> + right_item_sz = MAXALIGN(right_item_sz);
>> + }
>> + else
>> + right_item = CopyIndexTuple(item);
>> ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY);
>>
>> I suggest that you do this within _bt_insert_parent(), instead, iff
>> the original target page is known to be a leaf page. That's where it
>> needs to happen for conventional suffix truncation, which has special
>> considerations when determining which attributes are safe to truncate
>> (or even which byte in the first distinguishing attribute it is okay
>> to truncate past)
>
> I agree that _bt_insertonpg() is not the right place for truncation.
> Furthermore, I've noticed that all internal keys are simply copies of
> the "high keys" from the leaf pages, which is pretty logical.
> Therefore, if we have already truncated the tuple when it became a
> high key, we do not need the same truncation within
> _bt_insert_parent() or any other function.
> So the only thing to worry about is the high key truncation. I rewrote
> the code. Now only _bt_split cares about truncation.
>
> It's a bit more complicated to add it into the index creation
> algorithm. There's a trick with a "high key":
>
> /*
>  * We copy the last item on the page into the new page, and then
>  * rearrange the old page so that the 'last item' becomes its high key
>  * rather than a true data item. There had better be at least two
>  * items on the page already, else the page would be empty of useful
>  * data.
>  */
> /*
>  * Move 'last' into the high key position on opage
>  */
>
> To be consistent with other steps of the algorithm (all high keys must
> be truncated tuples), I had to update this high key in place:
> delete the old one, and insert a truncated high key.
> The very same logic is used to truncate the posting list of a
> compressed tuple in the "btree_compression" patch. [1]
> I hope both patches will be accepted, and then I'll thoroughly merge
> them.
>
>> * I think the comparison logic may have a bug.
>>
>> Does this work with amcheck? Maybe it works with bt_index_check(), but
>> not bt_index_parent_check()? I think that you need to make sure that
>> _bt_compare() knows about this, too. That's because it isn't good
>> enough to let a truncated internal IndexTuple compare equal to a
>> scankey when non-truncated attributes are equal.
>
> It is a very important issue, but I don't think there's a bug here.
> I've read the amcheck sources thoroughly and found that the problem
> appears in invariant_key_less_than_equal_nontarget_offset():
>
> static bool
> invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state,
>                                                Page nontarget, ScanKey key,
>                                                OffsetNumber upperbound)
> {
>     int16    natts = state->rel->rd_rel->relnatts;
>     int32    cmp;
>
>     cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound);
>
>     return cmp <= 0;
> }
>
> It uses a scankey made with _bt_mkscankey(), which uses only key
> attributes, but calls _bt_compare() with the wrong keysz.
> If we use nkeyatts = state->rel->rd_index->relnatts; instead of natts,
> all the checks pass successfully.
>
> Same for invariant_key_greater_than_equal_offset() and
> invariant_key_less_than_equal_nontarget_offset().
>
> In my view, it's the correct way to fix this problem, because the
> caller is responsible for passing proper arguments to the function.
> Of course I will add a check into _bt_compare(), but I'd rather make
> it an assertion (see the patch attached).
>
> I'll add a flag to distinguish regular and truncated tuples, but it
> will not be used in this patch. Please comment if I've missed
> something.
> As you've already mentioned, neither high keys nor tuples on internal
> pages use "itup->t_tid.ip_posid", so I'll take one bit of it.
>
> It will definitely require changes in future work on suffix
> truncation or something like that, but IMHO for now it's enough.
>
> Do you have any objections or comments?
>
> [1] https://commitfest.postgresql.org/9/494/

One more version of the patch is attached. I did more testing and fixed a couple of bugs. Now, if any indexed column is deleted from the table, we perform cascade deletion of the constraint and index.

/*
 * 3.1 Test ALTER TABLE tbl DROP COLUMN c.
 * Dropping an included column leads to the index being dropped,
 * just as dropping a key column does. It's explained in the documentation.
 */

The constraint definition is fixed too. Also, I added a separate regression test for the INCLUDING clause that covers both indexes and constraints. I've tested pg_dump and didn't find any problems. A test script is attached.

It seems to me that the patch is complete, except maybe for a grammar check of comments and documentation.

Looking forward to your review.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
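The cascade behaviour exercised in 3.1 can be seen directly from psql. A minimal sketch, assuming the patch's INCLUDING syntax (object names are illustrative):

create table tbl (c1 int, c2 int, c3 int, c4 box);
create unique index tbl_idx on tbl (c1, c2) including (c3, c4);

-- dropping an included column drops the whole index,
-- exactly as dropping a key column would
alter table tbl drop column c3;
\d tbl    -- tbl_idx is gone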
> It seems to me that the patch is complete, except maybe for a grammar
> check of comments and documentation.
>
> Looking forward to your review.

Are there any objections to it? I'm planning to look closely today or tomorrow and commit it.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Mon, Apr 4, 2016 at 7:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Are there any objections to it? I'm planning to look closely today or
> tomorrow and commit it.

I object to committing the patch in that time frame. I'm looking at it again.

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Mon, Apr 4, 2016 at 7:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Are there any objections to it? I'm planning to look closely today or
>> tomorrow and commit it.

> I object to committing the patch in that time frame. I'm looking at it again.

Since it's a rather complex patch, pushing it in advance of the reviewers signing off on it doesn't seem like a great idea ...

regards, tom lane
On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Thanks to David; I'd missed these letters at first.
> I'll answer here.

Sorry about using the wrong thread.

> I agree that _bt_insertonpg() is not the right place for truncation.

Cool.

> It's a bit more complicated to add it into the index creation algorithm.
> There's a trick with a "high key":
> /*
>  * We copy the last item on the page into the new page, and then
>  * rearrange the old page so that the 'last item' becomes its high key
>  * rather than a true data item. There had better be at least two
>  * items on the page already, else the page would be empty of useful
>  * data.
>  */
> /*
>  * Move 'last' into the high key position on opage
>  */
>
> To be consistent with other steps of the algorithm (all high keys must
> be truncated tuples), I had to update this high key in place:
> delete the old one, and insert a truncated high key.

Hmm. But the high key comparing equal to the Scankey gives insertion the choice of where to put its IndexTuple (it can go on the page with the high key, or its right-sibling, according only to considerations about fillfactor, etc). Is this changed? Does it not matter? Why not? Is it just worth it?

The right-most page on every level has no high-key. But you say those pages have an "imaginary" *positive* infinity high key, just as internal pages have (non-imaginary) minus infinity downlinks as their first item/downlink. So tuples in a (say) leaf page are always bound by the downlink lower bound in parent, while their own high key is an upper bound. Either (and, rarely, both) could be (positive or negative) infinity.

Maybe you now see why I talked about special _bt_compare() logic for this. I proposed special logic that is similar to the existing minus infinity thing _bt_compare() does (although _bt_binsrch(), an important caller of _bt_compare(), also does special things for the internal vs. leaf case, so I'm not sure any new special logic must go in _bt_compare()).

> It is a very important issue, but I don't think there's a bug here.
> I've read the amcheck sources thoroughly and found that the problem
> appears in invariant_key_less_than_equal_nontarget_offset().
> It uses a scankey made with _bt_mkscankey(), which uses only key
> attributes, but calls _bt_compare() with the wrong keysz.
> If we use nkeyatts = state->rel->rd_index->relnatts; instead of natts,
> all the checks pass successfully.

I probably shouldn't have brought amcheck into that particular discussion. I thought amcheck might be a useful way to frame the discussion, because amcheck always cares about specific invariants, and notes a few special cases.

> In my view, it's the correct way to fix this problem, because the
> caller is responsible for passing proper arguments to the function.
> Of course I will add a check into _bt_compare(), but I'd rather make
> it an assertion (see the patch attached).

I see what you mean, but I think we need to decide what to do about the key space when leaf high keys are truncated. I do think that truncating the high key was the right idea, though, and it nicely illustrates that nothing special should happen in upper levels. Suffix truncation should only happen when leaf pages are split, generally speaking.

As I said, the high key is very similar to the downlinks, in that both bound the things that go on each page. Together with downlinks they represent *discrete* ranges *unambiguously*, so INCLUDING truncation needs to make it clear which page new items go on.
As I said, _bt_binsrch() already takes special actions for internal pages, making sure to return the item that is < the scankey, not <= the scankey which is only allowed for leaf pages. (See README, from "Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page...").

To give a specific example, I worry about the case where two sibling downlinks in a parent page are distinct, but per specific-to-Postgres "Ki <= v <= Ki+1" thing (which differs from the classic L&Y invariant), some tuples with all right downlink's attributes matching end up in left child page, not right child page. I worry that since _bt_findsplitloc() doesn't consider this (for example), the split point doesn't *reliably* and unambiguously divide the key space between the new halves of a page being split. I think the "Ki <= v <= Ki+1"/_bt_binsrch() thing might save you in common cases where all downlink attributes are distinct, so maybe that simpler case is okay.

But to be even more specific, what about the more complicated case where the downlinks *are* fully _bt_compare()-wise equal? This could happen even though they're constrained to be unique in leaf pages, due to bloat. Unique indexes aren't special here; they just make it far less likely that this would happen in practice, because it takes a lot of bloat. Less importantly, when that bloat happens, you don't want to have to do a linear scan through many leaf pages (that should only happen when there are many fully matching IndexTuples at the leaf level -- not just matching on constrained attributes).

The more I think about it, the more I doubt that it's okay to not ensure downlinks are always distinct with their siblings, by sometimes including non-constrained (truncatable) attributes within internal pages, as needed to *distinguish* downlinks (also, we must occasionally have *all* attributes including truncatable attributes in internal pages -- we must truncate nothing to keep the key space sane in the parent). Unfortunately, these requirements are very close to the actual full requirements for a full, complete suffix truncation patch, including storing how many attributes are stored in each and every internal IndexTuple (no general thing for the index), page split code to determine where to truncate to make adjacent downlinks distinct, etc.

You may think: But that fully-matching-downlink case is okay, because it only makes us do more linear scanning due to the lack of non-truncatable attributes, which is still correct, if a little more slow when there is bloat -- at the leaf level, we'll start at the correct place (the first place the item could be on), per the "Ki <= v <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We need to be able to reliably detect a concurrent page-split. Otherwise, we'll move right within _bt_search() before even considering if anything of interest for our index scan *might* be on the initial page found from downlink (before even calling _bt_binsrch()). Even this bug wouldn't happen in the common case where nextkey = true, but what about when nextkey = false (e.g. for backwards scans)? We'd skip stuff we are not supposed to by spuriously moving right, I think. I have a bad feeling that even then we'd "accidentally fail to fail", because of how backwards scans work at a higher level, but it's just too hard to prove that that is correct. It's just too complicated to rely on so much from a great distance.
This might not be the simplest example of where we could run into trouble, but it's one example that I could see. The assumption that downlinks and highkeys discretely separate ranges in the key space is probably made many times. There could be more problematic spots, and it's really hard to know where they might be. :-(

In general, it's common for any modification to the B-Tree code to only break in a very subtle way, like this. I would be more comfortable if I knew the patch received extensive stress-testing, probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But generally, I believe we should not allow the key space to fail to be separated fully by downlinks and high keys, even if our original "Ki <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work happens to mask the problems in simple testing. It's too different to what we have today.

> I'll add a flag to distinguish regular and truncated tuples, but it
> will not be used in this patch. Please comment if I've missed
> something.
> As you've already mentioned, neither high keys nor tuples on internal
> pages use "itup->t_tid.ip_posid", so I'll take one bit of it.
>
> It will definitely require changes in future work on suffix truncation
> or something like that, but IMHO for now it's enough.

I think that we need to discuss whether or not it's okay that we can have that fully-matching-downlink case before we can be sure either way.

--
Peter Geoghegan
05.04.2016 01:48, Peter Geoghegan:
> On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> It's a bit more complicated to add it into the index creation algorithm.
>> There's a trick with a "high key":
>> /*
>>  * We copy the last item on the page into the new page, and then
>>  * rearrange the old page so that the 'last item' becomes its high key
>>  * rather than a true data item. There had better be at least two
>>  * items on the page already, else the page would be empty of useful
>>  * data.
>>  */
>> /*
>>  * Move 'last' into the high key position on opage
>>  */
>>
>> To be consistent with other steps of the algorithm (all high keys must
>> be truncated tuples), I had to update this high key in place:
>> delete the old one, and insert a truncated high key.
> Hmm. But the high key comparing equal to the Scankey gives insertion
> the choice of where to put its IndexTuple (it can go on the page with
> the high key, or its right-sibling, according only to considerations
> about fillfactor, etc). Is this changed? Does it not matter? Why not?
> Is it just worth it?

I would say this is changed, but it doesn't matter. Performing any search in btree (including choosing a suitable page for insertion), we use only key attributes. We assume that included columns are stored in the index unordered.

A simple example:

create table tbl(id int, data int);
create index idx on tbl (id) including (data);

A select query does not consider included columns in the scan key. It selects all tuples satisfying the condition on the key column, and only after that applies a filter to remove wrong rows from the result. If the key attribute doesn't satisfy the query condition, there are no more tuples to return and we can interrupt the scan. You can find more explanation in the attached sql script, which contains queries to receive detailed information about the index structure using pageinspect.

> The right-most page on every level has no high-key. But you say those
> pages have an "imaginary" *positive* infinity high key, just as
> internal pages have (non-imaginary) minus infinity downlinks as their
> first item/downlink. So tuples in a (say) leaf page are always bound
> by the downlink lower bound in parent, while their own high key is an
> upper bound. Either (and, rarely, both) could be (positive or
> negative) infinity.
>
> Maybe you now see why I talked about special _bt_compare() logic for
> this. I proposed special logic that is similar to the existing minus
> infinity thing _bt_compare() does (although _bt_binsrch(), an
> important caller of _bt_compare(), also does special things for the
> internal vs. leaf case, so I'm not sure any new special logic must go
> in _bt_compare()).
>
>> In my view, it's the correct way to fix this problem, because the
>> caller is responsible for passing proper arguments to the function.
>> Of course I will add a check into _bt_compare(), but I'd rather make
>> it an assertion (see the patch attached).
> I see what you mean, but I think we need to decide what to do about
> the key space when leaf high keys are truncated. I do think that
> truncating the high key was the right idea, though, and it nicely
> illustrates that nothing special should happen in upper levels. Suffix
> truncation should only happen when leaf pages are split, generally
> speaking.
> As I said, the high key is very similar to the downlinks, in that both
> bound the things that go on each page.
> Together with downlinks they
> represent *discrete* ranges *unambiguously*, so INCLUDING truncation
> needs to make it clear which page new items go on. As I said,
> _bt_binsrch() already takes special actions for internal pages, making
> sure to return the item that is < the scankey, not <= the scankey
> which is only allowed for leaf pages. (See README, from "Lehman and
> Yao assume that the key range for a subtree S is described by Ki < v
> <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
> page...").
>
> To give a specific example, I worry about the case where two sibling
> downlinks in a parent page are distinct, but per specific-to-Postgres
> "Ki <= v <= Ki+1" thing (which differs from the classic L&Y
> invariant), some tuples with all right downlink's attributes matching
> end up in left child page, not right child page. I worry that since
> _bt_findsplitloc() doesn't consider this (for example), the split
> point doesn't *reliably* and unambiguously divide the key space
> between the new halves of a page being split. I think the "Ki <= v <=
> Ki+1"/_bt_binsrch() thing might save you in common cases where all
> downlink attributes are distinct, so maybe that simpler case is okay.
> But to be even more specific, what about the more complicated case
> where the downlinks *are* fully _bt_compare()-wise equal? This could
> happen even though they're constrained to be unique in leaf pages, due
> to bloat. Unique indexes aren't special here; they just make it far
> less likely that this would happen in practice, because it takes a lot
> of bloat. Less importantly, when that bloat happens, you don't want to
> have to do a linear scan through many leaf pages (that should only
> happen when there are many fully matching IndexTuples at the leaf
> level -- not just matching on constrained attributes).

"just matching on constrained attributes" is the core idea of the whole patch. Included columns just provide us the possibility to use index-only scans. Nothing more. We assume a use case where an index-only scan is faster than an index scan + heap fetch. For example, in queries like "select data from tbl where id = 1;" we have no scan condition on data. Maybe you're afraid of a long linear scan when we have enormous index bloat, even on a unique index. It will happen anyway, whether we have an index-only scan on a covering index or an index scan on a unique index + heap fetch. The only difference is that the covering index is faster.

At the very beginning of the proposal discussion, I suggested implementing a third kind of columns, which are not constrained, but used in the scankey. They must have an opclass to do it, and they are not truncated. But it was decided to abandon this feature.

> The more I think about it, the more I doubt that it's okay to not
> ensure downlinks are always distinct with their siblings, by sometimes
> including non-constrained (truncatable) attributes within internal
> pages, as needed to *distinguish* downlinks (also, we must
> occasionally have *all* attributes including truncatable attributes in
> internal pages -- we must truncate nothing to keep the key space sane
> in the parent). Unfortunately, these requirements are very close to
> the actual full requirements for a full, complete suffix truncation
> patch, including storing how many attributes are stored in each and
> every internal IndexTuple (no general thing for the index), page split
> code to determine where to truncate to make adjacent downlinks
> distinct, etc.
> You may think: But that fully-matching-downlink case is okay, because
> it only makes us do more linear scanning due to the lack of
> non-truncatable attributes, which is still correct, if a little more
> slow when there is bloat -- at the leaf level, we'll start at the
> correct place (the first place the item could be on), per the "Ki <= v
> <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
> need to be able to reliably detect a concurrent page-split. Otherwise,
> we'll move right within _bt_search() before even considering if
> anything of interest for our index scan *might* be on the initial page
> found from downlink (before even calling _bt_binsrch()). Even this bug
> wouldn't happen in the common case where nextkey = true, but what
> about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
> we are not supposed to by spuriously moving right, I think. I have a
> bad feeling that even then we'd "accidentally fail to fail", because
> of how backwards scans work at a higher level, but it's just too hard
> to prove that that is correct. It's just too complicated to rely on so
> much from a great distance.
>
> This might not be the simplest example of where we could run into
> trouble, but it's one example that I could see. The assumption that
> downlinks and highkeys discretely separate ranges in the key space is
> probably made many times. There could be more problematic spots, and
> it's really hard to know where they might be. :-(
>
> In general, it's common for any modification to the B-Tree code to
> only break in a very subtle way, like this. I would be more
> comfortable if I knew the patch received extensive stress-testing,
> probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
> generally, I believe we should not allow the key space to fail to be
> separated fully by downlinks and high keys, even if our original "Ki
> <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
> happens to mask the problems in simple testing. It's too different to
> what we have today.

Frankly, I still do not understand what you're worried about. If the high key is greater than the scan key, we definitely cannot find any more tuples, because key attributes are ordered. If the high key is equal to the scan key, we will continue searching and read the next page. The code is not changed here; it is the same as the processing of duplicates spread over several pages.

If you do not trust the PostgreSQL btree changes to L&Y that make duplicates work, I don't know what to say, but it's definitely not related to my patch. Of course I do not mind if someone does more testing. I did some tests and didn't find anything special. Besides, don't we have special alpha and beta release stages to find tricky bugs?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
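A minimal sketch of the scenario argued about above (the table and index names here are illustrative assumptions, not taken from the patch's test suite):

-- A covering unique index: "id" is the key column, "data" is payload only.
CREATE TABLE tbl (id int, data text);
CREATE UNIQUE INDEX tbl_idx ON tbl USING btree (id) INCLUDING (data);

-- There is no scan condition on "data". With the covering index this query
-- can use an index-only scan; a plain unique index on (id) would need a heap
-- fetch for every matching row.
EXPLAIN SELECT data FROM tbl WHERE id = 1;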
On Tue, Apr 5, 2016 at 7:56 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > I would say, this is changed, but it doesn't matter. Actually, I would now say that it hasn't really changed (see below), based on my new understanding. The *choice* to go on one page or the other still exists. > Performing any search in btree (including choosing suitable page for > insertion), we use only key attributes. > We assume that included columns are stored in index unordered. The patch assumes no ordering for the non-indexed columns in the index? While I knew that the patch was primarily motivated by enabling index-only scans, I didn't realize that at all. The patch is much much less like a general suffix truncation patch than I thought. I may have been confused in part by the high key issue that you only recently fixed, but you should have corrected me about suffix truncation earlier. Obviously, this was a significant misunderstanding; we have been "talking at cross purposes" this whole time. There seems to have been significant misunderstanding about this before now: http://www.postgresql.org/message-id/CAKJS1f9W0aB-g7H6yYgNBq7hJsOKF3UwHU7-Q5jobbaTyK9f4g@mail.gmail.com My new understanding: The extra "included" columns are stored in the index, but do not affect its sort order at all. They are no more part of the key than, say, the heap TID that the key points to. They are just "payload". > "just matching on constrained attributes" is the core idea of the whole > patch. Included columns just provide us possibility to use index-only scan. > Nothing more. We assume use case where index-only-scan is faster than > index-scan + heap fetch. For example, in queries like "select data from tbl > where id = 1;" we have no scan condition on data. Maybe you afraid of long > linear scan when we have enormous index bloat even on unique index. It will > happen anyway, whether we have index-only scan on covering index or > index-scan on unique index + heap fetch. The only difference is that the > covering index is faster. My concern about performance when that happens is very much secondary. I really only mentioned it to help explain my primary concern. > At the very beginning of the proposal discussion, I suggested to implement > third kind of columns, which are not constrained, but used in scankey. > They must have op class to do it, and they are not truncated. But it was > decided to abandon this feature. I must have missed that. Obviously, I wasn't paying enough attention to earlier discussion. Earlier versions of the patch did fail to recognize that the sort order was not the entire indexed order, but that isn't the case with V8. That that was ever possible was only a bug, it turns out. >> The more I think about it, the more I doubt that it's okay to not >> ensure downlinks are always distinct with their siblings, by sometimes >> including non-constrained (truncatable) attributes within internal >> pages, as needed to *distinguish* downlinks (also, we must >> occasionally have *all* attributes including truncatable attributes in >> internal pages -- we must truncate nothing to keep the key space sane >> in the parent). > Frankly, I still do not understand what you're worried about. > If high key is greater than the scan key, we definitely cannot find any more > tuples, because key attributes are ordered. > If high key is equal to the scan key, we will continue searching and read > next page. 
I thought, because of the emphasis on unique indexes, that this patch was mostly to offer a way of getting an index with uniqueness only enforced on certain columns, but otherwise just the same as having a non-unique index on those same columns. Plus, some suffix truncation, because point-lookups involving later attributes are unlikely to be useful when this is scoped to just unique indexes (which were emphasized by you), because truncating key columns is not helpful unless bloat is terrible. I now understand that it was quite wrong to link this to suffix truncation at all. The two are really not the same. That does make the patch seem significantly simpler, at least as far as nbtree goes; a tool like amcheck is not likely to detect problems in this patch that a human tester could not catch. That was the kind of problem that I feared. > The code is not changed here, it is the same as processing of duplicates > spreading over several pages. If you do not trust postgresql btree changes > to the L&Y to make duplicates work, I don't know what to say, but it's > definitely not related to my patch. My point about the postgres btree changes to L&Y to make duplicates work is that I think it makes the patch work, but perhaps not absolutely reliably. I don't have any specific misgivings about it on its own. Again, my earlier remarks were based on a misguided understanding of the patch, so it doesn't matter now. Communication is hard. There may be a lesson here for both of us about that. > Of course I do not mind if someone will do more testing. > I did some tests and didn't find anything special. Besides, don't we have > special alpha and beta release stages to find tricky bugs? Our history of committing performance improvements to the B-Tree code is limited, particularly in the last 5 years. That's definitely a problem, and one that I have tried to make smaller, but it is the reality. BTW, I can see why you used index_reform_tuple(), rather than trying to modify an existing tuple in place. NULL bitmaps have a storage overhead in IndexTuples (presumably an alternative approach would make truncated IndexTuples have NULL attributes to represent truncation), whereas the cost of index_reform_tuple() only has to be paid when there is a leaf page split. It's important that truncation is 100% guaranteed to produce a tuple smaller than the inserted tuple, otherwise the user could get a non-recoverable "1/3 of page size exceeded" when they were not the one to insert the big IndexTuple. I should try to see if this could be possible due to some index_reform_tuple() edge-case. -- Peter Geoghegan
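A rough sketch of the "1/3 of page size" concern mentioned above (a sketch assuming default 8kB pages, where the nbtree per-tuple limit is roughly 2700 bytes; the md5-based payload is just a way to defeat compression, and the exact failure threshold may differ):

-- The key alone is tiny, but the leaf tuple must carry key + included payload.
CREATE TABLE wide (k text, payload text);
CREATE UNIQUE INDEX wide_idx ON wide (k) INCLUDING (payload);

-- ~4160 bytes of incompressible payload: the whole index tuple should exceed
-- the btree maximum, even though the key column easily fits.
INSERT INTO wide
SELECT repeat('k', 100), string_agg(md5(g::text), '')
FROM generate_series(1, 130) g;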
On Tue, Apr 5, 2016 at 1:31 PM, Peter Geoghegan <pg@heroku.com> wrote: > My new understanding: The extra "included" columns are stored in the > index, but do not affect its sort order at all. They are no more part > of the key than, say, the heap TID that the key points to. They are > just "payload". Noticed a few issues following another pass: * tuplesort.c should handle the CLUSTER case in the same way as the btree case. No? * Why have a RelationGetNumberOfAttributes(indexRel) call in tuplesort_begin_index_btree() at all now? * This critical section is unnecessary, because this happens during index builds: + if (indnkeyatts != indnatts && P_ISLEAF(opageop)) + { + /* + * It's essential to truncate High key here. + * The purpose is not just to save more space on this particular page, + * but to keep whole b-tree structure consistent. Subsequent insertions + * assume that hikey is already truncated, and so they should not + * worry about it, when copying the high key into the parent page + * as a downlink. + * NOTE It is not crutial for reliability in present, + * but maybe it will be that in the future. + * NOTE this code will be changed by the "btree compression" patch, + * which is in progress now. + */ + keytup = index_reform_tuple(wstate->index, oitup, + indnatts, indnkeyatts); + + /* delete "wrong" high key, insert keytup as P_HIKEY. */ + START_CRIT_SECTION(); + PageIndexTupleDelete(opage, P_HIKEY); + + if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY)) + elog(ERROR, "failed to rewrite compressed item in index \"%s\"", + RelationGetRelationName(wstate->index)); + END_CRIT_SECTION(); + } Note that START_CRIT_SECTION() promotes any ERROR to PANIC, which isn't useful here, because we have no buffer lock held, and nothing must be WAL-logged. * Think you forgot to update spghandler(). (You did not add a test for just that one AM, either) * I wonder why this restriction needs to exist: + else + elog(ERROR, "Expressions are not supported in included columns."); What does not supporting it buy us? Was it just that the pg_index representation is more complicated, and you wanted to put it off? An error like this should use ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED ..., btw. * I would like to see index_reform_tuple() assert that the new, truncated index tuple is definitely <= the original (I worry about the 1/3 page restriction issue). Maybe you should also change the name of index_reform_tuple(), per David. * There is some stray whitespace within RelationGetIndexAttrBitmap(). I think you should have updated it with code, though. I don't think it's necessary for HOT updates to work, but I think it could be necessary so that we don't need to get a row lock that blocks non-conflict foreign key locking (see heap_update() callers). I think you need to be careful for non-key columns within the loop in RelationGetIndexAttrBitmap(), basically, because it seems to still go through all columns. UPSERT also must call this code, FWIW. * I think that a similar omission is also made for the replica identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed on how this patch interacts with logical decoding, I guess. 
* Valgrind shows an error with an aggregate statement I tried: 2016-04-05 17:01:31.129 PDT 12310 LOG: statement: explain analyze select count(*) from ab where b > 5 group by a, b; ==12310== Invalid read of size 4 ==12310== at 0x656615: match_clause_to_indexcol (indxpath.c:2226) ==12310== by 0x656615: match_clause_to_index (indxpath.c:2144) ==12310== by 0x656DBC: match_clauses_to_index (indxpath.c:2115) ==12310== by 0x658054: match_restriction_clauses_to_index (indxpath.c:2026) ==12310== by 0x658054: create_index_paths (indxpath.c:269) ==12310== by 0x64D1DB: set_plain_rel_pathlist (allpaths.c:649) ==12310== by 0x64D1DB: set_rel_pathlist (allpaths.c:427) ==12310== by 0x64D93B: set_base_rel_pathlists (allpaths.c:299) ==12310== by 0x64D93B: make_one_rel (allpaths.c:170) ==12310== by 0x66876C: query_planner (planmain.c:246) ==12310== by 0x669FBA: grouping_planner (planner.c:1666) ==12310== by 0x66D0C9: subquery_planner (planner.c:751) ==12310== by 0x66D3DA: standard_planner (planner.c:300) ==12310== by 0x66D714: planner (planner.c:170) ==12310== by 0x6FD692: pg_plan_query (postgres.c:798) ==12310== by 0x59082D: ExplainOneQuery (explain.c:350) ==12310== Address 0xbff290c is 2,508 bytes inside a block of size 8,192 alloc'd ==12310== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==12310== by 0x81B7FA: AllocSetAlloc (aset.c:853) ==12310== by 0x81D257: palloc (mcxt.c:907) ==12310== by 0x4B6F65: RelationGetIndexScan (genam.c:94) ==12310== by 0x4C135D: btbeginscan (nbtree.c:431) ==12310== by 0x4B7A5C: index_beginscan_internal (indexam.c:279) ==12310== by 0x4B7C5A: index_beginscan (indexam.c:222) ==12310== by 0x4B73D1: systable_beginscan (genam.c:379) ==12310== by 0x7E8CF9: ScanPgRelation (relcache.c:341) ==12310== by 0x7EB3C4: RelationBuildDesc (relcache.c:951) ==12310== by 0x7ECD35: RelationIdGetRelation (relcache.c:1800) ==12310== by 0x4A4D37: relation_open (heapam.c:1118) ==12310== { <insert_a_suppression_name_here> Memcheck:Addr4 fun:match_clause_to_indexcol fun:match_clause_to_index fun:match_clauses_to_index fun:match_restriction_clauses_to_index fun:create_index_paths fun:set_plain_rel_pathlist fun:set_rel_pathlist fun:set_base_rel_pathlists fun:make_one_rel fun:query_planner fun:grouping_planner fun:subquery_planner fun:standard_planner fun:planner fun:pg_plan_query fun:ExplainOneQuery } Separately, I tried "make installcheck-tests TESTS=index_including" from Postgres + Valgrind, with Valgrind's --track-origins option enabled (as it was above). I recommend installing Valgrind, and making sure that the patch shows no errors. I didn't actually find a Valgrind issue from just using your regression tests (nor did I find an issue from separately running the regression tests with CLOBBER_CACHE_ALWAYS, FWIW). -- Peter Geoghegan
06.04.2016 03:05, Peter Geoghegan: > On Tue, Apr 5, 2016 at 1:31 PM, Peter Geoghegan<pg@heroku.com> wrote: >> My new understanding: The extra "included" columns are stored in the >> index, but do not affect its sort order at all. They are no more part >> of the key than, say, the heap TID that the key points to. They are >> just "payload". It was a really long and complicated discussion. I'm glad that we are finally in agreement about the patch. Anyway, I think all the questions raised will be very helpful for future work on the b-tree. > Noticed a few issues following another pass: > > * tuplesort.c should handle the CLUSTER case in the same way as the > btree case. No? Yes, I just missed that cluster uses index sort. Fixed. > * Why have a RelationGetNumberOfAttributes(indexRel) call in > tuplesort_begin_index_btree() at all now? Fixed. > * This critical section is unnecessary, because this happens during > index builds: > > + if (indnkeyatts != indnatts && P_ISLEAF(opageop)) > + { > + /* > + * It's essential to truncate High key here. > + * The purpose is not just to save more space > on this particular page, > + * but to keep whole b-tree structure > consistent. Subsequent insertions > + * assume that hikey is already truncated, and > so they should not > + * worry about it, when copying the high key > into the parent page > + * as a downlink. > + * NOTE It is not crutial for reliability in present, > + * but maybe it will be that in the future. > + * NOTE this code will be changed by the > "btree compression" patch, > + * which is in progress now. > + */ > + keytup = index_reform_tuple(wstate->index, oitup, > + > indnatts, indnkeyatts); > + > + /* delete "wrong" high key, insert keytup as > P_HIKEY. */ > + START_CRIT_SECTION(); > + PageIndexTupleDelete(opage, P_HIKEY); > + > + if (!_bt_pgaddtup(opage, > IndexTupleSize(keytup), keytup, P_HIKEY)) > + elog(ERROR, "failed to rewrite > compressed item in index \"%s\"", > + RelationGetRelationName(wstate->index)); > + END_CRIT_SECTION(); > + } > > Note that START_CRIT_SECTION() promotes any ERROR to PANIC, which > isn't useful here, because we have no buffer lock held, and nothing > must be WAL-logged. > > * Think you forgot to update spghandler(). (You did not add a test for > just that one AM, either) Fixed. > * I wonder why this restriction needs to exist: > > + else > + elog(ERROR, "Expressions are not supported in > included columns."); > > What does not supporting it buy us? Was it just that the pg_index > representation is more complicated, and you wanted to put it off? > > An error like this should use ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED ..., btw. Yes, you got it right. It was a bit complicated to implement, and I decided to delay it to the next patch. The errmsg is fixed. > * I would like to see index_reform_tuple() assert that the new, > truncated index tuple is definitely <= the original (I worry about the > 1/3 page restriction issue). Maybe you should also change the name of > index_reform_tuple(), per David. Is it possible that the new tuple, containing fewer attributes than the old one, will have a greater size? Maybe you can give an example? I think that Assert(indnkeyatts <= indnatts); covers this kind of error. I do not mind renaming this function, but what name would be better? index_truncate_tuple()? > * There is some stray whitespace within RelationGetIndexAttrBitmap(). > I think you should have updated it with code, though.
I don't think > it's necessary for HOT updates to work, but I think it could be > necessary so that we don't need to get a row lock that blocks > non-conflict foreign key locking (see heap_update() callers). I think > you need to be careful for non-key columns within the loop in > RelationGetIndexAttrBitmap(), basically, because it seems to still go > through all columns. UPSERT also must call this code, FWIW. > > * I think that a similar omission is also made for the replica > identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed > on how this patch interacts with logical decoding, I guess. Good point. Indexes are everywhere in the code. I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX. I'll discuss it with Theodor and send an updated patch tomorrow. > * Valgrind shows an error with an aggregate statement I tried: > > 2016-04-05 17:01:31.129 PDT 12310 LOG: statement: explain analyze > select count(*) from ab where b > 5 group by a, b; > ==12310== Invalid read of size 4 > ==12310== at 0x656615: match_clause_to_indexcol (indxpath.c:2226) > ==12310== by 0x656615: match_clause_to_index (indxpath.c:2144) > ==12310== by 0x656DBC: match_clauses_to_index (indxpath.c:2115) > ==12310== by 0x658054: match_restriction_clauses_to_index (indxpath.c:2026) > ==12310== by 0x658054: create_index_paths (indxpath.c:269) > ==12310== by 0x64D1DB: set_plain_rel_pathlist (allpaths.c:649) > ==12310== by 0x64D1DB: set_rel_pathlist (allpaths.c:427) > ==12310== by 0x64D93B: set_base_rel_pathlists (allpaths.c:299) > ==12310== by 0x64D93B: make_one_rel (allpaths.c:170) > ==12310== by 0x66876C: query_planner (planmain.c:246) > ==12310== by 0x669FBA: grouping_planner (planner.c:1666) > ==12310== by 0x66D0C9: subquery_planner (planner.c:751) > ==12310== by 0x66D3DA: standard_planner (planner.c:300) > ==12310== by 0x66D714: planner (planner.c:170) > ==12310== by 0x6FD692: pg_plan_query (postgres.c:798) > ==12310== by 0x59082D: ExplainOneQuery (explain.c:350) > ==12310== Address 0xbff290c is 2,508 bytes inside a block of size 8,192 alloc'd > ==12310== at 0x4C2AB80: malloc (in > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==12310== by 0x81B7FA: AllocSetAlloc (aset.c:853) > ==12310== by 0x81D257: palloc (mcxt.c:907) > ==12310== by 0x4B6F65: RelationGetIndexScan (genam.c:94) > ==12310== by 0x4C135D: btbeginscan (nbtree.c:431) > ==12310== by 0x4B7A5C: index_beginscan_internal (indexam.c:279) > ==12310== by 0x4B7C5A: index_beginscan (indexam.c:222) > ==12310== by 0x4B73D1: systable_beginscan (genam.c:379) > ==12310== by 0x7E8CF9: ScanPgRelation (relcache.c:341) > ==12310== by 0x7EB3C4: RelationBuildDesc (relcache.c:951) > ==12310== by 0x7ECD35: RelationIdGetRelation (relcache.c:1800) > ==12310== by 0x4A4D37: relation_open (heapam.c:1118) > ==12310== > { > <insert_a_suppression_name_here> > Memcheck:Addr4 > fun:match_clause_to_indexcol > fun:match_clause_to_index > fun:match_clauses_to_index > fun:match_restriction_clauses_to_index > fun:create_index_paths > fun:set_plain_rel_pathlist > fun:set_rel_pathlist > fun:set_base_rel_pathlists > fun:make_one_rel > fun:query_planner > fun:grouping_planner > fun:subquery_planner > fun:standard_planner > fun:planner > fun:pg_plan_query > fun:ExplainOneQuery > } > > Separately, I tried "make installcheck-tests TESTS=index_including" > from Postgres + Valgrind, with Valgrind's --track-origins option > enabled (as it was above). I recommend installing Valgrind, and making > sure that the patch shows no errors. 
I didn't actually find a Valgrind > issue from just using your regression tests (nor did I find an issue > from separately running the regression tests with > CLOBBER_CACHE_ALWAYS, FWIW). > Thank you for the advice. There was another missed index->ncolumns to index->nkeycolumns replacement, in match_clause_to_index. Fixed. I also fixed a couple of typos in the documentation. Thank you again for the detailed review. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
06.04.2016 16:15, Anastasia Lubennikova : > 06.04.2016 03:05, Peter Geoghegan: >> * There is some stray whitespace within RelationGetIndexAttrBitmap(). >> I think you should have updated it with code, though. I don't think >> it's necessary for HOT updates to work, but I think it could be >> necessary so that we don't need to get a row lock that blocks >> non-conflict foreign key locking (see heap_update() callers). I think >> you need to be careful for non-key columns within the loop in >> RelationGetIndexAttrBitmap(), basically, because it seems to still go >> through all columns. UPSERT also must call this code, FWIW. >> >> * I think that a similar omission is also made for the replica >> identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed >> on how this patch interacts with logical decoding, I guess. > > Good point. Indexes are everywhere in the code. > I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX. > I'll discuss it with Theodor and send an updated patch tomorrow. As promised, the updated patch is attached. But I'm not an expert in this area, so it needs a 'critical look'. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
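A sketch of why RelationGetIndexAttrBitmap() matters for this patch (the HOT behavior described in the comments is the expected semantics under discussion, not verified output from this patch version; names are illustrative):

CREATE TABLE t4 (id int, extra int, unindexed int);
CREATE UNIQUE INDEX t4_idx ON t4 (id) INCLUDING (extra);
INSERT INTO t4 VALUES (1, 0, 0);

-- "extra" is physically stored in the index, so an update touching it has to
-- update the index as well (no HOT update):
UPDATE t4 SET extra = extra + 1 WHERE id = 1;

-- "unindexed" appears in no index, so this update can be HOT:
UPDATE t4 SET unindexed = unindexed + 1 WHERE id = 1;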
On Wed, Apr 6, 2016 at 6:15 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: >> * I would like to see index_reform_tuple() assert that the new, >> truncated index tuple is definitely <= the original (I worry about the >> 1/3 page restriction issue). Maybe you should also change the name of >> index_reform_tuple(), per David. > > Is it possible that the new tuple, containing less attributes than the old > one, will have a greater size? > Maybe you can give an example? > I think that Assert(indnkeyatts <= indnatts); covers this kind of errors. I don't think it is possible, because you aren't e.g. making an attribute's value NULL where it wasn't NULL before (making the IndexTuple contain a NULL bitmap where it didn't before). But that's kind of subtle, and it certainly seems worth an assertion. It could change tomorrow, when someone optimizes heap_deform_tuple(), which has been proposed more than once. Personally, I like documenting assertions, and will sometimes write assertions that the compiler could easily optimize away. Maybe going *that* far is more a matter of personal style, but I think an assertion about the new index tuple size being <= the old one is just a good idea. It's not about a problem in your code at all. > I do not mind to rename this function, but what name would be better? > index_truncate_tuple()? That seems better, yes. -- Peter Geoghegan
On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote: > Personally, I like documenting assertions, and will sometimes write > assertions that the compiler could easily optimize away. Maybe going > *that* far is more a matter of personal style, but I think an > assertion about the new index tuple size being <= the old one is just > a good idea. It's not about a problem in your code at all. You should make index_truncate_tuple()/index_reform_tuple() promise to always do this in its comments/contract with caller as part of this, IMV. -- Peter Geoghegan
06.04.2016 23:52, Peter Geoghegan: > On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Personally, I like documenting assertions, and will sometimes write >> assertions that the compiler could easily optimize away. Maybe going >> *that* far is more a matter of personal style, but I think an >> assertion about the new index tuple size being <= the old one is just >> a good idea. It's not about a problem in your code at all. > You should make index_truncate_tuple()/index_reform_tuple() promise to > always do this in its comments/contract with caller as part of this, > IMV. > The mentioned issues are fixed; the patch is attached. I'd like to remind you that the commitfest will be closed very, very soon, so I'd like to get your final resolution on the patch. Not having it in the 9.6 release would be very disappointing. I agree that the b-tree is a crucial subsystem. But it seems to me that we lack improvements in this area not only because of the algorithm's complexity, but also because of a lack of enthusiasts willing to work on it and struggle through endless discussions. But that's off-topic here; attention to these development difficulties will be one of the messages of my pgcon talk. You know, we lost a lot of time discussing various b-tree problems. Besides that, I am sure that the patch is really in good shape. It has no open problems left to fix, and possible subtle bugs can be found at the testing stage of the release. Looking forward to your reply. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote: >> Personally, I like documenting assertions, and will sometimes write >> assertions that the compiler could easily optimize away. Maybe going >> *that* far is more a matter of personal style, but I think an >> assertion about the new index tuple size being <= the old one is just >> a good idea. It's not about a problem in your code at all. > > You should make index_truncate_tuple()/index_reform_tuple() promise to > always do this in its comments/contract with caller as part of this, > IMV. > Some notices: - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts, int indnkeyatts) Why do we need indnatts/indnkeyatts? They are already present in the idxrel struct. - Following the code where index_truncate_tuple() is called, it should never be called in the case where indnatts == indnkeyatts. So, indnkeyatts should be strictly less than indnatts; please change the assertion. If they are equal, this function becomes a complicated variant of CopyIndexTuple(). -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
08.04.2016 15:06, Teodor Sigaev: >> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote: >>> Personally, I like documenting assertions, and will sometimes write >>> assertions that the compiler could easily optimize away. Maybe going >>> *that* far is more a matter of personal style, but I think an >>> assertion about the new index tuple size being <= the old one is just >>> a good idea. It's not about a problem in your code at all. >> >> You should make index_truncate_tuple()/index_reform_tuple() promise to >> always do this in its comments/contract with caller as part of this, >> IMV. >> > Some notices: > - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts, > int indnkeyatts) > Why we need indnatts/indnkeyatts? They are presented in idxrel struct > already > - follow code where index_truncate_tuple() is called, it should never > called in > case where indnatts == indnkeyatts. So, indnkeyatts should be > strictly less > than indnatts, pls, change assertion. If they are equal the this > function > becomes complicated variant of CopyIndexTuple() Good point. These attributes seem to be there since previous versions of the function. But now they are definitely unnecessary. Updated patch is attached -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
08.04.2016 15:45, Anastasia Lubennikova: > 08.04.2016 15:06, Teodor Sigaev: >>> On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <pg@heroku.com> wrote: >>>> Personally, I like documenting assertions, and will sometimes write >>>> assertions that the compiler could easily optimize away. Maybe going >>>> *that* far is more a matter of personal style, but I think an >>>> assertion about the new index tuple size being <= the old one is just >>>> a good idea. It's not about a problem in your code at all. >>> >>> You should make index_truncate_tuple()/index_reform_tuple() promise to >>> always do this in its comments/contract with caller as part of this, >>> IMV. >>> >> Some notices: >> - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int >> indnatts, >> int indnkeyatts) >> Why do we need indnatts/indnkeyatts? They are already present in the idxrel struct. >> - Following the code where index_truncate_tuple() is called, it should never >> be called in the case where indnatts == indnkeyatts. So, indnkeyatts should be >> strictly less than indnatts; please change the assertion. If they are equal, this >> function becomes a complicated variant of CopyIndexTuple(). > > Good point. These attributes seem to be there since previous versions > of the function. > But now they are definitely unnecessary. Updated patch is attached One more improvement: a note about expressions in the documentation. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Sooner or later, I'd like to see this patch finished.
For now, it has two complaints:
- support of expressions as included columns.
Frankly, I don't understand why this is a problem with the patch.
The patch is already big enough, and it will be much easier to add expression support in a follow-up patch, once the first one is stable (see the sketch after this message for the currently rejected case).
I wonder if anyone objects to that?
Yes, it's a kind of delayed feature. But should we hold every patch until it is entirely complete?
- lack of review and testing
Obviously I did as much testing as I could.
So, if reviewers have any concerns about the patch, I'm looking forward to seeing them.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
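A minimal illustration of the rejected case (the second CREATE INDEX is expected to fail; the exact error wording comes from an earlier review comment and may differ):

CREATE TABLE doc (c1 int, c2 text);

-- Plain columns in INCLUDING are supported:
CREATE UNIQUE INDEX doc_idx ON doc (c1) INCLUDING (c2);

-- An expression in INCLUDING is not supported by this version of the patch:
CREATE UNIQUE INDEX doc_expr_idx ON doc (c1) INCLUDING (lower(c2));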
On Tue, Apr 12, 2016 at 9:14 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > Sooner or later, I'd like to see this patch finished. Me, too. > For now, it has two complaints: > - support of expressions as included columns. > Frankly, I don't understand, why it's a problem of the patch. > The patch is already big enough and it will be much easier to add > expressions support in the following patch, after the first one will be > stable. > I wonder, if someone has objections to that? Probably. If we limit the scope of something, it's always in a way that limits the functionality available to users, rather than limits how generalized the new functionality is, and so cutting scope sometimes isn't possible. There is a very high value placed on features working well together. A user ought to be able to rely on the intuition that features work well together. Preserving that general ability for users to guess correctly what will work based on what they already know is seen as important. For example, notice that the INSERT documentation allows UPSERT unique index inference to optionally accept an opclass or collation. So far, the need for this functionality is totally theoretical (in practice all B-Tree opclasses have the same idea about equality across a given type, and we have no case insensitive collations), but it's still there. Making that work was not a small effort (there was a follow-up bugfix commit just for that, too). This approach is mostly about making the implementation theoretically sound (or demonstrating that it is) by considering edge-cases up-front. Often, there will be benefits to a maximally generalized approach that were not initially anticipated by the patch author, or anyone else. I agree that it is difficult to uphold this standard at all times, but there is something to be said for it. Postgres development must have a very long term outlook, and this approach tends to make things easier for future patch authors by making the code more maintainable. Even if this is the wrong thing in specific cases, it's sometimes easier to just do it than to convince others that their concern is misplaced in this one instance. > Yes, it's a kind of delayed feature. But should we wait for every patch when > it will be entirely completed? I think that knowing where and how to cut scope is an important skill. If this question is asked as a general question, then the answer must be "yes". I suggest asking a more specific question. :-) > - lack of review and testing > Obviously I did as much testing as I could. > So, if reviewers have any concerns about the patch, I'm waiting forward to > see them. For what it's worth, I agree that you put a great deal of effort into this patch, and it did not get in to 9.6 because of a collective failure to focus minds on the patch. Your patch was a credible attempt, which is impressive when you consider that the B-Tree code is so complicated. There is also the fact that there is now a very small list of credible reviewers for B-Tree patches; you must have noticed that not even amcheck was committed, even though I was asked to produce a polished version in February during the FOSDEM dev meeting, and even though it's just a contrib module that is totally orientated around finding bugs and so on. I'm not happy about that either, but that's just something I have to swallow. 
I fancy myself as an expert on the B-Tree code, but I've never managed to make an impact in improving its performance at all (I've never made a serious effort, but have had many ideas). So, in case it needs to be said, I'll say it: You've chosen a very ambitious set of projects to work on, by any standard. I think it's a good thing that you've been ambitious, and I don't suggest changing that, since I think that you have commensurate skill. But, in order to be successful in these projects, patience and resolve are very important. -- Peter Geoghegan
On 4/27/16 5:08 PM, Peter Geoghegan wrote: > So, in case it needs to be > said, I'll say it: You've chosen a very ambitious set of projects to > work on, by any standard. I think it's a good thing that you've been > ambitious, and I don't suggest changing that, since I think that you > have commensurate skill. But, in order to be successful in these > projects, patience and resolve are very important. +1. This is very exciting work and I look forward to seeing it continue. The patch was perhaps not a good fit for the last CF of 9.6 but that doesn't mean it can't have a bright future. -- -David david@pgmasters.net
On Wed, Apr 27, 2016 at 5:47 PM, David Steele <david@pgmasters.net> wrote: > On 4/27/16 5:08 PM, Peter Geoghegan wrote: >> So, in case it needs to be >> said, I'll say it: You've chosen a very ambitious set of projects to >> work on, by any standard. I think it's a good thing that you've been >> ambitious, and I don't suggest changing that, since I think that you >> have commensurate skill. But, in order to be successful in these >> projects, patience and resolve are very important. > > +1. > > This is very exciting work and I look forward to seeing it continue. > The patch was perhaps not a good fit for the last CF of 9.6 but that > doesn't mean it can't have a bright future. +1. Totally agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, failed Spec compliant: tested, passed Documentation: tested, passed Hi hackers! I've read the patch and here is my code review. ==========PURPOSE============ I've used this feature from time to time with MS SQL. From my experience INCLUDE is a 'sugar on top' feature. Some MS SQL classes do not even mention INCLUDE despite it's there from 2005 (though classes do not mention lots of important things, so it's not a valuable indicator). But those who use it, use it whenever possible. For example, system views with recommended indices rarely list one without INCLUDE columns. So, this feature is very important from the perspective of converting MS SQL DBAs to PostgreSQL. This is how I see it. ========SUGGESTIONS========== 0. Index build is broken. This script https://github.com/x4m/pggistopt/blob/8ad65d2e305e98c836388a07909af5983dba9c73/test.sql SEGFAULTs and may cause a situation when you cannot insert anything into the table (I think dropping the index would help, but I didn't test this) 1. I think the MS SQL syntax INCLUDE instead of INCLUDING would be better (for the purpose listed above) 2. An empty line was added in ruleutils.c. Is it for a reason? 3. Now we have indnatts and indnkeyatts instead of just indnatts. I think it is worth considering renaming indnatts to something different from the old name. Someone somewhere could still suppose it's a number of keys. ========PERFORMANCE========== Due to suggestion number 0 I could not measure performance of index build. The index crashes when there are more than 1.1 million rows in a table. The performance test script is here https://github.com/x4m/pggistopt/blob/f206b4395baa15a2fa42897eeb27bd555619119a/test.sql The test scenario is the following: 1. Create table, then create index, then add data. 2. Make a query touching data in INCLUDING columns. This scenario is tested against a table with: A. An index that does not contain the touched columns, just the PK. B. An index with all columns in it. C. An index with the PK as keys, INCLUDING all other columns. Tests were executed 5 times on Ubuntu VM under Hyper-V i5 2500 CPU, 16 Gb of RAM, SSD disk. Time to insert 10M rows: A. AVG 110 seconds STD 4.8 B. AVG 121 seconds STD 2.0 C. AVG 111 seconds STD 5.7 Inserts to an INCLUDING index are almost as fast as inserts to an index without extra columns. Time to run SELECT query: A. AVG 2864 ms STD 794 B. AVG 2329 ms STD 84 C. AVG 2293 ms STD 58 Selects with INCLUDING columns are almost as fast as with the full index. Index size (deterministic measure, STD = 0) A. 317 MB B. 509 MB C. 399 MB Index size is in the middle between the full index and the minimal index. I think these numbers agree with expectations for the feature. ========CONCLUSION========== This patch brings a useful and important feature. The build shall be repaired; my other suggestions are only suggestions. Best regards, Andrey Borodin, Octonica & Ural Federal University. The new status of this patch is: Waiting on Author
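A sketch of the three index layouts from the A/B/C scenario above (the table shape and names are assumptions made for illustration; the linked test.sql is authoritative):

CREATE TABLE perf (id int, c2 int, c3 int, c4 int);

-- A: unique index on the key only
CREATE UNIQUE INDEX perf_a ON perf (id);
-- B: all columns as key columns
CREATE INDEX perf_b ON perf (id, c2, c3, c4);
-- C: unique key plus included payload columns
CREATE UNIQUE INDEX perf_c ON perf (id) INCLUDING (c2, c3, c4);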
14.08.2016 20:11, Andrey Borodin: > The following review has been posted through the commitfest application: > make installcheck-world: tested, passed > Implements feature: tested, failed > Spec compliant: tested, passed > Documentation: tested, passed > > Hi hackers! > > I've read the patch and here is my code review. > > ==========PURPOSE============ > I've used this feature from time to time with MS SQL. From my experience INCLUDE is a 'sugar on top' feature. > Some MS SQL classes do not even mention INCLUDE despite it's there from 2005 (though classes do not mention lots of important things, so it's not a valuable indicator). > But those who use it, use it whenever possible. For example, system views with recommended indices rarely list one without INCLUDE columns. > So, this feature is very important from the perspective of converting MS SQL DBAs to PostgreSQL. This is how I see it. Thank you for the review, I hope this feature will be useful for many people. > ========SUGGESTIONS========== > 0. Index build is broken. This script https://github.com/x4m/pggistopt/blob/8ad65d2e305e98c836388a07909af5983dba9c73/test.sql SEGFAULTs and may cause a situation when you cannot insert anything into the table (I think dropping the index would help, but I didn't test this) Thank you for reporting. That was a bug caused by high key truncation, which occurs when the index has more than 3 levels. Fixed. See attached file. > 1. I think the MS SQL syntax INCLUDE instead of INCLUDING would be better (for the purpose listed above) I've chosen this particular name to avoid adding a new keyword. We already have INCLUDING in postgres in the context of inheritance, which will never intersect with covering indexes. I'm sure it won't be a big problem for migration from MS SQL. > 2. An empty line was added in ruleutils.c. Is it for a reason? No, just a missed line. Fixed. > 3. Now we have indnatts and indnkeyatts instead of just indnatts. I think it is worth considering renaming indnatts to something different from the old name. Someone somewhere could still suppose it's a number of keys. I agree that the naming became vague after this patch. I've already suggested replacing "indkeys[]" with a more specific name, and AFAIR there was no reaction, so I didn't do that. But I am not sure about your suggestion regarding indnatts. Old queries (and old indexes) can still use it correctly. I don't see a reason to break compatibility for all users. Those who use this new feature should ensure that their queries to pg_index behave as expected. > ========PERFORMANCE========== > Due to suggestion number 0 I could not measure performance of index build. The index crashes when there are more than 1.1 million rows in a table. > The performance test script is here https://github.com/x4m/pggistopt/blob/f206b4395baa15a2fa42897eeb27bd555619119a/test.sql > The test scenario is the following: > 1. Create table, then create index, then add data. > 2. Make a query touching data in INCLUDING columns. > This scenario is tested against a table with: > A. An index that does not contain the touched columns, just the PK. > B. An index with all columns in it. > C. An index with the PK as keys, INCLUDING all other columns. > > Tests were executed 5 times on Ubuntu VM under Hyper-V i5 2500 CPU, 16 Gb of RAM, SSD disk. > Time to insert 10M rows: > A. AVG 110 seconds STD 4.8 > B. AVG 121 seconds STD 2.0 > C. AVG 111 seconds STD 5.7 > Inserts to an INCLUDING index are almost as fast as inserts to an index without extra columns. > > Time to run SELECT query: > A. AVG 2864 ms STD 794 > B. AVG 2329 ms STD 84 > C. 
AVG 2293 ms STD 58 > Selects with INCLUDING columns are almost as fast as with the full index. > > Index size (deterministic measure, STD = 0) > A. 317 MB > B. 509 MB > C. 399 MB > Index size is in the middle between the full index and the minimal index. > > I think these numbers agree with expectations for the feature. > > ========CONCLUSION========== > This patch brings a useful and important feature. The build shall be repaired; my other suggestions are only suggestions. > > > > Best regards, Andrey Borodin, Octonica & Ural Federal University. > > The new status of this patch is: Waiting on Author > -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
> That was a bug caused by high key truncation, which occurs when the index has more than 3 levels. Fixed. Affirmative. I've tested index construction with 100M rows and subsequent execution of select queries using the index; it works fine. Best regards, Andrey Borodin, Octonica & Ural Federal University.
On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: @@ -590,7 +622,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) if (last_off == P_HIKEY) { Assert(state->btps_minkey == NULL); - state->btps_minkey = CopyIndexTuple(itup); + /* + * Truncate the tuple that we're going to insert + * into the parent page as a downlink + */ + if (indnkeyatts != indnatts && P_ISLEAF(pageop)) + state->btps_minkey = index_truncate_tuple(wstate->index, itup); + else + state->btps_minkey = CopyIndexTuple(itup); It seems that the above code always ensures that, for leaf pages, the high key is a truncated tuple. What is less clear is, if that is true, why you need to re-ensure it again for the old page in the code below: @@ -510,6 +513,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) { .. + if (indnkeyatts != indnatts && P_ISLEAF(opageop)) + { + /* + * It's essential to truncate High key here. + * The purpose is not just to save more space on this particular page, + * but to keep whole b-tree structure consistent. Subsequent insertions + * assume that hikey is already truncated, and so they should not + * worry about it, when copying the high key into the parent page + * as a downlink. + * NOTE It is not crutial for reliability in present, + * but maybe it will be that in the future. + */ + keytup = index_truncate_tuple(wstate->index, oitup); + + /* delete "wrong" high key, insert keytup as P_HIKEY. */ + PageIndexTupleDelete(opage, P_HIKEY); + + if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY)) + elog(ERROR, "failed to rewrite compressed item in index \"%s\"", + RelationGetRelationName(wstate->index)); + } + .. .. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
28.08.2016 09:13, Amit Kapila:
> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: @@ -590,7 +622,14 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) if (last_off == P_HIKEY) { Assert(state->btps_minkey == NULL); - state->btps_minkey = CopyIndexTuple(itup); + /* + * Truncate the tuple that we're going to insert + * into the parent page as a downlink + */ + if (indnkeyatts != indnatts && P_ISLEAF(pageop)) + state->btps_minkey = index_truncate_tuple(wstate->index, itup); + else + state->btps_minkey = CopyIndexTuple(itup); It seems that the above code always ensures that, for leaf pages, the high key is a truncated tuple. What is less clear is, if that is true, why you need to re-ensure it again for the old page in the code below:

Thank you for the question. The investigation took a long time)

As far as I understand, the code above only applies to the first tuple of each level, while the code you have quoted below truncates high keys for all other pages. There is a comment that clarifies the situation:

/* * If the new item is the first for its page, stash a copy for later. Note * this will only happen for the first item on a level; on later pages, * the first item for a page is copied from the prior page in the code * above. */

So the patch is correct. We could go further and remove this index_truncate_tuple() call, because the first key of any inner (or root) page doesn't need any key at all: it simply points to the leftmost page of the level below. But it's not a bug, because truncating one tuple per level doesn't add any considerable overhead. So I want to leave the patch in its current state.

> @@ -510,6 +513,8 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) { .. + if (indnkeyatts != indnatts && P_ISLEAF(opageop)) + { + /* + * It's essential to truncate High key here. + * The purpose is not just to save more space on this particular page, + * but to keep whole b-tree structure consistent. Subsequent insertions + * assume that hikey is already truncated, and so they should not + * worry about it, when copying the high key into the parent page + * as a downlink. + * NOTE It is not crutial for reliability in present, + * but maybe it will be that in the future. + */ + keytup = index_truncate_tuple(wstate->index, oitup); + + /* delete "wrong" high key, insert keytup as P_HIKEY. */ + PageIndexTupleDelete(opage, P_HIKEY); + + if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY)) + elog(ERROR, "failed to rewrite compressed item in index \"%s\"", + RelationGetRelationName(wstate->index)); + } + .. ..

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
One more update. I added ORDER BY clause to regression tests. It was done as a separate bugfix patch by Tom Lane some time ago, but it definitely should be included into the patch. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > 28.08.2016 09:13, Amit Kapila: > > On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: > > > So the patch is correct. > We can go further and remove this index_truncate_tuple() call, because > the first key of any inner (or root) page doesn't need any key at all. > Anyway, I think truncation happens if the page is at the leaf level, and that is ensured by the check, so I think we can't remove this: + if (indnkeyatts != indnatts && P_ISLEAF(pageop)) -- I have one more question regarding this truncate high-key concept. I think if the high key is truncated, then during insertion, for cases like the one below, it moves to the next page, whereas the current page needs to be split. Assume index on c1,c2,c3 and c2,c3 are including columns. Actual high key on leaf Page X - 3, 2 , 2 Truncated high key on leaf Page X 3 New insertion key 3, 1, 2 Now, I think for such cases during insertion, if the page X doesn't have enough space, it will move to the next page whereas ideally it should split the current page. Refer to function _bt_findinsertloc() for this logic. Is this truncation concept of the high key needed for the correctness of the patch, or is it just to save space in the index? If you need this, then I think nbtree/README needs to be updated. -- I am getting an Assertion failure when I use this patch with a database created with a build before this patch. However, if I create a fresh database it works fine. Assertion failure details are as below: LOG: database system is ready to accept connections LOG: autovacuum launcher started TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/heaptuple.c", Line: 532) LOG: server process (PID 1404) was terminated by exception 0x80000003 HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. LOG: terminating any other active server processes -- @@ -1260,14 +1262,14 @@ RelationInitIndexAccessInfo(Relation relation) * Allocate arrays to hold data */ relation->rd_opfamily = (Oid *) - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid)); + MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid)); relation->rd_opcintype = (Oid *) - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid)); + MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid)); amsupport = relation->rd_amroutine->amsupport; if (amsupport > 0) { - int nsupport = natts * amsupport; + int nsupport = indnatts * amsupport; relation->rd_support = (RegProcedure *) MemoryContextAllocZero(indexcxt, nsupport * sizeof(RegProcedure)); @@ -1281,10 +1283,10 @@ RelationInitIndexAccessInfo(Relation relation) } relation->rd_indcollation = (Oid *) - MemoryContextAllocZero(indexcxt, natts * sizeof(Oid)); + MemoryContextAllocZero(indexcxt, indnatts * sizeof(Oid)); Can you add a comment in the above code or some other related place as to why you need some attributes in the relcache entry of size indnkeyatts and others of size indnatts? 
-- @@ -63,17 +63,26 @@ _bt_mkscankey(Relation rel, IndexTuple itup){ ScanKey skey; TupleDesc itupdesc; - int natts; + int indnatts, + indnkeyatts; int16 *indoption; int i; itupdesc = RelationGetDescr(rel); - natts = RelationGetNumberOfAttributes(rel); + indnatts = IndexRelationGetNumberOfAttributes(rel); + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); indoption = rel->rd_indoption; - skey = (ScanKey) palloc(natts * sizeof(ScanKeyData)); + Assert(indnkeyatts != 0); + Assert(indnkeyatts <= indnatts); Here I think you need to declare indnatts as PG_USED_FOR_ASSERTS_ONLY, otherwise it will give a warning on some platforms. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 20, 2016 at 10:51 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova > <a.lubennikova@postgrespro.ru> wrote: >> 28.08.2016 09:13, Amit Kapila: >> >> On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova >> <a.lubennikova@postgrespro.ru> wrote: >> >> >> So the patch is correct. >> We can go further and remove this index_truncate_tuple() call, because >> the first key of any inner (or root) page doesn't need any key at all. >> > > Anyway, I think truncation happens if the page is at the leaf level, and > that is ensured by the check, so I think we can't remove this: > + if (indnkeyatts != indnatts && P_ISLEAF(pageop)) > > > -- I have one more question regarding this truncate high-key concept. > I think if the high key is truncated, then during insertion, for cases > like the one below, it moves to the next page, whereas the current page > needs to be split. > > Assume index on c1,c2,c3 and c2,c3 are including columns. > > Actual high key on leaf Page X - > 3, 2 , 2 > Truncated high key on leaf Page X > 3 > > New insertion key > 3, 1, 2 > > Now, I think for such cases during insertion, if the page X doesn't > have enough space, it will move to the next page whereas ideally it > should split the current page. Refer to function _bt_findinsertloc() for > this logic. > Basically, what I wanted to know here is: do we maintain ordering for keys with respect to included columns while storing them (in the above example, do we ensure that 3,1,2 is always stored before 3,2,2)? > > > -- I am getting an Assertion failure when I use this patch with a database > created with a build before this patch. However, if I create a fresh > database it works fine. Assertion failure details are as below: > I have tried this test on my Windows m/c only. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: 28.08.2016 09:13, Amit Kapila: On Mon, Aug 15, 2016 at 8:15 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: So the patch is correct. We can go further and remove this index_truncate_tuple() call, because the first key of any inner (or root) page doesn't need any key at all. Anyway, I think truncation happens if the page is at the leaf level, and that is ensured by the check, so I think we can't remove this: + if (indnkeyatts != indnatts && P_ISLEAF(pageop)) -- I have one more question regarding this truncate high-key concept. I think if the high key is truncated, then during insertion, for cases like the one below, it moves to the next page, whereas the current page needs to be split. Assume index on c1,c2,c3 and c2,c3 are including columns. Actual high key on leaf Page X - 3, 2 , 2 Truncated high key on leaf Page X 3 New insertion key 3, 1, 2 Now, I think for such cases during insertion, if the page X doesn't have enough space, it will move to the next page whereas ideally it should split the current page. Refer to function _bt_findinsertloc() for this logic.
Thank you again for the review.
The problem seems really tricky, but the answer is simple.
We store included columns unordered. It was mentioned somewhere in
this thread. Let me give you an example:
create table t (i int, p point);
create index on t (i) including (p);
"point" data type doesn't have any opclass for btree.
Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
We have no idea what is the "correct order" for this attribute.
So the answer is "it doesn't matter". When searching in the index,
we know that only key attrs are ordered, so only they can be used
in a scankey. Other columns are filtered after retrieving data.
explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
QUERY PLAN
-------------------------------------------------------------------
Index Only Scan using idx on t (cost=0.14..4.20 rows=1 width=20)
Index Cond: (i = 0)
Filter: (p <@ '<(0,0),2>'::circle)
The same approach is used for included columns of any type, even if
their data types have an opclass.
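To restate the point with an included column of an orderable type (a sketch; the plan shape is an expectation following from the unordered-payload rule, not captured output):

CREATE TABLE t2 (a int, b int);
CREATE UNIQUE INDEX t2_idx ON t2 (a) INCLUDING (b);

-- The index is ordered only by "a"; "b" is unordered payload, so ordering by
-- (a, b) still needs an explicit Sort node on top of the index(-only) scan:
EXPLAIN SELECT a, b FROM t2 WHERE a < 100 ORDER BY a, b;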
> Is this truncation concept of the high key needed for the correctness of the patch, or is it just to save space in the index? If you need this, then I think nbtree/README needs to be updated.
Now it's done only for space saving. We never check included attributes
in non-leaf pages, so why store them? Especially if we assume that included
attributes can be quite long.
There is already a note in documentation:
+ It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+ also be used for non-unique indexes, as any columns which are not required
+ for the searching or ordering of records can be included in the
+ <literal>INCLUDING</> clause, which can slightly reduce the size of the index,
+ due to storing included attributes only in leaf index pages.
What should I add to the README (or to the documentation)
to make it more understandable?
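One way to make the leaf-only storage visible, which might be worth mentioning in the README (a sketch assuming contrib/pageinspect is installed; block numbers vary, so use bt_metap() to locate the root):

CREATE EXTENSION IF NOT EXISTS pageinspect;
CREATE TABLE t3 (id int, payload text);
INSERT INTO t3 SELECT g, repeat('x', 100) FROM generate_series(1, 100000) g;
CREATE UNIQUE INDEX t3_idx ON t3 (id) INCLUDING (payload);

-- Leaf tuples carry key + payload; inner tuples should be truncated to the
-- key, which shows up as much smaller itemlen values on the root/internal pages.
SELECT root FROM bt_metap('t3_idx');
SELECT itemoffset, itemlen FROM bt_page_items('t3_idx', 1) LIMIT 5;  -- a leaf page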
> -- I am getting an Assertion failure when I use this patch with a database created with a build before this patch. However, if I create a fresh database it works fine. Assertion failure details are as below: LOG: database system is ready to accept connections LOG: autovacuum launcher started TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/heaptuple.c", Line: 532) LOG: server process (PID 1404) was terminated by exception 0x80000003 HINT: See C include file "ntstatus.h" for a description of the hexadecimal value. LOG: terminating any other active server processes
That is expected behavior, because catalog versions are not compatible.
But I wonder why there was no message about that?
I suppose that's because CATALOG_VERSION_NO was outdated in my
patch. As far as I know, the committer will change it before the commit.
Try the new patch with the updated value. It should fail with a message about
incompatible versions.
If that is not the reason for your assertion failure, please provide
more information to reproduce the situation.
> @@ -1260,14 +1262,14 @@ RelationInitIndexAccessInfo(Relation relation)
>    * Allocate arrays to hold data
>    */
>   relation->rd_opfamily = (Oid *)
> -     MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> +     MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
>   relation->rd_opcintype = (Oid *)
> -     MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> +     MemoryContextAllocZero(indexcxt, indnkeyatts * sizeof(Oid));
>
>   amsupport = relation->rd_amroutine->amsupport;
>   if (amsupport > 0)
>   {
> -     int nsupport = natts * amsupport;
> +     int nsupport = indnatts * amsupport;
>       relation->rd_support = (RegProcedure *)
>           MemoryContextAllocZero(indexcxt, nsupport * sizeof(RegProcedure));
> @@ -1281,10 +1283,10 @@ RelationInitIndexAccessInfo(Relation relation)
>   }
>   relation->rd_indcollation = (Oid *)
> -     MemoryContextAllocZero(indexcxt, natts * sizeof(Oid));
> +     MemoryContextAllocZero(indexcxt, indnatts * sizeof(Oid));
>
> Can you add a comment in above code or some other related place as to why you need some attributes in relcache entry of size indnkeyatts and others of size indnatts?
Done. I hope that's enough.
The same logic is used in DefineIndex(), which already has comments.
> @@ -63,17 +63,26 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
> {
>   ScanKey skey;
>   TupleDesc itupdesc;
> - int natts;
> + int indnatts,
> +     indnkeyatts;
>   int16 *indoption;
>   int i;
>
>   itupdesc = RelationGetDescr(rel);
> - natts = RelationGetNumberOfAttributes(rel);
> + indnatts = IndexRelationGetNumberOfAttributes(rel);
> + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
>   indoption = rel->rd_indoption;
>
> - skey = (ScanKey) palloc(natts * sizeof(ScanKeyData));
> + Assert(indnkeyatts != 0);
> + Assert(indnkeyatts <= indnatts);
>
> Here I think you need to declare indnatts as PG_USED_FOR_ASSERTS_ONLY, otherwise it will give warning on some platforms.

Fixed. Thank you for the advice, I didn't know about this macro before.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 20.09.2016 08:21, Amit Kapila:
> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 28.08.2016 09:13, Amit Kapila:
>
> The problem seems really tricky, but the answer is simple.
> We store included columns unordered. It was mentioned somewhere in this thread.

Is there any fundamental problem in storing them in ordered way? I mean to say, you need to anyway store all the column values on leaf page, so why can't we find the exact location for the complete key. Basically use truncated key to reach to leaf level and then use the complete key to find the exact location to store the key. I might be missing some thing here, but if we can store them in ordered fashion, we can use them even for queries containing ORDER BY (where ORDER BY contains included columns).

> Let me give you an example:
>
> create table t (i int, p point);
> create index on (i) including (p);
> "point" data type doesn't have any opclass for btree.
> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
> We have no idea what is the "correct order" for this attribute.
> So the answer is "it doesn't matter". When searching in index,
> we know that only key attrs are ordered, so only them can be used
> in scankey. Other columns are filtered after retrieving data.
>
> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
> QUERY PLAN
> -------------------------------------------------------------------
> Index Only Scan using idx on t (cost=0.14..4.20 rows=1 width=20)
>   Index Cond: (i = 0)
>   Filter: (p <@ '<(0,0),2>'::circle)

I think here reason for using Filter is that because we don't keep included columns in scan keys, can't we think of having them in scan keys, but use only key columns in scan key to reach till leaf level and then use complete scan key at leaf level.

> The same approach is used for included columns of any type, even if their data types have opclass.
>
>> Is this truncation concept of high key needed for correctness of patch or is it just to save space in index? If you need this, then I think nbtree/Readme needs to be updated.
>
> Now it's done only for space saving. We never check included attributes in non-leaf pages, so why store them? Especially if we assume that included attributes can be quite long.
> There is already a note in documentation:
>
> + It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
> + also can be used for non-unique indexes as any columns which are not required
> + for the searching or ordering of records can be included in the
> + <literal>INCLUDING</> clause, which can slightly reduce the size of the index,
> + due to storing included attributes only in leaf index pages.

Okay, thanks for clarification.

> What should I add to README (or to documentation), to make it more understandable?

May be add the data representation like only leaf pages contains all the columns and how the scan works. I think you can see if you can extend "Notes About Data Representation" and/or "Other Things That Are Handy to Know" sections in the existing README.

>> I am getting Assertion failure when I use this patch with database created with a build before this patch. However, if I create a fresh database it works fine. Assertion failure details are as below:
>>
>> LOG: database system is ready to accept connections
>> LOG: autovacuum launcher started
>> TRAP: unrecognized TOAST vartag("((bool) 1)", File: "src/backend/access/common/heaptuple.c", Line: 532)
>> LOG: server process (PID 1404) was terminated by exception 0x80000003
>> HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
>> LOG: terminating any other active server processes
>
> That is expected behavior, because catalog versions are not compatible.
> But I wonder why there was no message about that?
> I suppose, that's because CATALOG_VERSION_NO was outdated in my patch. As well as I know, committer will change it before the commit.
> Try new patch with updated value. It should fail with a message about incompatible versions.

Yeah, that must be the reason, but let's not change it now, otherwise we will face conflicts while applying the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
24.09.2016 15:36, Amit Kapila:
> On Wed, Sep 21, 2016 at 6:51 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> 20.09.2016 08:21, Amit Kapila:
>> On Tue, Sep 6, 2016 at 10:18 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> 28.08.2016 09:13, Amit Kapila:
>>
>> The problem seems really tricky, but the answer is simple.
>> We store included columns unordered. It was mentioned somewhere in this thread.
>
> Is there any fundamental problem in storing them in ordered way? I mean to say, you need to anyway store all the column values on leaf page, so why can't we find the exact location for the complete key. Basically use truncated key to reach to leaf level and then use the complete key to find the exact location to store the key. I might be missing some thing here, but if we can store them in ordered fashion, we can use them even for queries containing ORDER BY (where ORDER BY contains included columns).

I'd say that the reason for not using included columns in any operations which require comparisons is that they don't have an opclass.
Let's go back to the example of points. This data type doesn't have any opclass for B-tree, for fundamental reasons. And we cannot apply _bt_compare() and the like to this attribute, so we don't include it in the scan key.

create table t (i int, i2 int, p point);
create index idx1 on t (i) including (i2);
create index idx2 on t (i) including (p);
create index idx3 on t (i) including (i2, p);
create index idx4 on t (i) including (p, i2);

You can keep tuples ordered in idx1, but not for idx2; partially ordered for idx3, but not for idx4.

At the very beginning of this thread [1], I suggested using opclasses where possible — exactly the same idea you're thinking about. But after a short discussion, we came to the conclusion that it would require many additional checks and would be too complicated, at least for the initial patch.

>> Let me give you an example:
>>
>> create table t (i int, p point);
>> create index on (i) including (p);
>> "point" data type doesn't have any opclass for btree.
>> Should we insert (0, '(0,2)') before (0, '(1,1)') or after?
>> We have no idea what is the "correct order" for this attribute.
>> So the answer is "it doesn't matter". When searching in index,
>> we know that only key attrs are ordered, so only them can be used
>> in scankey. Other columns are filtered after retrieving data.
>>
>> explain select i,p from t where i =0 and p <@ circle '((0,0),2)';
>> QUERY PLAN
>> -------------------------------------------------------------------
>> Index Only Scan using idx on t (cost=0.14..4.20 rows=1 width=20)
>>   Index Cond: (i = 0)
>>   Filter: (p <@ '<(0,0),2>'::circle)
>
> I think here reason for using Filter is that because we don't keep included columns in scan keys, can't we think of having them in scan keys, but use only key columns in scan key to reach till leaf level and then use complete scan key at leaf level.
>
>> What should I add to README (or to documentation), to make it more understandable?
>
> May be add the data representation like only leaf pages contains all the columns and how the scan works. I think you can see if you can extend "Notes About Data Representation" and or "Other Things That Are Handy to Know" sections in existing README.

Ok, I'll write it in a few days.

[1] https://www.postgresql.org/message-id/55F84DF4.5030207@postgrespro.ru

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> Is there any fundamental problem in storing them in ordered way? I mean to say, you need to anyway store all the column values on leaf page, so why can't we find the exact location for the complete key. Basically use truncated key to reach to leaf level and then use the complete key to find the exact location to store the key. I might be missing some thing here, but if we can store them in ordered fashion, we can use them even for queries containing ORDER BY (where ORDER BY contains included columns).
>
> I'd say that the reason for not using included columns in any operations which require comparisons, is that they don't have opclass.
> Let's go back to the example of points. This data type don't have any opclass for B-tree, because of fundamental reasons. And we can not apply _bt_compare() and others to this attribute, so we don't include it to scan key.
>
> create table t (i int, i2 int, p point);
> create index idx1 on (i) including (i2);
> create index idx2 on (i) including (p);
> create index idx3 on (i) including (i2, p);
> create index idx4 on (i) including (p, i2);
>
> You can keep tuples ordered in idx1, but not for idx2, partially ordered for idx3, but not for idx4.

Yeah, I think we shouldn't go there. I mean, once you start ordering by INCLUDING columns, then you're going to need to include them in leaf pages because otherwise you can't actually guarantee that they are in the right order. And then you have to wonder why an INCLUDING column is any different from a non-INCLUDING column. It seems best to make a firm rule that INCLUDING columns are there only for the values, not for ordering. That rule is simple and clear, which is a good thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Ok, I'll write it in a few days.

Marked as returned with feedback per last emails exchanged.

--
Michael
03.10.2016 05:22, Michael Paquier:
> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> Ok, I'll write it in a few days.
>
> Marked as returned with feedback per last emails exchanged.

The only complaint about this patch was the lack of a README, which is fixed now (see the attachment). So I added it to the new CF, marked as Ready for Committer.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Tue, Sep 27, 2016 at 7:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Sep 26, 2016 at 11:17 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>> Is there any fundamental problem in storing them in ordered way? [...]
>>
>> I'd say that the reason for not using included columns in any operations which require comparisons, is that they don't have opclass.
>> Let's go back to the example of points. This data type don't have any opclass for B-tree, because of fundamental reasons. And we can not apply _bt_compare() and others to this attribute, so we don't include it to scan key.
>>
>> create table t (i int, i2 int, p point);
>> create index idx1 on (i) including (i2);
>> create index idx2 on (i) including (p);
>> create index idx3 on (i) including (i2, p);
>> create index idx4 on (i) including (p, i2);
>>
>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for idx3, but not for idx4.
>
> Yeah, I think we shouldn't go there. I mean, once you start ordering by INCLUDING columns, then you're going to need to include them in leaf pages because otherwise you can't actually guarantee that they are in the right order.

I am not sure what you mean by the above, because the patch already stores INCLUDING columns in leaf pages.

> And then you have to wonder why an INCLUDING column is any different from a non-INCLUDING column. It seems best to make a firm rule that INCLUDING columns are there only for the values, not for ordering. That rule is simple and clear, which is a good thing.

Okay, we can make that firm rule, but I think the reasoning behind it should be clear. As far as I get it by reading some of the mails in this thread, it is because some of the other databases don't seem to support ordering for included columns, or because supporting the same can complicate the code. One point we should keep in mind is that other databases' suggestion to put many columns in the INCLUDING clause to get Index Only scans might not hold equally good for PostgreSQL, because it can turn many would-be HOT updates into non-HOT updates.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> I'd say that the reason for not using included columns in any operations which require comparisons, is that they don't have opclass. [...]
>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for idx3, but not for idx4.
>>
>> Yeah, I think we shouldn't go there. I mean, once you start ordering by INCLUDING columns, then you're going to need to include them in leaf pages because otherwise you can't actually guarantee that they are in the right order.
>
> I am not sure what you mean by above, because patch already stores INCLUDING columns in leaf pages.

Sorry, I meant non-leaf pages.

>> And then you have to wonder why an INCLUDING column is any different from a non-INCLUDING column. It seems best to make a firm rule that INCLUDING columns are there only for the values, not for ordering. That rule is simple and clear, which is a good thing.
>
> Okay, we can make that firm rule, but I think reasoning behind that should be clear. As far as I get it by reading some of the mails in this thread, it is because some of the other databases doesn't seem to support ordering for included columns or supporting the same can complicate the code. One point, we should keep in mind that suggestion for including many other columns in INCLUDING clause to use Index Only scans by other databases might not hold equally good for PostgreSQL because it can lead to many HOT updates as non-HOT updates.

Right. Looking back, the originally articulated rationale for this patch was that you might want a single index that is UNIQUE ON (a) but also INCLUDING (b) rather than two indexes, a unique index on (a) and a non-unique index on (a, b). In that case, the patch is a straight-up win: you get the same number of HOT updates either way, but you don't use as much disk space, or spend as much CPU time and WAL updating your indexes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
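
For the record, the win Robert describes is easy to measure with a sketch like this (table name and data are made up for illustration; the exact sizes will vary):

create table t4 (a int, b int);
insert into t4 select g, g from generate_series(1, 1000000) g;
-- old way: two indexes with duplicated data
create unique index t4_unique on t4 (a);
create index t4_covering on t4 (a, b);
-- new way: one covering unique index
create unique index t4_incl on t4 (a) including (b);
select relname, pg_size_pretty(pg_relation_size(oid))
  from pg_class
 where relname in ('t4_unique', 't4_covering', 't4_incl');

Every INSERT/UPDATE then maintains one index instead of two, and the duplicated (a, b) data disappears.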
03.10.2016 15:29, Anastasia Lubennikova:
> 03.10.2016 05:22, Michael Paquier:
>> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>>> Ok, I'll write it in a few days.
>> Marked as returned with feedback per last emails exchanged.
>
> The only complaint about this patch was a lack of README, which is fixed now (see the attachment). So, I added it to new CF, marked as ready for committer.

One more fix for pg_upgrade.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Tue, Oct 4, 2016 at 7:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 4, 2016 at 9:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> I'd say that the reason for not using included columns in any operations which require comparisons, is that they don't have opclass. [...]
>>>> You can keep tuples ordered in idx1, but not for idx2, partially ordered for idx3, but not for idx4.
>>>
>>> Yeah, I think we shouldn't go there. I mean, once you start ordering by INCLUDING columns, then you're going to need to include them in leaf pages because otherwise you can't actually guarantee that they are in the right order.
>>
>> I am not sure what you mean by above, because patch already stores INCLUDING columns in leaf pages.
>
> Sorry, I meant non-leaf pages.

Okay, but in that case I think we don't need to store including columns in non-leaf pages to get the exact ordering. As mentioned upthread, we can use a truncated scan key to reach the leaf level and then use the complete key to find the exact location to store the key. This is only possible if there exists an opclass for the columns covered by the including clause. So we could allow "order by" to use an index scan only if the columns covered by the included clause have an opclass for btree.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 5, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, but in that case I think we don't need to store including columns in non-leaf pages to get the exact ordering. As mentioned upthread, we can use truncated scan key to reach to leaf level and then use the complete key to find the exact location to store the key. This is only possible if there exists an opclass for columns that are covered as part of including clause. So, we can allow "order by" to use index scan only if the columns covered in included clause have opclass for btree.

But what if there are many pages full of keys that have the same values for the non-INCLUDING columns?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 5, 2016 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Okay, but in that case I think we don't need to store including columns in non-leaf pages to get the exact ordering. As mentioned upthread, we can use truncated scan key to reach to leaf level and then use the complete key to find the exact location to store the key. This is only possible if there exists an opclass for columns that are covered as part of including clause. So, we can allow "order by" to use index scan only if the columns covered in included clause have opclass for btree.
>
> But what if there are many pages full of keys that have the same values for the non-INCLUDING columns?

I concur with Robert that INCLUDING columns should be just dead weight as far as the index is concerned. Even if opclass information is available for them, it's overcomplication for too little return. We do not need three classes of columns in an index.

regards, tom lane
On 10/4/16 10:47 AM, Anastasia Lubennikova wrote:
> 03.10.2016 15:29, Anastasia Lubennikova:
>> 03.10.2016 05:22, Michael Paquier:
>>> On Tue, Sep 27, 2016 at 12:17 AM, Anastasia Lubennikova
>>> <a.lubennikova@postgrespro.ru> wrote:
>>>> Ok, I'll write it in a few days.
>>> Marked as returned with feedback per last emails exchanged.
>>
>> The only complaint about this patch was a lack of README,
>> which is fixed now (see the attachment). So, I added it to new CF,
>> marked as ready for committer.
>
> One more fix for pg_upgrade.
Latest patch doesn't apply. See also review by Brad DeJong. I'm
setting it back to Waiting.
--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Updated version of the patch is attached. Besides code itself, it contains new regression test,
documentation updates and a paragraph in nbtree/README.
Syntax was changed - keyword is INCLUDE now as in other databases.
Below you can see the answers to the latest review by Brad DeJong.
Given "create table foo (a int, b int, c int, d int)" and "create unique index foo_a_b on foo (a, b) including (c)".index only? heap tuple needed?select a, b, c from foo where a = 1 yes noselect a, b, d from foo where a = 1 no yesselect a, b from foo where a = 1 and c = 1 ? ?
select a, b from foo where a = 1 and c = 1       yes           no

As you can see in the EXPLAIN output below, this query doesn't need a heap tuple. We can fetch the tuple using the index-only scan strategy, because btree never uses a lossy data representation (i.e. it stores the same data as the heap). Afterwards we apply the Filter (c = 1) to the fetched tuple.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Only Scan using foo_a_b on foo (cost=0.28..4.30 rows=1 width=8) (actual time=0.021..0.022 rows=1 loops=1)
Index Cond: (a = 1)
Filter: (c = 1)
Heap Fetches: 0
Planning time: 0.344 ms
Execution time: 0.073 ms
> Are included columns counted against the 32 column and 2712 byte index limits? I did not see either explicitly mentioned in the discussion or the documentation. I only ask because in SQL Server the limits are different for include columns.
These limits remain unchanged, since included attributes are stored in the very same way as regular index attributes.
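
One way to double-check that is a sketch like the following (the DO block merely builds a wide table; the error text reflects my reading of the INDEX_MAX_KEYS check and may differ slightly):

do $$
begin
    -- build a 33-column table: c1 .. c33
    execute format('create table wide (%s)',
                   (select string_agg(format('c%s int', i), ', ' order by i)
                      from generate_series(1, 33) i));
    -- 1 key column + 32 included columns = 33 > INDEX_MAX_KEYS (32)
    execute format('create index wide_idx on wide (c1) include (%s)',
                   (select string_agg(format('c%s', i), ', ' order by i)
                      from generate_series(2, 33) i));
end $$;
-- ERROR:  cannot use more than 32 columns in an index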
> 1. syntax - on 2016-08-14, Andrey Borodin wrote "I think MS SQL syntax INCLUDE instead of INCLUDING would be better". I would go further than that. This feature is already supported by 2 of the top 5 SQL databases and they both use INCLUDE. Using different syntax because of an internal implementation detail seems short sighted.
Done.
> 4. documentation - minor items (these are not actual diffs)

Thank you. All issues are fixed.

> 5. coding
> parse_utilcmd.c
> @@ -1334,6 +1334,38 @@ ...
> The loop is handling included columns separately. The loop adds the collation name for each included column if it is not the default.
> Q: Given that the create index/create constraint syntax does not allow a collation to be specified for included columns, how can you ever have a non-default collation?
> @@ -1776,6 +1816,7 @@
> The comment here says "NOTE that exclusion constraints don't support included nonkey attributes". However, the paragraph on INCLUDING in create_index.sgml says "It's the same for the other constraints (PRIMARY KEY and EXCLUDE)".
Good point.
In this version I added syntax for EXCLUDE and INCLUDE compatibility.
Though the names look weird, it works just like the other constraints. So the documentation is correct now.
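
For reference, a minimal sketch of the constraint forms (the table and constraint names are made up; the exact grammar is whatever the attached patch implements):

create table c_tbl (
    a int,
    b int,
    unique (a) include (b)   -- uniqueness is enforced on a only
);

alter table c_tbl
    add constraint c_tbl_excl
    exclude using btree (a with =) include (b);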
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
On 2017-01-09 16:02, Anastasia Lubennikova wrote:
> include_columns_10.0_v1.patch

The patch applies, compiles, and make check is OK. It yields nice performance gains and I haven't been able to break anything (yet). Some edits of the sgml changes are attached.

Thank you for this very useful improvement.

Erik Rijkers
Attachment
On Mon, Jan 9, 2017 at 8:32 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> Updated version of the patch is attached. Besides code itself, it contains new regression test, documentation updates and a paragraph in nbtree/README.

The latest patch doesn't apply cleanly. Few assorted comments:

1.
@@ -4806,16 +4810,25 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
{
..
+       /*
+        * Since we have covering indexes with non-key columns,
+        * we must handle them accurately here. non-key columns
+        * must be added into indexattrs, since they are in index,
+        * and HOT-update shouldn't miss them.
+        * Obviously, non-key columns couldn't be referenced by
+        * foreign key or identity key. Hence we do not include
+        * them into uindexattrs and idindexattrs bitmaps.
+        */
        if (attrnum != 0)
        {
            indexattrs = bms_add_member(indexattrs,
                                        attrnum - FirstLowInvalidHeapAttributeNumber);

-           if (isKey)
+           if (isKey && i < indexInfo->ii_NumIndexKeyAttrs)
                uindexattrs = bms_add_member(uindexattrs,
                                             attrnum - FirstLowInvalidHeapAttributeNumber);

-           if (isIDKey)
+           if (isIDKey && i < indexInfo->ii_NumIndexKeyAttrs)
                idindexattrs = bms_add_member(idindexattrs,
                                              attrnum - FirstLowInvalidHeapAttributeNumber);
..
}

Can included columns be part of primary key? If not, then won't you need a check similar to above for Primary keys?

2.
+ int indnkeyatts;  /* number of index key attributes*/
+ int indnatts;     /* total number of index attributes*/
+ Oid *indkeys;     /* In spite of the name 'indkeys' this field
+                    * contains both key and nonkey attributes*/

Before the end of the comment, one space is needed.

3.
 }
-
 /*
  * For UNIQUE and PRIMARY KEY, we just have a list of column names.
  *

Looks like spurious line removal.

4.
+ IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
  INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
  INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
  INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION
@@ -3431,17 +3433,18 @@ ConstraintElem:
      n->initially_valid = !n->skip_validation;
      $$ = (Node *)n;
  }
- | UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
+ | UNIQUE '(' columnList ')' opt_c_including opt_definition OptConsTableSpace

If we want to use INCLUDE in syntax, then it might be better to keep the naming reflect the same. For ex. instead of opt_c_including we should use opt_c_include.

5.
+opt_c_including: INCLUDE optcincluding { $$ = $2; }
+        | /* EMPTY */ { $$ = NIL; }
+        ;
+
+optcincluding : '(' columnList ')' { $$ = $2; }
+        ;
+

It seems optcincluding is redundant, why can't we directly specify along with INCLUDE? If there was some other use of optcincluding or if there is a complicated definition of the same then it would have made sense to define it separately. We have a lot of similar usage in gram.y, refer opt_in_database.

6.
+optincluding : '(' index_including_params ')' { $$ = $2; }
+        ;
+opt_including: INCLUDE optincluding { $$ = $2; }
+        | /* EMPTY */ { $$ = NIL; }
+        ;

Here the ordering of above clauses seems to be another way. Also, the naming of both seems to be confusing. I think either we can eliminate *optincluding* by following suggestion similar to the previous point or name them somewhat clearly (like opt_include_clause and opt_include_params/opt_include_list).

7. Can you include doc fixes suggested by Erik Rijkers [1]? I have checked them and they seem to be better than what is there in the patch.

[1] - https://www.postgresql.org/message-id/3863bca17face15c6acd507e0173a6dc%40xs4all.nl

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
14.02.2017 15:46, Amit Kapila:
> On Mon, Jan 9, 2017 at 8:32 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> Updated version of the patch is attached. Besides code itself, it contains new regression test, documentation updates and a paragraph in nbtree/README.
>
> The latest patch doesn't apply cleanly.

Fixed.

> Few assorted comments:
>
> 1.
> @@ -4806,16 +4810,25 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
> [...]
> Can included columns be part of primary key? If not, then won't you need a check similar to above for Primary keys?

No, they cannot be a part of any constraint, so I fixed the check.

> 2.
> + int indnkeyatts; /* number of index key attributes*/
> + int indnatts;    /* total number of index attributes*/
> [...]
> Before the end of the comment, one space is needed.
>
> 3.
> [...]
> Looks like spurious line removal.

Both are fixed.

> 4.
> [...]
> If we want to use INCLUDE in syntax, then it might be better to keep the naming reflect the same. For ex. instead of opt_c_including we should use opt_c_include.
>
> 5.
> [...]
> It seems optcincluding is redundant, why can't we directly specify along with INCLUDE? If there was some other use of optcincluding or if there is a complicated definition of the same then it would have made sense to define it separately. We have a lot of similar usage in gram.y, refer opt_in_database.
>
> 6.
> [...]
> Here the ordering of above clauses seems to be another way. Also, the naming of both seems to be confusing. I think either we can eliminate *optincluding* by following suggestion similar to the previous point or name them somewhat clearly (like opt_include_clause and opt_include_params/opt_include_list).

Thank you for this suggestion. I just wrote the code looking at the examples around, but the optincluding and optcincluding clauses do seem to be redundant. I've cleaned up the code.

> 7. Can you include doc fixes suggested by Erik Rijkers [1]? I have checked them and they seem to be better than what is there in the patch.

Yes, I've included them in the last version of the patch.

> [1] - https://www.postgresql.org/message-id/3863bca17face15c6acd507e0173a6dc%40xs4all.nl

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Thu, Feb 16, 2017 at 6:43 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
> 14.02.2017 15:46, Amit Kapila:
>> 4.
>> [...]
>> If we want to use INCLUDE in syntax, then it might be better to keep the naming reflect the same. For ex. instead of opt_c_including we should use opt_c_include.
>
> Thank you for this suggestion. I've just wrote the code looking at examples around, but optincluding and optcincluding clauses seem to be redundant. I've cleaned up the code.

I think you have cleaned only in gram.y as I could see the references to 'including' in other parts of code. For ex, see below code:

@@ -2667,6 +2667,7 @@ _copyConstraint(const Constraint *from)
    COPY_NODE_FIELD(raw_expr);
    COPY_STRING_FIELD(cooked_expr);
    COPY_NODE_FIELD(keys);
+   COPY_NODE_FIELD(including);
    COPY_NODE_FIELD(exclusions);
    COPY_NODE_FIELD(options);
    COPY_STRING_FIELD(indexname);
@@ -3187,6 +3188,7 @@ _copyIndexStmt(const IndexStmt *from)
    COPY_STRING_FIELD(accessMethod);
    COPY_STRING_FIELD(tableSpace);
    COPY_NODE_FIELD(indexParams);
+   COPY_NODE_FIELD(indexIncludingParams);

@@ -425,6 +425,13 @@ ConstructTupleDescriptor(Relation heapRelation,

    /*
+    * Code below is concerned to the opclasses which are not used
+    * with the included columns.
+    */
+   if (i >= indexInfo->ii_NumIndexKeyAttrs)
+       continue;
+

There seems to be code below the above check which is not directly related to opclasses, so not sure if you have missed that or is there any other reason to ignore that. I am referring to following code in the same function after the above check:

    /*
     * If a key type different from the heap value is specified, update
     * the type-related fields in the index tupdesc.
     */
    if (OidIsValid(keyType) && keyType != to->atttypid)

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2/16/17 08:13, Anastasia Lubennikova wrote:
> @@ -629,7 +630,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
>
>   HANDLER HAVING HEADER_P HOLD HOUR_P
>
> - IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
> + IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
>   INCLUDING INCREMENT INDEX INDEXES INHERIT INHERITS INITIALLY INLINE_P
>   INNER_P INOUT INPUT_P INSENSITIVE INSERT INSTEAD INT_P INTEGER
>   INTERSECT INTERVAL INTO INVOKER IS ISNULL ISOLATION

I think your syntax would read no worse, possibly even better, if you just used the existing INCLUDING keyword.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 16, 2017 at 6:43 PM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote:
>> 14.02.2017 15:46, Amit Kapila:
>>> 4.
>>> [...]
>>> If we want to use INCLUDE in syntax, then it might be better to keep the naming reflect the same. For ex. instead of opt_c_including we should use opt_c_include.
>>
>> Thank you for this suggestion. I've just wrote the code looking at examples around, but optincluding and optcincluding clauses seem to be redundant. I've cleaned up the code.
>
> I think you have cleaned only in gram.y as I could see the references to 'including' in other parts of code. For ex, see below code:
>
> @@ -2667,6 +2667,7 @@ _copyConstraint(const Constraint *from)
>     COPY_NODE_FIELD(raw_expr);
>     COPY_STRING_FIELD(cooked_expr);
>     COPY_NODE_FIELD(keys);
> +   COPY_NODE_FIELD(including);
>     COPY_NODE_FIELD(exclusions);
>     COPY_NODE_FIELD(options);
>     COPY_STRING_FIELD(indexname);
> @@ -3187,6 +3188,7 @@ _copyIndexStmt(const IndexStmt *from)
>     COPY_STRING_FIELD(accessMethod);
>     COPY_STRING_FIELD(tableSpace);
>     COPY_NODE_FIELD(indexParams);
> +   COPY_NODE_FIELD(indexIncludingParams);
There are a lot of variables named 'including*' in the patch.
Frankly, I don't see a reason to rename them. It's clear that they
refer to included attributes, whatever we call them: "include", "included" or "including".
> @@ -425,6 +425,13 @@ ConstructTupleDescriptor(Relation heapRelation,
>
>   /*
> +  * Code below is concerned to the opclasses which are not used
> +  * with the included columns.
> +  */
> + if (i >= indexInfo->ii_NumIndexKeyAttrs)
> +     continue;
> +
>
> There seems to be code below the above check which is not directly related to opclasses, so not sure if you have missed that or is there any other reason to ignore that. I am referring to following code in the same function after the above check:
>
>   /*
>    * If a key type different from the heap value is specified, update
>    * the type-related fields in the index tupdesc.
>    */
>   if (OidIsValid(keyType) && keyType != to->atttypid)

Good point.
I skipped some steps that should be executed for all attributes.
It is harmless though, since for btree (and all other access methods except hash) amkeytype is always invalid.
But I agree that the code can be clarified.
New patch with minor changes is attached.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Patch rebased to the current master is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
The following review has been posted through the commitfest application:
  make installcheck-world: tested, passed
  Implements feature:      tested, passed
  Spec compliant:          tested, passed
  Documentation:           tested, passed

This patch looks good to me. As I understand, we have both a complete feature and a consensus in the thread here. If there are no objections, I'm marking this patch as "Ready for Committer".

The new status of this patch is: Ready for Committer
>> - IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P
>> + IDENTITY_P IF_P ILIKE IMMEDIATE IMMUTABLE IMPLICIT_P IMPORT_P IN_P INCLUDE
>
> I think your syntax would read no worse, possibly even better, if you just used the existing INCLUDING keyword.

There was a discussion about naming in this thread, and both databases which support covering indexes use the INCLUDE keyword.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I had a look at the patch and played with it; it seems to look fine. I split it into two patches: core changes (+ bloom index fix) and btree itself. All docs are left in the first patch - I'm too lazy to rewrite documentation which is changed in the second patch.

Any objection from reviewers to pushing both patches?

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
Hi Teodor,

> I had a look on patch and played with it, seems, it looks fine. I splitted it to two patches: core changes (+bloom index fix) and btree itself. All docs are left in first patch - I'm too lazy to rewrite documentation which is changed in second patch.
> Any objection from reviewers to push both patches?

These patches look OK. Definitely no objections from me.

--
Best regards,
Aleksander Alekseev
On Thu, Mar 30, 2017 at 11:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I had a look on patch and played with it, seems, it looks fine. I splitted it to two patches: core changes (+bloom index fix) and btree itself. All docs are left in first patch - I'm too lazy to rewrite documentation which is changed in second patch.
> Any objection from reviewers to push both patches?

Has this really had enough review and testing? The last time it was pushed, it didn't go too well. And laziness is not a very good excuse for not dividing up patches properly.

It seems highly surprising to me that CheckIndexCompatible() only gets a one-line change in this patch. That seems unlikely to be correct.

Has anybody done some testing of this patch with the WAL consistency checker? Like, create some tables with indexes that have INCLUDE columns, set up a standby, enable consistency checking, pound the master, and see if the standby bails?

Has anybody tested this patch with amcheck? Does it break amcheck?

A few minor comments:

-   foreach(lc, constraint->keys)
+   else foreach(lc, constraint->keys)

That doesn't look like a reasonable way of formatting the code.

+ /* Here is some code duplication. But we do need it. */

That is not a very informative comment.

+ * NOTE It is not crutial for reliability in present,

Spelling, punctuation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-03-30 18:26:05 +0300, Teodor Sigaev wrote:
> Any objection from reviewers to push both patches?

> diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> index f2eda67..59029b9 100644
> --- a/contrib/bloom/blutils.c
> +++ b/contrib/bloom/blutils.c
> @@ -120,6 +120,7 @@ blhandler(PG_FUNCTION_ARGS)
>   amroutine->amclusterable = false;
>   amroutine->ampredlocks = false;
>   amroutine->amcanparallel = false;
> + amroutine->amcaninclude = false;

That name doesn't strike me as very descriptive.

> + <term><literal>INCLUDE</literal></term>
> + <listitem>
> + <para>
> + An optional <literal>INCLUDE</> clause allows a list of columns to be
> + specified which will be included in the non-key portion of the index.
> + Columns which are part of this clause cannot also exist in the
> + key columns portion of the index, and vice versa. The
> + <literal>INCLUDE</> columns exist solely to allow more queries to benefit
> + from <firstterm>index-only scans</> by including certain columns in the
> + index, the value of which would otherwise have to be obtained by reading
> + the table's heap. Having these columns in the <literal>INCLUDE</> clause
> + in some cases allows <productname>PostgreSQL</> to skip the heap read
> + completely. This also allows <literal>UNIQUE</> indexes to be defined on
> + one set of columns, which can include another set of columns in the
> + <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
> + It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
> + also can be used for non-unique indexes as any columns which are not required
> + for the searching or ordering of records can be used in the
> + <literal>INCLUDE</> clause, which can slightly reduce the size of the index.
> + Currently, only the B-tree access method supports this feature.
> + Expressions as included columns are not supported since they cannot be used
> + in index-only scans.
> + </para>
> + </listitem>
> + </varlistentry>

This could use some polishing.

> +/*
> + * Reform index tuple. Truncate nonkey (INCLUDE) attributes.
> + */
> +IndexTuple
> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
> +{
> +   TupleDesc   itupdesc = RelationGetDescr(idxrel);
> +   Datum       values[INDEX_MAX_KEYS];
> +   bool        isnull[INDEX_MAX_KEYS];
> +   IndexTuple  newitup;
> +   int         indnatts = IndexRelationGetNumberOfAttributes(idxrel);
> +   int         indnkeyatts = IndexRelationGetNumberOfKeyAttributes(idxrel);
> +
> +   Assert(indnatts <= INDEX_MAX_KEYS);
> +   Assert(indnkeyatts > 0);
> +   Assert(indnkeyatts < indnatts);
> +
> +   index_deform_tuple(olditup, itupdesc, values, isnull);
> +
> +   /* form new tuple that will contain only key attributes */
> +   itupdesc->natts = indnkeyatts;
> +   newitup = index_form_tuple(itupdesc, values, isnull);
> +   newitup->t_tid = olditup->t_tid;
> +
> +   itupdesc->natts = indnatts;

Uh, isn't this a *seriously* bad idea? If index_form_tuple errors out, this'll corrupt the tuple descriptor. Maybe also rename the function to index_build_key_tuple()?

> * Construct a string describing the contents of an index entry, in the
> * form "(key_name, ...)=(key_value, ...)". This is currently used
> - * for building unique-constraint and exclusion-constraint error messages.
> + * for building unique-constraint and exclusion-constraint error messages,
> + * so only key columns of index are checked and printed.

s/index/the index/

> @@ -368,7 +370,7 @@ systable_beginscan(Relation heapRelation,
>   {
>       int j;
>
> -     for (j = 0; j < irel->rd_index->indnatts; j++)
> +     for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)
>       {
>           if (key[i].sk_attno == irel->rd_index->indkey.values[j])
>           {
> @@ -376,7 +378,7 @@ systable_beginscan(Relation heapRelation,
>               break;
>           }
>       }
> -     if (j == irel->rd_index->indnatts)
> +     if (j == IndexRelationGetNumberOfAttributes(irel))
>           elog(ERROR, "column is not in index");
>   }

Not that it matters overly much, but why are we doing this for all attributes, rather than just key attributes?

> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -600,7 +600,7 @@ boot_openrel(char *relname)
>               relname, (int) ATTRIBUTE_FIXED_PART_SIZE);
>
>   boot_reldesc = heap_openrv(makeRangeVar(NULL, relname, -1), NoLock);
> - numattr = boot_reldesc->rd_rel->relnatts;
> + numattr = RelationGetNumberOfAttributes(boot_reldesc);
>   for (i = 0; i < numattr; i++)
>   {
>       if (attrtypes[i] == NULL)

That seems a bit unrelated.

> @@ -2086,7 +2086,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
>                 is_validated,
>                 RelationGetRelid(rel),  /* relation */
>                 attNos,                 /* attrs in the constraint */
> -               keycount,               /* # attrs in the constraint */
> +               keycount,               /* # key attrs in the constraint */
> +               keycount,               /* # total attrs in the constraint */
>                 InvalidOid,             /* not a domain constraint */
>                 InvalidOid,             /* no associated index */
>                 InvalidOid,             /* Foreign key fields */

It doesn't quite seem right to me to store this both in pg_index and pg_constraint.

> @@ -340,14 +341,27 @@ DefineIndex(Oid relationId,
>   numberOfAttributes = list_length(stmt->indexParams);
> - if (numberOfAttributes <= 0)
> -     ereport(ERROR,
> -             (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> -              errmsg("must specify at least one column")));
> +

Huh, why's that check gone?

> +opt_c_include: INCLUDE '(' columnList ')' { $$ = $3; }
> +       | /* EMPTY */ { $$ = NIL; }
> +       ;
> +opt_include: INCLUDE '(' index_including_params ')' { $$ = $3; }
> +       | /* EMPTY */ { $$ = NIL; }
> +       ;
> +
> +index_including_params: index_elem { $$ = list_make1($1); }
> +       | index_including_params ',' index_elem { $$ = lappend($1, $3); }
> +       ;
> +

Why do we have multiple different definitions of this?

> @@ -1979,6 +2017,48 @@ transformIndexConstraint(Constraint *constraint, CreateStmtContext *cxt)
>       index->indexParams = lappend(index->indexParams, iparam);
>   }
>
> + /* Here is some code duplication. But we do need it. */

Aha?

> + foreach(lc, constraint->including)
> + {
> +     char *key = strVal(lfirst(lc));
> +     bool found = false;
> +     ColumnDef *column = NULL;
> +     ListCell *columns;
> +     IndexElem *iparam;
> +
> +     foreach(columns, cxt->columns)
> +     {
> +         column = (ColumnDef *) lfirst(columns);
> +         Assert(IsA(column, ColumnDef));
> +         if (strcmp(column->colname, key) == 0)
> +         {
> +             found = true;
> +             break;
> +         }
> +     }
> +
> +     /*
> +      * In the ALTER TABLE case, don't complain about index keys not
> +      * created in the command; they may well exist already. DefineIndex
> +      * will complain about them if not, and will also take care of marking
> +      * them NOT NULL.
> +      */

Uh. Why should they be marked as NOT NULL? ISTM the comment has been copied here without adjustments.

> @@ -1275,6 +1275,21 @@ pg_get_indexdef_worker(Oid indexrelid, int colno,
>   Oid keycoltype;
>   Oid keycolcollation;
>
> + /*
> +  * attrsOnly flag is used for building unique-constraint and
> +  * exclusion-constraint error messages. Included attrs are
> +  * meaningless there, so do not include them in the message.
> +  */
> + if (attrsOnly && keyno >= idxrec->indnkeyatts)
> +     break;

Sounds like the parameter should be renamed then.

> +Included attributes in B-tree indexes
> +-------------------------------------
> +
> +Since 10.0 there is an optional INCLUDE clause, that allows to add

10.0 isn't right, since that's the "patch" version now.

> +a portion of non-key attributes to index. They exist to allow more queries
> +to benefit from index-only scans. We never use included attributes in
> +ScanKeys, neither for search nor for inserts. That allows us to include
> +into B-tree any datatypes, even those which don't have suitable opclass.
> +Included columns only stored in regular items on leaf pages. All inner
> +keys and high keys are truncated and contain only key attributes.
> +That helps to reduce the size of index.

s/index/the index/

> @@ -537,6 +542,28 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>   ItemIdSetUnused(ii); /* redundant */
>   ((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
>
> + if (indnkeyatts != indnatts && P_ISLEAF(opageop))
> + {
> +     /*
> +      * It's essential to truncate High key here.
> +      * The purpose is not just to save more space on this particular page,
> +      * but to keep whole b-tree structure consistent. Subsequent insertions
> +      * assume that hikey is already truncated, and so they should not
> +      * worry about it, when copying the high key into the parent page
> +      * as a downlink.

s/should/need/

> +      * NOTE It is not crutial for reliability in present,

s/crutial/crucial/

> +      * but maybe it will be that in the future.

"it's essential" ... "it is not crutial" -- that's contradictory.

> +     keytup = index_truncate_tuple(wstate->index, oitup);

The code in _bt_split previously claimed that it's the only place doing truncation...

> +     /* delete "wrong" high key, insert keytup as P_HIKEY. */
> +     PageIndexTupleDelete(opage, P_HIKEY);
> +     if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
> +         elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
> +              RelationGetRelationName(wstate->index));

Hm...

- Andres
Hi Robert,

> Has anybody done some testing of this patch with the WAL consistency checker? Like, create some tables with indexes that have INCLUDE columns, set up a standby, enable consistency checking, pound the master, and see if the standby bails?

I've decided to run such a test. It looks like there is a bug indeed. Steps to reproduce:

0. Apply the patch.
1. Build PostgreSQL using quick-build.sh [1]
2. Install master and replica using install.sh [2]
3. Download test.sql [3]
4. Run: `cat test.sql | psql`
5. In replica's logfile:

```
FATAL: inconsistent page found, rel 1663/16384/16396, forknum 0, blkno 1
```

> Has anybody tested this patch with amcheck? Does it break amcheck?

Amcheck doesn't complain.

[1] https://github.com/afiskon/pgscripts/blob/master/quick-build.sh
[2] https://github.com/afiskon/pgscripts/blob/master/install.sh
[3] http://afiskon.ru/s/88/93c544e6cf_test.sql

--
Best regards,
Aleksander Alekseev
30.03.2017 19:49, Robert Haas:
> On Thu, Mar 30, 2017 at 11:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> I had a look on patch and played with it, seems, it looks fine. I splitted it to two patches: core changes (+bloom index fix) and btree itself. All docs are left in first patch - I'm too lazy to rewrite documentation which is changed in second patch.
>> Any objection from reviewers to push both patches?
>
> Has this really had enough review and testing? The last time it was pushed, it didn't go too well. And laziness is not a very good excuse for not dividing up patches properly.

Well, I don't know how we can estimate the quality of the review or testing. The patch was reviewed by many people. Here are those who marked themselves as reviewers on this and previous commitfests: Stephen Frost (sfrost), Andrew Dunstan (adunstan), Aleksander Alekseev (a.alekseev), Amit Kapila (amitkapila), Andrey Borodin (x4m), Peter Geoghegan (pgeoghegan), David Rowley (davidrowley). To me that looks serious enough. These people, as well as many others, shared their thoughts on this topic and pointed out various mistakes. I fixed all the issues as soon as I could. And I'm not going to disappear once it is committed.

Personally, I always thought that we have Alpha and Beta releases for integration testing. Speaking of the feature itself, it has been included in our fork of PostgreSQL 9.6 since it was released, and as far as I know there were no complaints from users. That makes me believe that there are no critical bugs there, though there may be conflicts with some other features of v10.0.

> It seems highly surprising to me that CheckIndexCompatible() only gets a one line change in this patch. That seems unlikely to be correct.

What makes you think so? CheckIndexCompatible() only cares about possible opclass changes. For covering indexes, opclasses are only applicable to indnkeyatts, and that is exactly what was changed in this patch. Do you think it needs some other changes?

> Has anybody done some testing of this patch with the WAL consistency checker? Like, create some tables with indexes that have INCLUDE columns, set up a standby, enable consistency checking, pound the master, and see if the standby bails?

Good point. I missed this feature; I wish someone had mentioned this issue a bit earlier. And as Aleksander's test shows, there is some problem with my patch indeed. I'll fix it and send an updated patch.

> Has anybody tested this patch with amcheck? Does it break amcheck?

Yes, it breaks amcheck. Amcheck should be patched in order to work with covering indexes. We've discussed it with Peter before, and I even wrote a small patch. I'll attach it in the following message.

> A few minor comments:
>
> -   foreach(lc, constraint->keys)
> +   else foreach(lc, constraint->keys)
>
> That doesn't look like a reasonable way of formatting the code.
>
> + /* Here is some code duplication. But we do need it. */
>
> That is not a very informative comment.
>
> + * NOTE It is not crutial for reliability in present,
>
> Spelling, punctuation.

These will be fixed as well.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Mar 30, 2017 at 5:22 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Well, I don't know how we can estimate the quality of the review or
> testing. The patch was reviewed by many people. Here are those who marked
> themselves as reviewers on this and previous commitfests: Stephen Frost
> (sfrost), Andrew Dunstan (adunstan), Aleksander Alekseev (a.alekseev),
> Amit Kapila (amitkapila), Andrey Borodin (x4m), Peter Geoghegan
> (pgeoghegan), David Rowley (davidrowley).

Sure, but the amount of in-depth review seems to have been limited. Just
because somebody put their name down in the CommitFest application doesn't
mean that they did a detailed review of all the code.

>> It seems highly surprising to me that CheckIndexCompatible() only gets
>> a one line change in this patch. That seems unlikely to be correct.
>
> What makes you think so? CheckIndexCompatible() only cares about possible
> opclass changes. For covering indexes, opclasses are only applicable to
> indnkeyatts, and that is exactly what was changed in this patch. Do you
> think it needs some other changes?

Probably. I mean, for an INCLUDE column, it wouldn't matter if a collation
or opclass change happened, but if the base data type had changed, you'd
still need to rebuild the index. So presumably CheckIndexCompatible() ought
to be comparing some things, but not everything, for INCLUDE columns.

>> Has anybody tested this patch with amcheck? Does it break amcheck?
>
> Yes, it breaks amcheck. Amcheck should be patched in order to work with
> covering indexes. We've discussed it with Peter before and I even wrote a
> small patch. I'll attach it in the following message.

Great.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
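For concreteness, here is a hedged sketch of the scenario Robert describes (table and index names invented; whether the index survives each ALTER is precisely what CheckIndexCompatible() has to decide):

```sql
CREATE TABLE t (c1 int, c2 varchar(10));
CREATE UNIQUE INDEX t_idx ON t (c1) INCLUDE (c2);

-- A typmod-only change: clearly index-compatible.
ALTER TABLE t ALTER COLUMN c2 TYPE varchar(20);

-- A change of the base data type: Robert's point is that even though c2
-- carries no opclass or collation in the index, CheckIndexCompatible()
-- must still look at the stored type here before deciding that the
-- index can be kept.
ALTER TABLE t ALTER COLUMN c2 TYPE text;
```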
30.03.2017 22:11, Andres Freund:
>> Any objection from reviewers to push both patches?

First of all, I want to thank you and Robert for reviewing this patch.
Your expertise in postgres subsystems is really necessary for features like
this. I just wonder why you didn't share your thoughts and doubts before the
"last call".

> diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
> index f2eda67..59029b9 100644
> --- a/contrib/bloom/blutils.c
> +++ b/contrib/bloom/blutils.c
> @@ -120,6 +120,7 @@ blhandler(PG_FUNCTION_ARGS)
>      amroutine->amclusterable = false;
>      amroutine->ampredlocks = false;
>      amroutine->amcanparallel = false;
> +    amroutine->amcaninclude = false;
>
> That name doesn't strike me as very descriptive.

The feature is "index with included columns" and it uses the keyword
"INCLUDE". So the name looks good to me.
Any suggestions?

> + <term><literal>INCLUDE</literal></term>
> + <listitem>
> + <para>
> + An optional <literal>INCLUDE</> clause allows a list of columns to be
> + specified which will be included in the non-key portion of the index.
> + Columns which are part of this clause cannot also exist in the
> + key columns portion of the index, and vice versa. The
> + <literal>INCLUDE</> columns exist solely to allow more queries to benefit
> + from <firstterm>index-only scans</> by including certain columns in the
> + index, the value of which would otherwise have to be obtained by reading
> + the table's heap. Having these columns in the <literal>INCLUDE</> clause
> + in some cases allows <productname>PostgreSQL</> to skip the heap read
> + completely. This also allows <literal>UNIQUE</> indexes to be defined on
> + one set of columns, which can include another set of columns in the
> + <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
> + It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
> + also be used for non-unique indexes as any columns which are not required
> + for the searching or ordering of records can be used in the
> + <literal>INCLUDE</> clause, which can slightly reduce the size of the index.
> + Currently, only the B-tree access method supports this feature.
> + Expressions as included columns are not supported since they cannot be used
> + in index-only scans.
> + </para>
> + </listitem>
> + </varlistentry>
>
> This could use some polishing.

Definitely. But do you have any specific proposals?

> +/*
> + * Reform index tuple. Truncate nonkey (INCLUDE) attributes.
> + */
> +IndexTuple
> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
> +{
> +    TupleDesc   itupdesc = RelationGetDescr(idxrel);
> +    Datum       values[INDEX_MAX_KEYS];
> +    bool        isnull[INDEX_MAX_KEYS];
> +    IndexTuple  newitup;
> +    int         indnatts = IndexRelationGetNumberOfAttributes(idxrel);
> +    int         indnkeyatts = IndexRelationGetNumberOfKeyAttributes(idxrel);
> +
> +    Assert(indnatts <= INDEX_MAX_KEYS);
> +    Assert(indnkeyatts > 0);
> +    Assert(indnkeyatts < indnatts);
> +
> +    index_deform_tuple(olditup, itupdesc, values, isnull);
> +
> +    /* form new tuple that will contain only key attributes */
> +    itupdesc->natts = indnkeyatts;
> +    newitup = index_form_tuple(itupdesc, values, isnull);
> +    newitup->t_tid = olditup->t_tid;
> +
> +    itupdesc->natts = indnatts;
>
> Uh, isn't this a *seriously* bad idea? If index_form_tuple errors out,
> this'll corrupt the tuple descriptor.

Initial reasoning was something like this:

> Maybe it would be better to modify index_form_tuple() to accept a new
> argument with a number of attributes, then you can just Assert that
> this number is never higher than the number of attributes in the
> TupleDesc.

Good point.
I agree that this function is a bit strange. I have to set tupdesc->natts to
stay compatible with index_form_tuple(). I didn't want to add either a new
field to the tupledesc or a new parameter to index_form_tuple(), because
they are used widely.

But I hadn't considered the possibility of index_form_tuple() failure.
Fixed in this version of the patch. Now it creates a copy of the tupledesc
to pass to index_form_tuple().

> Maybe also rename the function to index_build_key_tuple()?

We discussed this with other reviewers; they suggested
index_truncate_tuple() instead of index_reform_tuple(). I think this name
reflects the essence of the function clearly enough, and I don't feel like
renaming it again.

> @@ -368,7 +370,7 @@ systable_beginscan(Relation heapRelation,
>      {
>          int j;
>
> -        for (j = 0; j < irel->rd_index->indnatts; j++)
> +        for (j = 0; j < IndexRelationGetNumberOfAttributes(irel); j++)
>          {
>              if (key[i].sk_attno == irel->rd_index->indkey.values[j])
>              {
>                  ...
>                  break;
>              }
>          }
> -        if (j == irel->rd_index->indnatts)
> +        if (j == IndexRelationGetNumberOfAttributes(irel))
>              elog(ERROR, "column is not in index");
>      }
>
> Not that it matters overly much, but why are we doing this for all
> attributes, rather than just key attributes?

Since we don't use included columns for system indexes, there is no
difference. I've just tried to minimize code changes here.

> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -600,7 +600,7 @@ boot_openrel(char *relname)
>               relname, (int) ATTRIBUTE_FIXED_PART_SIZE);
>      boot_reldesc = heap_openrv(makeRangeVar(NULL, relname, -1), NoLock);
> -    numattr = boot_reldesc->rd_rel->relnatts;
> +    numattr = RelationGetNumberOfAttributes(boot_reldesc);
>      for (i = 0; i < numattr; i++)
>      {
>          if (attrtypes[i] == NULL)
>
> That seems a bit unrelated.

I've replaced all references to relnatts with the macro, primarily to make
sure that I won't miss anything that should use only key attributes.

> @@ -2086,7 +2086,8 @@ StoreRelCheck(Relation rel, char *ccname, Node *expr,
>                            is_validated,
>                            RelationGetRelid(rel),    /* relation */
>                            attNos,        /* attrs in the constraint */
> -                          keycount,      /* # attrs in the constraint */
> +                          keycount,      /* # key attrs in the constraint */
> +                          keycount,      /* # total attrs in the constraint */
>                            InvalidOid,    /* not a domain constraint */
>                            InvalidOid,    /* no associated index */
>                            InvalidOid,    /* Foreign key fields */
>
> It doesn't quite seem right to me to store this both in pg_index and
> pg_constraint.

Initially, I did that to provide pg_get_constraintdef_worker() with info
about included columns. Maybe it can be solved in some other way, but for
now it is a tested and working implementation.

> +opt_c_include:  INCLUDE '(' columnList ')'           { $$ = $3; }
> +             |  /* EMPTY */                          { $$ = NIL; }
> +             ;
>
> +opt_include:    INCLUDE '(' index_including_params ')'   { $$ = $3; }
> +             |  /* EMPTY */                              { $$ = NIL; }
> +             ;
> +
> +index_including_params: index_elem                       { $$ = list_make1($1); }
> +             | index_including_params ',' index_elem     { $$ = lappend($1, $3); }
> +             ;
>
> Why do we have multiple different definitions of this?

Hm, columnList contains entries of columnElem type, while
index_including_params works with index_elem. Is there a way they can be
combined?

> +            keytup = index_truncate_tuple(wstate->index, oitup);
>
> The code in _bt_split previously claimed that it's the only place doing
> truncation...

To be precise, it claimed that about insertion of new values, not index
build:
> It's the only point in insertion process, where we perform truncation

Other comments about code format, spelling and comments are fixed in the
attached patches.
Thank you again for reviewing.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 2017-03-31 20:40:59 +0300, Anastasia Lubennikova wrote:
> 30.03.2017 22:11, Andres Freund:
> > Any objection from reviewers to push both patches?
>
> First of all, I want to thank you and Robert for reviewing this patch.
> Your expertise in postgres subsystems is really necessary for features
> like this. I just wonder why you didn't share your thoughts and doubts
> before the "last call".

Because there's a lot of other patches? I only looked because Teodor
announced he was thinking about committing - I just don't have the energy
to look at all patches before they're ready to commit. Unfortunately
"ready-for-committer" is very frequently not actually that :(

> > Maybe it would be better to modify index_form_tuple() to accept a new
> > argument with a number of attributes, then you can just Assert that
> > this number is never higher than the number of attributes in the
> > TupleDesc.
> Good point.
> I agree that this function is a bit strange. I have to set
> tupdesc->natts to stay compatible with index_form_tuple(). I didn't want
> to add either a new field to the tupledesc or a new parameter to
> index_form_tuple(), because they are used widely.
>
> But I hadn't considered the possibility of index_form_tuple() failure.
> Fixed in this version of the patch. Now it creates a copy of the
> tupledesc to pass to index_form_tuple().

That seems like it'd actually be a noticeable increase in memory allocator
overhead. I think we should just add (as just proposed in a separate
thread) an _extended version of it that allows specifying the number of
columns.

- Andres
On Fri, Mar 31, 2017 at 1:40 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> First of all, I want to thank you and Robert for reviewing this patch.
> Your expertise in postgres subsystems is really necessary for features
> like this. I just wonder why you didn't share your thoughts and doubts
> before the "last call".

I haven't done any significant technical work other than review patches in
14 months, and in the last several months I've often worked 10 and 12 hour
days to get more review done.

I think at one level you've got a fair complaint here - it's hard to get
things committed, and this patch probably didn't get as much attention as
it deserved. It's not so easy to know how to fix that. I'm pretty sure
"tell Andres and Robert to work harder" isn't it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>>> Maybe it would be better to modify index_form_tuple() to accept a new
>>> argument with a number of attributes, then you can just Assert that
>>> this number is never higher than the number of attributes in the
>>> TupleDesc.
>>
>> Good point. I agree that this function is a bit strange. I have to set
>> tupdesc->natts to stay compatible with index_form_tuple(). I didn't want
>> to add either a new field to the tupledesc or a new parameter to
>> index_form_tuple(), because they are used widely.
>>
>> But I hadn't considered the possibility of index_form_tuple() failure.
>> Fixed in this version of the patch. Now it creates a copy of the
>> tupledesc to pass to index_form_tuple().
>
> That seems like it'd actually be a noticeable increase in memory
> allocator overhead. I think we should just add (as just proposed in a
> separate thread) an _extended version of it that allows specifying the
> number of columns.

The function is called not that often - only once per page split for
indexes with included columns - so it doesn't look like dramatic overhead.
That's why I decided that a wrapper function would be more appropriate than
refactoring all index_form_tuple() calls.
But index_form_tuple_extended() looks like a better solution.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>> Other comments about code format, spelling and comments are fixed in the
>> attached patches.

One more version. Missed parse_utilcmd.c comment cleanup in the previous
0001 patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
31.03.2017 20:57, Robert Haas:
> On Fri, Mar 31, 2017 at 1:40 PM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> First of all, I want to thank you and Robert for reviewing this patch.
>> Your expertise in postgres subsystems is really necessary for features
>> like this. I just wonder why you didn't share your thoughts and doubts
>> before the "last call".
> I haven't done any significant technical work other than review patches
> in 14 months, and in the last several months I've often worked 10 and 12
> hour days to get more review done.
>
> I think at one level you've got a fair complaint here - it's hard to get
> things committed, and this patch probably didn't get as much attention
> as it deserved. It's not so easy to know how to fix that. I'm pretty
> sure "tell Andres and Robert to work harder" isn't it.

*off-topic*

No complaints from me - I understand how difficult reviewing is, and I
highly appreciate your work. The problem is that not all developers are
qualified enough to do a review. I've tried to make a course about postgres
internals, something like "Deep dive into the postgres codebase for
hackers", and it turned out to be really helpful for new developers. So I
wonder, maybe we could write some tips for new reviewers and testers as
well.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
I had a quick look at this on the flight back from PGConf.US.

On Fri, Mar 31, 2017 at 10:40 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> But I hadn't considered the possibility of index_form_tuple() failure.
> Fixed in this version of the patch. Now it creates a copy of the
> tupledesc to pass to index_form_tuple().

I think that we need to be 100% sure that index_truncate_tuple() will not
generate an IndexTuple that is larger than the original. Otherwise, you
could violate the "1/3 of page size exceeded" thing. We need to catch that
when the user actually inserts an oversized value. After that, it's always
too late. (See my remarks to Tom on the other thread about this, too.)

> We discussed this with other reviewers; they suggested
> index_truncate_tuple() instead of index_reform_tuple(). I think this name
> reflects the essence of the function clearly enough, and I don't feel
> like renaming it again.

+1.

Feedback so far:

* index_truncate_tuple() should have as an argument the number of
attributes. No need to "#include utils/rel.h" that way.

* I think that we should store this (the number of attributes), and use it
directly when comparing, per my remarks to Tom over on that other thread.
We should also use the free bit within IndexTupleData.t_info, to indicate
that the IndexTuple was truncated, just to make it clear to everyone that
might care that that's how these truncated IndexTuples need to be
represented.

Doing this would have no real impact on your patch, because for you this
will be 100% redundant. It will help external tools, and perhaps another,
more general suffix truncation patch that comes in the future. We should
try very hard to have a future-proof on-disk representation. I think that
this is quite possible.

* I suggest adding a "can't happen" defensive check + error that checks
that the tuple returned by index_truncate_tuple() is sized <= the original.
This cannot be allowed to ever happen. (Not that I think it will.)

* I see a small bug. You forgot to teach _bt_findsplitloc() about
truncation. It does this currently, which you did not update:

    /*
     * The first item on the right page becomes the high key of the left
     * page; therefore it counts against left space as well as right space.
     */
    leftfree -= firstrightitemsz;

I think that this accounting needs to be fixed.

* Not sure about one thing. What's the reason for this change?

> -    /* Log left page */
> -    if (!isleaf)
> -    {
> -        /*
> -         * We must also log the left page's high key, because the right
> -         * page's leftmost key is suppressed on non-leaf levels. Show it
> -         * as belonging to the left page buffer, so that it is not stored
> -         * if XLogInsert decides it needs a full-page image of the left
> -         * page.
> -         */
> -        itemid = PageGetItemId(origpage, P_HIKEY);
> -        item = (IndexTuple) PageGetItem(origpage, itemid);
> -        XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -    }
> +    /*
> +     * We must also log the left page's high key, because the right
> +     * page's leftmost key is suppressed on non-leaf levels. Show it
> +     * as belonging to the left page buffer, so that it is not stored
> +     * if XLogInsert decides it needs a full-page image of the left
> +     * page.
> +     */
> +    itemid = PageGetItemId(origpage, P_HIKEY);
> +    item = (IndexTuple) PageGetItem(origpage, itemid);
> +    XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));

Is this related to the problem that you mentioned to me that you'd fixed
when we spoke in person earlier today? You said something about WAL
logging, but I don't recall any details. I don't remember seeing this
change in prior versions.

Anyway, whatever the reason for doing this on the leaf level now, the
comments should be updated to explain it.

* Speaking of WAL-logging, I think that there is another bug in
btree_xlog_split(). You didn't change this existing code at all:

    /*
     * On leaf level, the high key of the left page is equal to the first
     * key on the right page.
     */
    if (isleaf)
    {
        ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

It seems like this was missed when you changed WAL-logging, since you do
something for this on the logging side, but not here, on the replay side.
No?

That's all I have for now. Maybe I can look again later, or tomorrow.

--
Peter Geoghegan
On Fri, Mar 31, 2017 at 4:31 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> That's all I have for now. Maybe I can look again later, or tomorrow.

I took another look, this time at code used during CREATE INDEX. More
feedback:

* I see no reason to expose _bt_pgaddtup() (to modify it to not be static,
so it can be called during CREATE INDEX for the truncated high key). You
could call PageAddItem() directly, just as _bt_pgaddtup() itself does, and
lose nothing. This is the case because the special steps within
_bt_pgaddtup() apply only when inserting the first real item (and only on
an internal page). You're only ever using _bt_pgaddtup() for the high key
offset. Would a raw PageAddItem() call lose anything?

I think I see why you've done this -- the existing CREATE INDEX
_bt_sortaddtup() routine (which is very similar to _bt_pgaddtup(), a
routine used for *insertion*) doesn't do the correct thing were you to use
it, because it assumes that the page is always rightmost (i.e., always has
no high key yet). The reason _bt_sortaddtup() exists is explained here:

 * This is almost like nbtinsert.c's _bt_pgaddtup(), but we can't use
 * that because it assumes that P_RIGHTMOST() will return the correct
 * answer for the page.  Here, we don't know yet if the page will be
 * rightmost.  Offset P_FIRSTKEY is always the first data key.
 */
static void
_bt_sortaddtup(Page page,
               Size itemsize,
               IndexTuple itup,
               OffsetNumber itup_off)
{
    ...
}

(...thinks some more...)

So, this difference only matters when you have a non-leaf item, which is
never subject to truncation in your patch. So, in fact, it doesn't matter
at all. I guess you should just use _bt_pgaddtup() after all, rather than
bothering with a raw PageAddItem(), even. But, don't forget to note why
this is okay above _bt_sortaddtup().

* Calling PageIndexTupleDelete() within _bt_buildadd(), which memmove()s
all other items on the leaf page, seems wasteful in the context of CREATE
INDEX. Can we do better?

* I also think that calling PageIndexTupleDelete() has a page space
accounting bug, because the following thing happens a second time for the
high key ItemId when the new code does this call:

    phdr->pd_lower -= sizeof(ItemIdData);

(The first time this happens is within _bt_buildadd() itself, just before
your patch calls PageIndexTupleDelete().)

* I don't think it's okay to let index_truncate_tuple() leak memory within
_bt_buildadd(). It's probably okay for nbtinsert.c callers of
index_truncate_tuple() to not be too careful, though, since those calls
occur in a per-tuple memory context. The same cannot be said for
_bt_buildadd()/CREATE INDEX calls.

* Speaking of memory management: is this really needed?

> @@ -554,7 +580,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
>       * Save a copy of the minimum key for the new page. We have to copy
>       * it off the old page, not the new one, in case we are not at leaf
>       * level.
> +     * Despite oitup is already initialized, it's important to get high
> +     * key from the page, since we could have replaced it with truncated
> +     * copy. See comment above.
>       */
> +    oitup = (IndexTuple) PageGetItem(opage, PageGetItemId(opage, P_HIKEY));
>      state->btps_minkey = CopyIndexTuple(oitup);

You didn't modify/truncate oitup in-place -- you effectively made a
(truncated) copy by calling index_truncate_tuple(). Maybe you can manage
the memory by assigning keytup to state->btps_minkey, in place of a
CopyIndexTuple(), just for the truncation case?

I haven't studied this in enough detail to be sure that that would be
correct, but it seems clear that a better strategy is needed for managing
memory within _bt_buildadd().

--
Peter Geoghegan
On Thu, Mar 30, 2017 at 8:26 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Any objection from reviewers to push both patches?

I object. Unfortunately, it seems very unlikely that we'll be able to get
the patch into shape in the allotted time before feature freeze, even with
the 1 week extension.

--
Peter Geoghegan
01.04.2017 02:31, Peter Geoghegan:
> * index_truncate_tuple() should have as an argument the number of
> attributes. No need to "#include utils/rel.h" that way.

Will fix.

> * I think that we should store this (the number of attributes), and
> use it directly when comparing, per my remarks to Tom over on that
> other thread. We should also use the free bit within
> IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
> just to make it clear to everyone that might care that that's how
> these truncated IndexTuples need to be represented.
>
> Doing this would have no real impact on your patch, because for you
> this will be 100% redundant. It will help external tools, and perhaps
> another, more general suffix truncation patch that comes in the
> future. We should try very hard to have a future-proof on-disk
> representation. I think that this is quite possible.

To be honest, I think that it would overcomplicate the patch, because this
exact patch has nothing to do with suffix truncation.
Although, we can add any necessary flags if this work is continued in the
future.

> * I suggest adding a "can't happen" defensive check + error that
> checks that the tuple returned by index_truncate_tuple() is sized <=
> the original. This cannot be allowed to ever happen. (Not that I think
> it will.)

There is already an assertion:
    Assert(IndexTupleSize(newitup) <= IndexTupleSize(olditup));
Do you think it is not enough?

> * I see a small bug. You forgot to teach _bt_findsplitloc() about
> truncation. It does this currently, which you did not update:
>
>     /*
>      * The first item on the right page becomes the high key of the left
>      * page; therefore it counts against left space as well as right
>      * space.
>      */
>     leftfree -= firstrightitemsz;
>
> I think that this accounting needs to be fixed.

Could you explain what's wrong with this accounting? We may expect to take
more space on the left page than will actually be taken after high key
truncation, but I don't see any problem here.

> * Not sure about one thing. What's the reason for this change?
>
>> -    /* Log left page */
>> -    if (!isleaf)
>> -    {
>> -        /*
>> -         * We must also log the left page's high key, because the right
>> -         * page's leftmost key is suppressed on non-leaf levels. Show it
>> -         * as belonging to the left page buffer, so that it is not stored
>> -         * if XLogInsert decides it needs a full-page image of the left
>> -         * page.
>> -         */
>> -        itemid = PageGetItemId(origpage, P_HIKEY);
>> -        item = (IndexTuple) PageGetItem(origpage, itemid);
>> -        XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
>> -    }
>> +    /*
>> +     * We must also log the left page's high key, because the right
>> +     * page's leftmost key is suppressed on non-leaf levels. Show it
>> +     * as belonging to the left page buffer, so that it is not stored
>> +     * if XLogInsert decides it needs a full-page image of the left
>> +     * page.
>> +     */
>> +    itemid = PageGetItemId(origpage, P_HIKEY);
>> +    item = (IndexTuple) PageGetItem(origpage, itemid);
>> +    XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
>
> Is this related to the problem that you mentioned to me that you'd
> fixed when we spoke in person earlier today? You said something about
> WAL logging, but I don't recall any details. I don't remember seeing
> this change in prior versions.
>
> Anyway, whatever the reason for doing this on the leaf level now, the
> comments should be updated to explain it.

This change is related to the bug described in this message:
https://www.postgresql.org/message-id/20170330192706.GA2565%40e733.localdomain
After the fix it is not reproducible. I will update the comments in the
next patch.

> * Speaking of WAL-logging, I think that there is another bug in
> btree_xlog_split(). You didn't change this existing code at all:
>
>     /*
>      * On leaf level, the high key of the left page is equal to the first
>      * key on the right page.
>      */
>     if (isleaf)
>     {
>         ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
>
>         left_hikey = PageGetItem(rpage, hiItemId);
>         left_hikeysz = ItemIdGetLength(hiItemId);
>     }
>
> It seems like this was missed when you changed WAL-logging, since you
> do something for this on the logging side, but not here, on the replay
> side. No?

I changed it. Now we always use the high key saved in the xlog record, so
this code isn't needed anymore and can be deleted. Thank you for the
notice. I will send an updated patch today.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 4, 2017 at 3:07 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
>> * I think that we should store this (the number of attributes), and
>> use it directly when comparing, per my remarks to Tom over on that
>> other thread. We should also use the free bit within
>> IndexTupleData.t_info, to indicate that the IndexTuple was truncated,
>> just to make it clear to everyone that might care that that's how
>> these truncated IndexTuples need to be represented.
>>
>> Doing this would have no real impact on your patch, because for you
>> this will be 100% redundant. It will help external tools, and perhaps
>> another, more general suffix truncation patch that comes in the
>> future. We should try very hard to have a future-proof on-disk
>> representation. I think that this is quite possible.
>
> To be honest, I think that it would overcomplicate the patch, because
> this exact patch has nothing to do with suffix truncation.
> Although, we can add any necessary flags if this work is continued in
> the future.

Yes, doing things that way would mean adding a bit more complexity to your
patch, but IMV it would be worth it to have the on-disk format be
compatible with what a full suffix truncation patch will eventually
require.

Obviously I disagree with what you say here -- I think that your patch
*does* have plenty in common with suffix truncation. But, you don't have to
even agree with me on that to see why what I propose is still a good idea.

Tom Lane had a specific objection to this patch -- catalog metadata is
currently necessary to interpret internal page IndexTuples [1]. However, by
storing the true number of columns in the case of truncated tuples, we can
make the situation with IndexTuples similar enough to the existing
situation with heap tuples, where the number of attributes is available
right in the header as "natts". We don't have to rely on something like
catalog metadata from a great distance, where some caller may forget to
pass through the metadata to a lower level.

So, presumably doing things this way addresses Tom's exact objection to the
truncation aspect of this patch [2]. We have the capacity to store
something like natts "for free" -- let's use it. The lack of any direct
source of metadata was called "dangerous". As much as anything else, I want
to remove any danger.

> There is already an assertion:
>     Assert(IndexTupleSize(newitup) <= IndexTupleSize(olditup));
> Do you think it is not enough?

I think that a "can't happen" check will work better in the future, when
user defined code could be involved in truncation. Any extra overhead will
be paid relatively infrequently, and will be very low.

>> * I see a small bug. You forgot to teach _bt_findsplitloc() about
>> truncation. It does this currently, which you did not update:
>>
>>     /*
>>      * The first item on the right page becomes the high key of the left
>>      * page; therefore it counts against left space as well as right
>>      * space.
>>      */
>>     leftfree -= firstrightitemsz;
>>
>> I think that this accounting needs to be fixed.
>
> Could you explain what's wrong with this accounting? We may expect to
> take more space on the left page than will actually be taken after high
> key truncation, but I don't see any problem here.

Obviously it would at least be slightly better to have the actual truncated
high key size where that's expected -- not the would-be untruncated high
key size. The code as it stands might lead to a bad choice of split point
in edge-cases.

At the very least, you should change comments to note the issue. I think
it's highly unlikely that this could ever result in a failure to find a
split point, which there are many defenses against already, but I think I
would find that difficult to prove. The intent of the code is almost as
important as the code, at least in my opinion.

[1] postgr.es/m/CAH2-Wz=VMDH8pFAZX9WAH9Bn5Ast5vrnA0xSz+GsfRs12bp_sg@mail.gmail.com
[2] postgr.es/m/11895.1490983884%40sss.pgh.pa.us

--
Peter Geoghegan
On 4/4/17 2:47 PM, Peter Geoghegan wrote:
> At the very least, you should change comments to note the issue. I
> think it's highly unlikely that this could ever result in a failure to
> find a split point, which there are many defenses against already, but
> I think I would find that difficult to prove. The intent of the code
> is almost as important as the code, at least in my opinion.

This submission has been Returned with Feedback. Please feel free to
resubmit to a future commitfest.

--
-David
david@pgmasters.net
Use case:
- We have a table (c1, c2, c3, c4);
- We need to have a unique index on (c1, c2).
- We would like to have a covering index on all columns to avoid reading heap pages.
Old way:
CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2);
CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4);
What's wrong?
Two indexes contain repeated data, which adds overhead to data manipulation operations and increases database size.
New way:
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDE (c3, c4);
To find out more about the syntax, you can read the related documentation patches and also take a look
at the new test - src/test/regress/sql/index_including.sql.
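As a quick, hedged illustration of the semantics (names reused from above; the regression test is the authoritative reference):

```sql
CREATE TABLE newt (c1 int, c2 int, c3 int, c4 int);
CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDE (c3, c4);

-- Uniqueness is enforced on the key columns (c1, c2) only:
INSERT INTO newt VALUES (1, 1, 7, 7);
INSERT INTO newt VALUES (1, 1, 8, 8);  -- fails: duplicate key (1, 1)
INSERT INTO newt VALUES (1, 2, 7, 7);  -- ok: INCLUDE columns may repeat

-- Both key and INCLUDE columns can be returned by an index-only scan:
EXPLAIN (COSTS OFF) SELECT c1, c2, c3, c4 FROM newt WHERE c1 = 1;
```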
Updated version is attached. It applies to the commit e4fbf22831c2bbcf032ee60a327b871d2364b3f5.
The first patch contains changes in general index routines,
and the second one contains btree-specific changes.
This version contains fixes for the issues mentioned in the thread above and passes all existing tests.
But it still requires review and testing, because the merge was not straightforward.
I especially worry about integration with partitioning. I'll add some more tests in the next message.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi! +1 for pushing this. I'm really looking forward to seeing this in 11.

> On 31 Oct 2017, at 13:21, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> Updated version is attached. It applies to the commit
> e4fbf22831c2bbcf032ee60a327b871d2364b3f5.
> The first patch contains changes in general index routines,
> and the second one contains btree-specific changes.
>
> This version contains fixes for the issues mentioned in the thread above
> and passes all existing tests.
> But it still requires review and testing, because the merge was not
> straightforward.
> I especially worry about integration with partitioning. I'll add some
> more tests in the next message.

I did benchmark tests a year ago, so I'll skip that part in this review.
I've done some stress tests with pgbench, replication etc. Everything was
fine until I plugged in amcheck.

If I create a cluster with this [0] and then:

./pgbench -i -s 50
create index on pgbench_accounts (abalance) include (bid,aid,filler);
create extension amcheck;
--and finally
SELECT bt_index_check(c.oid), c.relname, c.relpages
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'public'
AND c.relpersistence != 't'
AND i.indisready AND i.indisvalid
ORDER BY c.relpages DESC LIMIT 100;
--just copypasted from amcheck docs with minor corrections

Postgres crashes:

TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)

Maybe I'm doing something wrong, or will amcheck support go with a
different patch?

Few minor nitpicks:

0. PgAdmin fails to understand what is going on [1]. It is clearly a
problem of PgAdmin; pg_dump works as expected.
1. ISTM index_truncate_tuple() can be optimized: we only need to reset the
tuple length and infomask. But this should not be a hot path anyway, so I
propose ignoring this for the current version.
2. I've done Grammarly checking :) This comma seems redundant [2]

I don't think any of these items require fixing.

Thanks for working on this, I believe it is important.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/pgscripts/blob/master/install.sh
[1] https://yadi.sk/i/ro9YKFqo3PcwFT
[2] https://github.com/x4m/postgres_g/commit/657c28952d923d8c150e6cabb3bdcbbc44a641b6?diff=unified#diff-640baf2937029728a8d51cccd554c2eeR1291

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Postgres crashes:
> TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>
> Maybe I'm doing something wrong, or will amcheck support go with a
> different patch?

Usually amcheck complaining is a sign of other symptoms. I am marking this
patch as returned with feedback for now, as no updates have been provided
after two weeks.

--
Michael
On Tue, Nov 28, 2017 at 6:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Nov 12, 2017 at 8:40 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> Postgres crashes:
>> TRAP: FailedAssertion("!(((const void*)(&isNull) != ((void*)0)) && (scankey->sk_attno) > 0)", File: "nbtsearch.c", Line: 466)
>>
>> Maybe I'm doing something wrong, or will amcheck support go with a
>> different patch?
>
> Usually amcheck complaining is a sign of other symptoms. I am marking
> this patch as returned with feedback for now, as no updates have been
> provided after two weeks.

It looks like amcheck needs to be patched -- a simple oversight. amcheck is
probably calling _bt_compare() without realizing that internal pages don't
have the extra attributes (just leaf pages, although they should also not
participate in comparisons in respect of included/extra columns). There
were changes to amcheck at one point in the past. That must have slipped
through again. I don't think it's that complicated.

BTW, it would probably be a good idea to use the new Github version's
"heapallindexed" verification [1] for testing this patch. Anastasia will
need to patch the externally maintained amcheck to do this, but it's
probably no extra work, because this is already needed for contrib/amcheck,
and because the heapallindexed check doesn't actually care about index
structure at all.

[1] https://github.com/petergeoghegan/amcheck#optional-heapallindexed-verification

--
Peter Geoghegan
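For concreteness, the kind of verification being suggested would look something like this (a hedged sketch: it assumes the Github amcheck's two-argument form of bt_index_check(), where the second argument enables the "heapallindexed" check, and an index named t_idx):

```sql
CREATE EXTENSION amcheck;

-- Structural check of the index; with the second argument true, also
-- verify that every heap tuple has a matching index tuple:
SELECT bt_index_check('t_idx'::regclass, true);
```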
> On 29 Nov 2017, at 8:45, Peter Geoghegan <pg@bowt.ie> wrote:
>
> It looks like amcheck needs to be patched -- a simple oversight. amcheck
> is probably calling _bt_compare() without realizing that internal pages
> don't have the extra attributes (just leaf pages, although they should
> also not participate in comparisons in respect of included/extra
> columns). There were changes to amcheck at one point in the past. That
> must have slipped through again. I don't think it's that complicated.

There is no doubt that this will be fixed. Therefore I propose moving this
to the next CF with Waiting on Author status.

Best regards, Andrey Borodin.
Hi, Peter!

> On 29 Nov 2017, at 8:45, Peter Geoghegan <pg@bowt.ie> wrote:
>
> It looks like amcheck needs to be patched -- a simple oversight. amcheck
> is probably calling _bt_compare() without realizing that internal pages
> don't have the extra attributes (just leaf pages, although they should
> also not participate in comparisons in respect of included/extra
> columns). There were changes to amcheck at one point in the past. That
> must have slipped through again. I don't think it's that complicated.
>
> BTW, it would probably be a good idea to use the new Github version's
> "heapallindexed" verification [1] for testing this patch. Anastasia will
> need to patch the externally maintained amcheck to do this, but it's
> probably no extra work, because this is already needed for
> contrib/amcheck, and because the heapallindexed check doesn't actually
> care about index structure at all.

Seems like it was not a big deal to patch; I've fixed those bits (see
attachment).
I've done only simple tests for now, but I'm planning to do better testing
before the next CF. Thanks for mentioning "heapallindexed", I'll use it
too.

Best regards, Andrey Borodin.
Attachment
> On 30 Nov 2017, at 23:07, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Seems like it was not a big deal to patch; I've fixed those bits (see
> attachment).
> I've done only simple tests for now, but I'm planning to do better
> testing before the next CF. Thanks for mentioning "heapallindexed", I'll
> use it too.

I've tested the patch with the fixed amcheck (including the
"heapallindexed" feature); the tests included bulk index creation,
pgbenching, and amcheck of the index itself and of a WAL-replicated index.
Everything worked fine.

Spotted one more typo:
> Since 10.0 there is an optional INCLUDE clause
should be
> Since 11.0 there is an optional INCLUDE clause

I think that the patch set (two patches + 1 amcheck diff) is ready for
committer.

Best regards, Andrey Borodin.
Hello!

The patch does not apply currently. Anastasia, can you please rebase the
patch?

Best regards, Andrey Borodin.
Updated patches are attached.
Thank you for your interest in this patch, and sorry for the slow reply.

08.01.2018 21:08, Andrey Borodin wrote:
> Hello!
>
> The patch does not apply currently. Anastasia, can you please rebase the
> patch?
>
> Best regards, Andrey Borodin.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
Hi!

> On 16 Jan 2018, at 21:50, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> Updated patches are attached.

Cool, thanks!

I've looked into the code, but haven't found anything broken. Since I tried
to rebase the patch myself and failed on the parse utils, I've spent some
cycles trying to break the parsing. One minor complaint (no need to fix).

This is fine:
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,upper(filler));
ERROR: expressions are not supported in included columns

But why not the same error here? The previous message is very descriptive.
x4mmm=# create index on pgbench_accounts (bid) include (aid,filler,aid+1);
ERROR: syntax error at or near "+"

This works. But it should not, IMHO:
x4mmm=# create index on pgbench_accounts (bid) include (aid,aid,aid);
CREATE INDEX

Don't know what that is...
x4mmm=# create index on pgbench_accounts (bid) include (aid desc, aid asc);
CREATE INDEX

All these things allow foot-shooting with a small caliber, but do not break
big things.

Unfortunately, amcheck_next does not work on HEAD currently (there are
problems with the AllocSetContextCreate() signature), but I've tested
bt_index_check() before, during and after pgbench, on primary and on
standby. Also, I've checked bt_index_parent_check() on the master. During
the bt_index_check() test, from time to time I was observing:

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.

[install]check[-world] passed :)

From my POV, the patch is in good shape. I think it is time to make the
patch Ready for Committer again.

Best regards, Andrey Borodin.
17.01.2018 11:45, Andrey Borodin:
> Hi!
>> On 16 Jan 2018, at 21:50, Anastasia Lubennikova
>> <a.lubennikova@postgrespro.ru> wrote:
>>
>> Updated patches are attached.
>
> Cool, thanks!
>
> I've looked into the code, but haven't found anything broken.
> [... foot-gun examples snipped ...]
>
> All these things allow foot-shooting with a small caliber, but do not
> break big things.
>
> Unfortunately, amcheck_next does not work on HEAD currently (there are
> problems with the AllocSetContextCreate() signature), but I've tested
> bt_index_check() before, during and after pgbench, on primary and on
> standby. Also, I've checked bt_index_parent_check() on the master.

What is amcheck_next?

> During the bt_index_check() test, from time to time I was observing
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User query might have needed to see row versions that must be
> removed.

Sorry, I forgot to attach the amcheck fix to the previous message.
Now all the patches are in the attachment.
Could you recheck if the error is still there?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
> On 18 Jan 2018, at 18:57, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>
> What is amcheck_next?

amcheck_next is the external version of amcheck, maintained by Peter G. on
his Github. It checks one more thing: that every heap tuple has a twin in
the B-tree - the so-called "heapallindexed" check.

>> During the bt_index_check() test, from time to time I was observing
>> ERROR: canceling statement due to conflict with recovery
>> DETAIL: User query might have needed to see row versions that must be
>> removed.
>
> Sorry, I forgot to attach the amcheck fix to the previous message.

No problem, surely I had fixed that before testing.

> Now all the patches are in the attachment.
> Could you recheck if the error is still there?

No need to do that, I was checking exactly the same codebase.

Best regards, Andrey Borodin.
On Wed, Jan 17, 2018 at 12:45 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Unfortunately, amcheck_next does not work on HEAD currently (there are
> problems with the AllocSetContextCreate() signature), but I've tested
> bt_index_check() before, during and after pgbench, on primary and on
> standby. Also, I've checked bt_index_parent_check() on the master.

I fixed that recently. It should be fine now.

--
Peter Geoghegan
> On 21 Jan 2018, at 3:36, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I fixed that recently. It should be fine now.

Oh, sorry, I missed that - I was using a patched but stale amcheck_next.
Thanks!

Affirmative, amcheck_next works fine. I ran pgbench against several
covering indexes, checking before the load, during it and after it, both on
master and standby. I do not observe any errors besides the infrequent
"canceling statement due to conflict with recovery", which is not a sign of
any malfunction.

Best regards, Andrey Borodin.
I feel sorry for the noise, switching this patch there and back, but the
patch needs a rebase again. It still applies with -3, but does not compile
anymore.

Best regards, Andrey Borodin.

The new status of this patch is: Waiting on Author
Thanks for the reminder. Rebased patches are attached.

21.01.2018 17:45, Andrey Borodin wrote:
> I feel sorry for the noise, switching this patch there and back, but the
> patch needs a rebase again. It still applies with -3, but does not
> compile anymore.
>
> Best regards, Andrey Borodin.
>
> The new status of this patch is: Waiting on Author

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Thanks for the reminder. Rebased patches are attached.

This is a really cool and also difficult feature. Thanks for working on it!
Here are a couple of quick comments on the documentation, since I noticed
it doesn't build:

SGML->XML change: (1) empty closing tags "</>" are no longer accepted, (2)
<xref ...> now needs to be written <xref .../> and (3) xref IDs are now
case-sensitive.

+ PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable> [, ... ] ) <replaceable class="parameter">index_parameters</replaceable> <optional>INCLUDE (<replaceable class="parameter">column_name</replaceable> [, ...])</optional> |

I hadn't seen that use of "<optional>" before. Almost everywhere else we
use explicit [ and ] characters, but I see that there are other examples,
and it is rendered as [ and ] in the output. OK, cool, but I think there
should be some extra whitespace so that it comes out as:

[ INCLUDE ... ]

instead of:

[INCLUDE ...]

to fit with the existing convention.

+ ... This also allows <literal>UNIQUE</> indexes to be defined on
+ one set of columns, which can include another set of columns in the
+ <literal>INCLUDE</> clause, on which the uniqueness is not enforced.
+ It's the same with other constraints (PRIMARY KEY and EXCLUDE). This can
+ also be used for non-unique indexes as any columns which are not required
+ for the searching or ordering of records can be used in the
+ <literal>INCLUDE</> clause, which can slightly reduce the size of the index.

Can I suggest rewording these three sentences a bit? Just an idea:

  <literal>UNIQUE</literal> indexes, <literal>PRIMARY KEY</literal>
  constraints and <literal>EXCLUDE</literal> constraints can be defined
  with extra columns in an <literal>INCLUDE</literal> clause, in which case
  uniqueness is not enforced for the extra columns. Moving columns that are
  not needed for searching, ordering or uniqueness into the
  <literal>INCLUDE</literal> clause can sometimes reduce the size of the
  index while retaining the possibility of using a faster index-only scan.

--
Thomas Munro
http://www.enterprisedb.com
26.01.2018 07:19, Thomas Munro:
> On Fri, Jan 26, 2018 at 3:01 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thanks for the reminder. Rebased patches are attached.
>
> This is a really cool and also difficult feature. Thanks for working on
> it! Here are a couple of quick comments on the documentation, since I
> noticed it doesn't build:
>
> SGML->XML change: (1) empty closing tags "</>" are no longer accepted,
> (2) <xref ...> now needs to be written <xref .../> and (3) xref IDs are
> now case-sensitive.
>
> [... <optional> whitespace and rewording suggestions snipped ...]

Thank you for reviewing. All mentioned issues are fixed.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Thank you for reviewing. All mentioned issues are fixed.

== Applying patch 0002-covering-btree_v4.patch...
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/README.rej
1 out of 1 hunk FAILED -- saving rejects to file
src/backend/access/nbtree/nbtxlog.c.rej

Can we please have a new patch set?

--
Thomas Munro
http://www.enterprisedb.com
06.03.2018 11:52, Thomas Munro:
> On Wed, Jan 31, 2018 at 3:09 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Thank you for reviewing. All mentioned issues are fixed.
>
> == Applying patch 0002-covering-btree_v4.patch...
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/backend/access/nbtree/README.rej
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/backend/access/nbtree/nbtxlog.c.rej
>
> Can we please have a new patch set?

Here it is. Many thanks to Andrey Borodin.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Thu, Mar 8, 2018 at 7:13 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> 06.03.2018 11:52, Thomas Munro:
>> Can we please have a new patch set?
>
> Here it is. Many thanks to Andrey Borodin.

I took a look at this patchset. I have some notes about it.

* I see the patch changes the dblink, amcheck and tcl contribs. It would be
nice to add a corresponding check to the dblink and amcheck regression
tests. It would be good to do the same with the tcn contrib, but tcn
doesn't have regression tests at all, and it's out of scope of this patch
to add regression tests to tcn. So, it's OK to just check that it's working
correctly with covering indexes (I hope that's already been done by other
reviewers).

* I think that the subscription regression tests in src/test/subscription
should make some use of covering indexes. Logical decoding and subscription
are heavily using primary keys, so they need to be tested to work correctly
with covering indexes.

* I also think some isolation tests in src/test/isolation need to check
covering indexes too - in particular insert-conflict-*.spec and
lock-*.spec, and probably more.

* pg_dump doesn't handle old PostgreSQL versions correctly. If I try to
dump a database of PostgreSQL 9.6, pg_dump gives me the following error:

pg_dump: [archiver (db)] query failed: ERROR: column i.indnkeyatts does not exist
LINE 1: ...atalog.pg_get_indexdef(i.indexrelid) AS indexdef, i.indnkeya...
                                                             ^

In fact there is a sequence of "if" ... "else if" blocks in getIndexes()
which selects the appropriate query depending on the remote server version,
and for pre-11 we should use indnatts instead of indnkeyatts (see the
sketch after this message).

* I would also like all the patches in a patchset version to have the same
version number. I understand that "Covering-btree" and "Covering-amcheck"
have fewer previous versions than "Covering-core", but it's way easier to
identify patches belonging to the same patchset version if they have the
same version number. For sure, some patches would then skip some version
numbers, but that doesn't seem to be a problem to me.

* There is a minor formatting issue in this part of the code. Some spaces
need to be replaced with tabs:

+IndexTuple
+index_truncate_tuple(Relation idxrel, IndexTuple olditup)
+{
+    TupleDesc   itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
+    Datum       values[INDEX_MAX_KEYS];
+    bool        isnull[INDEX_MAX_KEYS];
+    IndexTuple  newitup;

* I think this comment needs to be rephrased:

+    /*
+     * Code below is concerned to the opclasses which are not used
+     * with the included columns.
+     */

I would write something like this: "Code below checks opclass key type.
Included columns don't have opclasses, and this check is not required for
them." Native English speakers could provide even better phrasing, though.

* And regarding this README addition:

+Notes to Operator Class Implementors
+------------------------------------

Besides that I really appreciate it, it seems to be unrelated to covering
indexes. I'd like this to be extracted into a separate patch and committed
separately.

Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
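A hedged sketch of the version dependence described in the pg_dump item above (the real getIndexes() query selects many more columns; these two statements only show the substitution):

```sql
-- On an 11+ server (with this patch), pg_dump can read the number of
-- key columns directly:
SELECT i.indexrelid, i.indnkeyatts, i.indnatts
FROM pg_catalog.pg_index i;

-- On a pre-11 server there are no INCLUDE columns, so every index
-- column is a key column and indnatts can stand in for indnkeyatts:
SELECT i.indexrelid, i.indnatts AS indnkeyatts, i.indnatts
FROM pg_catalog.pg_index i;
```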
On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * There is a minor formatting issue in this part of the code. Some
>> spaces need to be replaced with tabs:
>> +IndexTuple
>> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
>> +{
>> +    TupleDesc   itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
>> +    Datum       values[INDEX_MAX_KEYS];
>> +    bool        isnull[INDEX_MAX_KEYS];
>> +    IndexTuple  newitup;

The last time I looked at this patch, in April 2017, I made the point that
we should add something like an "nattributes" argument to
index_truncate_tuple() [1], rather than always using
IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple(). I
can see that that change hasn't been made since that time.

With that approach, we can avoid relying on catalog metadata to the same
degree, which was a specific concern that Tom had around that time. It's
easy to do something with t_tid's offset, which is unused in internal page
IndexTuples. We do very similar things in GIN, where alternative use of an
IndexTuple's t_tid supports all kinds of enhancements, some of which were
not originally anticipated. Alexander surely knows more about this than I
do, since he wrote that code.

Having this index_truncate_tuple() "nattributes" argument, and storing the
number of attributes directly, improves quite a lot of things:

* It makes diagnosing issues in the field quite a bit easier. Tools like
pg_filedump can do the right thing (Tom mentioned pg_filedump and amcheck
specifically). The nbtree IndexTuple format should not need to be
interpreted in a context-sensitive way, if we can avoid it.

* It lets you use index_truncate_tuple() for regular suffix truncation in
the future. These INCLUDE IndexTuples are really just a special case of
suffix truncation. At least, they should be, because otherwise an eventual
suffix truncation feature is going to be incompatible with the INCLUDE
tuple format. *Not* doing this makes suffix truncation harder. Suffix
truncation is a classic technique, first described by Bayer in 1977, and we
are very probably going to add it someday.

* Once you can tell a truncated IndexTuple from a non-truncated one with
little or no context, you can add defensive assertions in various places
where they're helpful. Similarly, amcheck can use and expect this as a
cross-check against IndexRelationGetNumberOfKeyAttributes(). This will
increase confidence in the design, both initially and over time.

I must say that I am disappointed that nothing has happened here,
especially because this really wasn't very much additional work, and has
essentially no downside. I can see that it doesn't work that way in the
Postgres Pro fork [2], and diverging from that may inconvenience Postgres
Pro, but that's a downside of forking. I don't think that the community
should have to absorb that cost.

> +Notes to Operator Class Implementors
> +------------------------------------
>
> Besides that I really appreciate it, it seems to be unrelated to covering
> indexes. I'd like this to be extracted into a separate patch and
> committed separately.

Commit 3785f7ee, from last month, moved the original "Notes to Operator
Class Implementors" section to the SGML docs. It looks like that README
section was accidentally reintroduced during rebasing. The new information
("Included attributes in B-tree indexes") should be moved over to the new
section of the user docs -- the section that 3785f7ee added.

[1] https://postgr.es/m/CAH2-Wzm9y59h2m6iZjM4fpdUP5r4bsRVzGbN2gTRCO1j4nZmtw@mail.gmail.com
[2] https://github.com/postgrespro/postgrespro/blob/PGPRO9_5/src/backend/access/common/indextuple.c#L451

--
Peter Geoghegan
On Thu, Mar 22, 2018 at 8:23 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * There is minor formatting issue in this part of code. Some spaces need
>> to be replaced with tabs.
>> +IndexTuple
>> +index_truncate_tuple(Relation idxrel, IndexTuple olditup)
>> +{
>> +   TupleDesc itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
>> + Datum values[INDEX_MAX_KEYS];
>> + bool isnull[INDEX_MAX_KEYS];
>> + IndexTuple newitup;
The last time I looked at this patch, in April 2017, I made the point
that we should add something like an "nattributes" argument to
index_truncate_tuple() [1], rather than always using
IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().
I can see that that change hasn't been made since that time.
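To be concrete, the change amounts to something like this (an untested
sketch of the idea only, not the patch's actual code):

IndexTuple
index_truncate_tuple(Relation idxrel, IndexTuple olditup, int nattributes)
{
	TupleDesc	itupdesc = CreateTupleDescCopyConstr(RelationGetDescr(idxrel));
	Datum		values[INDEX_MAX_KEYS];
	bool		isnull[INDEX_MAX_KEYS];
	IndexTuple	newitup;

	/* caller decides how many leading attributes to keep */
	Assert(nattributes > 0 && nattributes <= itupdesc->natts);

	index_deform_tuple(olditup, itupdesc, values, isnull);

	/* form a copy that keeps only the first nattributes attributes */
	itupdesc->natts = nattributes;
	newitup = index_form_tuple(itupdesc, values, isnull);
	newitup->t_tid = olditup->t_tid;

	FreeTupleDesc(itupdesc);
	return newitup;
}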
With that approach, we can avoid relying on catalog metadata to the
same degree, which was a specific concern that Tom had around that
time. It's easy to do something with t_tid's offset, which is unused
in internal page IndexTuples. We do very similar things in GIN, where
alternative use of an IndexTuple's t_tid supports all kinds of
enhancements, some of which were not originally anticipated. Alexander
surely knows more about this than I do, since he wrote that code.
Having this index_truncate_tuple() "nattributes" argument, and storing
the number of attributes directly improves quite a lot of things:
* It makes diagnosing issues in the field quite a bit easier. Tools
like pg_filedump can do the right thing (Tom mentioned pg_filedump and
amcheck specifically). The nbtree IndexTuple format should not need to
be interpreted in a context-sensitive way, if we can avoid it.
* It lets you use index_truncate_tuple() for regular suffix truncation
in the future. These INCLUDE IndexTuples are really just a special
case of suffix truncation. At least, they should be, because otherwise
an eventual suffix truncation feature is going to be incompatible with
the INCLUDE tuple format. *Not* doing this makes suffix truncation
harder. Suffix truncation is a classic technique, first described by
Bayer in 1977, and we are very probably going to add it someday.
* Once you can tell a truncated IndexTuple from a non-truncated one
with little or no context, you can add defensive assertions in various
places where they're helpful. Similarly, amcheck can use and expect
this as a cross-check against IndexRelationGetNumberOfKeyAttributes().
This will increase confidence in the design, both initially and over
time.
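For instance, the amcheck cross-check for a tuple taken from an internal
page could be as simple as this (a sketch -- BTreeTupleGetNAtts() is an
invented accessor for the stored attribute count, not anything that
exists today):

/* internal page tuples should only have key attributes */
if (BTreeTupleGetNAtts(itup, state->rel) !=
	IndexRelationGetNumberOfKeyAttributes(state->rel))
	ereport(ERROR,
			(errcode(ERRCODE_INDEX_CORRUPTED),
			 errmsg("index tuple has wrong number of attributes in index \"%s\"",
					RelationGetRelationName(state->rel))));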
I must say that I am disappointed that nothing has happened here,
especially because this really wasn't very much additional work, and
has essentially no downside. I can see that it doesn't work that way
in the Postgres Pro fork [2], and diverging from that may
inconvenience Postgres Pro, but that's a downside of forking. I don't
think that the community should have to absorb that cost.
> +Notes to Operator Class Implementors
> +------------------------------------
>
> Though I really appreciate it, it seems to be unrelated to the covering
> indexes.
> I'd like this to be extracted into a separate patch and be committed
> separately.
Commit 3785f7ee, from last month, moved the original "Notes to
Operator Class Implementors" section to the SGML docs. It looks like
that README section was accidentally reintroduced during rebasing. The
new information ("Included attributes in B-tree indexes") should be
moved over to the new section of the user docs -- the section that
3785f7ee added.

[1] https://postgr.es/m/CAH2-Wzm9y59h2m6iZjM4fpdUP5r4bsRVzGbN2gTRCO1j4nZmtw@mail.gmail.com
[2] https://github.com/postgrespro/postgrespro/blob/PGPRO9_5/src/backend/access/common/indextuple.c#L451

--
Peter Geoghegan
On Sat, Mar 24, 2018 at 12:39 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> +1, putting "nattributes" to argument of index_truncate_tuple() would
> make this function way more universal.

Great.

> Originally that code was written by Teodor, but I also put my hands there.
> In GIN entry tree, item pointers are stored in a posting list which is
> located after index tuple attributes. So, both t_tid block number and
> offset are used for GIN needs.

Well, you worked on the posting list compression stuff, at least. :-)

> That makes sense. Let's store the number of tuple attributes to t_tid.
> Assuming that our INDEX_MAX_KEYS is quite small number, we will have
> higher bits of t_tid free for later use.
I was going to say that you could just treat the low bit in the t_tid
offset as representing "see catalog entry". My first idea was that
nothing would have to change about the existing format, since internal
page items already have only the low bit set within their offset.
However, I now see that that won't really work, because we don't
change the offset in high keys when they're copied from a real item
during a page split. Whatever we do, it has to work equally well for
all "separator keys" -- that is, it must work for both downlinks in
internal pages, and all high keys (including high keys at the leaf
level).
A good solution is to use the unused 13th t_bit. If hash can have a
INDEX_MOVED_BY_SPLIT_MASK, then nbtree can have a INDEX_ALT_TID_MASK.
This avoids a BTREE_VERSION bump, and allows us to deal with the
highkey offset issue. Actually, it's even more flexible than that --
it can work with ordinary leaf tuples in the future, too. That is, we
can eventually implement prefix truncation and deduplication at the
leaf level using this representation, since there is nothing that
limits INDEX_ALT_TID_MASK IndexTuples to "separator keys".
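In itup.h terms, that would mean claiming the one spare t_info bit,
something like this (a sketch; the 0x2000 bit really is the only one
that is unused today, though the name is of course provisional):

#define INDEX_SIZE_MASK		0x1FFF	/* low 13 bits: tuple size */
#define INDEX_ALT_TID_MASK	0x2000	/* proposed: t_tid has alternative meaning */
#define INDEX_VAR_MASK		0x4000	/* has variable-width attribute(s) */
#define INDEX_NULL_MASK		0x8000	/* has null attribute(s) */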
The main difference between this approach to leaf prefix
truncation/compression/deduplication, and the GIN entry tree's posting
list representation would be that it wouldn't have to be
super-optimized for duplicates, at the expense of the more common case
for regular nbtree indexes -- having few or no duplicates. A btree_gin
index on pgbench_accounts(aid) looks very similar to an equivalent
nbtree index if you just compare internal pages from each, but they
look quite different at the leaf level, where GIN has 24 byte
IndexTuples instead of 16 byte IndexTuples. Of course, this is
because the leaf pages have posting lists that can never be simple
heap pointer TIDs.
A secondary goal of this INDEX_ALT_TID_MASK representation should be
that it won't even be necessary to know that an IndexTuple is
contained within a leaf page rather than an index page (again, unlike
GIN). I'm pretty confident that we can have a truly universal
IndexTuple representation for nbtree, while supporting all of these
standard optimizations.
Sorry for going off on a tangent, but I think it's somewhat necessary
to have a strategy here. Of course, we don't have to get everything
right now, but we should be looking in this direction whenever we talk
about on-disk nbtree changes.
--
Peter Geoghegan
On 3/26/18 6:10 AM, Alexander Korotkov wrote:
>
> So, as I get you're proposing to introduce INDEX_ALT_TID_MASK flag
> which would indicate that we're storing something special in the t_tid
> offset. And that should help us not only for covering indexes, but also for
> further btree enhancements including suffix truncation. What exactly do
> you propose to store into t_tid offset when INDEX_ALT_TID_MASK flag
> is set? Is it number of attributes in this particular index tuple?

It appears that discussion and review of this patch is ongoing so it
should not be marked Ready for Committer. I have changed it to Waiting
on Author since there are several pending reviews and at least one bug.

Regards,
--
-David
david@pgmasters.net
On Mon, Mar 26, 2018 at 3:10 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> So, as I get you're proposing to introduce INDEX_ALT_TID_MASK flag
> which would indicate that we're storing something special in the t_tid
> offset. And that should help us not only for covering indexes, but also for
> further btree enhancements including suffix truncation. What exactly do
> you propose to store into t_tid offset when INDEX_ALT_TID_MASK flag
> is set? Is it number of attributes in this particular index tuple?

Yes. I think that once INDEX_ALT_TID_MASK is available, we should store
the number of attributes in that particular "separator key" tuple
(which has undergone suffix truncation), and always work off of that.
You could then have status bits in offset as follows:

* 1 bit that represents that this is a "separator key" IndexTuple (high
key or internal IndexTuple). Otherwise, it's a leaf IndexTuple with an
ordinary heap TID. (When INDEX_ALT_TID_MASK isn't set, it's the same as
today.)

* 3 reserved bits. I think that one of these bits can eventually be
used to indicate that the internal IndexTuple actually has a
"normalized key" representation [1], which seems like the best way to
do suffix truncation, long term. I think that we should support simple
suffix truncation, of the kind that this patch implements, alongside
normalized key suffix truncation. We need both for various reasons [2].
Not sure what the other two flag bits might be used for, but they seem
worth having.

* 12 bits for the number of attributes, which should be more than
enough, even when INDEX_MAX_KEYS is significantly higher than 32. A
static assertion can keep this safe when INDEX_MAX_KEYS is set
ridiculously high.

I think that this scheme is future-proof. Maybe you have additional
ideas on the representation. Please let me know what you think.

When we eventually add optimizations that affect IndexTuples on the
leaf level, we can start using the block number (bi_hi + bi_lo) itself,
much like GIN posting lists. No need to further consider that (the leaf
level optimizations) today, because using block number provides us with
many more bits. In internal page items, the block number is always a
block number, so internal IndexTuples are rather like GIN posting tree
pointers in the main entry tree (its leaf level) -- a conventional item
pointer block number is used, alongside unconventional use of the
offset field, where there are 16 bits available because no real offset
is required.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Optimizations_enabled_by_key_normalization
[2] https://wiki.postgresql.org/wiki/Key_normalization#How_big_can_normalized_keys_get.2C_and_is_it_worth_it.3F

--
Peter Geoghegan
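To make the bit budget described in the message above concrete, the
encoding might look something like this (a sketch only, with invented
macro names, not code from any committed patch):

/* alternative use of the offset field when INDEX_ALT_TID_MASK is set */
#define BT_PIVOT_KEY_FLAG		0x8000	/* 1 bit: separator key, not a leaf tuple */
#define BT_RESERVED_OFFSET_BITS	0x7000	/* 3 bits: reserved (e.g. normalized keys) */
#define BT_N_ATTS_OFFSET_MASK	0x0FFF	/* 12 bits: number of attributes present */

/* hypothetical accessor: stored count if flagged, else the catalog's count */
#define BTreeTupleGetNAtts(itup, rel) \
	(((itup)->t_info & INDEX_ALT_TID_MASK) ? \
	 ((itup)->t_tid.ip_posid & BT_N_ATTS_OFFSET_MASK) : \
	 IndexRelationGetNumberOfAttributes(rel))

/* a static assertion keeps this safe if INDEX_MAX_KEYS is set ridiculously high */
StaticAssertStmt(INDEX_MAX_KEYS <= BT_N_ATTS_OFFSET_MASK,
				 "INDEX_MAX_KEYS does not fit in offset status bits");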
> The last time I looked at this patch, in April 2017, I made the point
> that we should add something like an "nattributes" argument to
> index_truncate_tuple() [1], rather than always using
> IndexRelationGetNumberOfKeyAttributes() within index_truncate_tuple().

Agree, it looks logical, because a) reading the code will be simpler,
and b) the function will be usable for any future purpose.

> Having this index_truncate_tuple() "nattributes" argument, and storing
> the number of attributes directly improves quite a lot of things:

Storing the number of attributes in the now-unused t_tid seems to me
not such a good idea. a) It could (and, I suppose, should) be a
separate patch; at least it's not directly connected to the covering
patch, and it could be added even before the covering patch. b) I don't
like the idea of limiting usage of that field if we can avoid it.
Future work could use it, for example, for different compression
techniques or something else.

> * It makes diagnosing issues in the field quite a bit easier. Tools
> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
> amcheck specifically). The nbtree IndexTuple format should not need to
> be interpreted in a context-sensitive way, if we can avoid it.

Both pg_filedump and amcheck could correctly parse any tuple based on
BTP_LEAF flags and the length of the tuple.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> b) I don't like the idea of limiting usage of that field if we can
> avoid it. Future work could use it, for example, for different
> compression techniques or something else.

Or even removing t_tid from inner tuples to save 2 bytes in
IndexTupleData. Of course, I remember about alignment, but it could be
subject to change in the future too.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, Mar 27, 2018 at 10:07 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Storing the number of attributes in the now-unused t_tid seems to me
> not such a good idea. a) It could (and, I suppose, should) be a
> separate patch; at least it's not directly connected to the covering
> patch, and it could be added even before the covering patch.

I think that we should do that first. It's not very hard.

> b) I don't like the idea of limiting usage of that field if we can
> avoid it. Future work could use it, for example, for different
> compression techniques or something else.

The extra status bits that this would leave within the offset field
can be used for that in the future.

>> * It makes diagnosing issues in the field quite a bit easier. Tools
>> like pg_filedump can do the right thing (Tom mentioned pg_filedump and
>> amcheck specifically). The nbtree IndexTuple format should not need to
>> be interpreted in a context-sensitive way, if we can avoid it.
>
> Both pg_filedump and amcheck could correctly parse any tuple based on
> BTP_LEAF flags and the length of the tuple.

amcheck doesn't just care about the length of the tuple. It would have
to rely on catalog metadata about this being an INCLUDE index.

--
Peter Geoghegan
On Tue, Mar 27, 2018 at 10:14 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Or even removing t_tid from inner tuples to save 2 bytes in
> IndexTupleData. Of course, I remember about alignment, but it could be
> subject to change in the future too.

This is contradictory. You seem to be arguing that we need to preserve
on-disk compatibility for an optimization that throws out on-disk
compatibility.

Saving a single byte per internal IndexTuple is not worth it. We could
actually save 2 bytes in *all* nbtree pages, by halving the size of
ItemId for nbtree -- we don't need lp_len, which is redundant, and we
could reclaim one of the status bits too, to get back a full 16 bits.
Also, we could use suffix truncation to save at least one byte in
almost all cases, even with the thinnest possible
single-integer-attribute IndexTuples.

What you describe just isn't going to happen.

--
Peter Geoghegan
Hi!

> On 21 March 2018, at 21:51, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>
> I took a look at this patchset. I have some notes about it.
>
> * I see patch changes dblink, amcheck and tcl contribs. It would be
> nice to add corresponding checks to dblink and amcheck regression
> tests. It would be good to do the same with the tcn contrib. But tcn
> doesn't have regression tests at all. And it's out of scope of this
> patch to add regression tests to tcn. So, it's OK to just check that
> it's working correctly with covering indexes (I hope it's already done
> by other reviewers).

I propose the attached tests for amcheck and dblink. They are not very
extensive, but enough to keep things working.

> * I think that subscription regression tests in src/test/subscription
> should have some use of covering indexes. Logical decoding and
> subscription are heavily using primary keys. So they need to be tested
> to work correctly with covering indexes.

I've attached subscription tests. Unfortunately, they crash the publisher with

2018-03-28 15:09:05.953 +05 [81805] 001_rep_changes.pl LOG: statement: DELETE FROM tab_cov WHERE a > 20
2018-03-28 15:09:05.954 +05 [81691] LOG: server process (PID 81805) was terminated by signal 11: Segmentation fault

Either of these commands leads to the crash:

$node_publisher->safe_psql('postgres', "DELETE FROM tab_cov WHERE a > 20");
$node_publisher->safe_psql('postgres', "UPDATE tab_cov SET a = -a");

I didn't succeed in debugging. Maybe Anastasia can comment: is it a
bug, or is it something wrong with the tests?

> * I also think some isolation tests in src/test/isolation need to
> check covering indexes too. In particular insert-conflict-*.spec and
> lock-*.spec and probably more.

Currently, I couldn't compose good test scenarios, but I will think a
bit more about it.

Best regards, Andrey Borodin.
Attachment
Here is the new version of the patch set. All patches are rebased to
apply without conflicts.

Besides, they contain the following fixes:
- pg_dump bug is fixed
- index_truncate_tuple() now has a 3rd argument, new_indnatts.
- new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
- info about opclass implementors and included columns is now in the sgml doc

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 2018-03-28 16:59, Anastasia Lubennikova wrote:
> Here is the new version of the patch set.

I can't get these to apply:

patch -b -l -F 25 -p 1 < /home/aardvark/download/pgpatches/0110/covering_indexes/20180328/0001-Covering-core-v8.patch

1 out of 19 hunks FAILED -- saving rejects to file src/backend/utils/cache/relcache.c.rej

$ cat src/backend/utils/cache/relcache.c.rej
--- src/backend/utils/cache/relcache.c
+++ src/backend/utils/cache/relcache.c
@@ -542,7 +542,7 @@
 		attp = (Form_pg_attribute) GETSTRUCT(pg_attribute_tuple);

 		if (attp->attnum <= 0 ||
-			attp->attnum > relation->rd_rel->relnatts)
+			attp->attnum > RelationGetNumberOfAttributes(relation))
 			elog(ERROR, "invalid attribute number %d for %s",
 				 attp->attnum, RelationGetRelationName(relation));

Erik Rijkers
On 1/25/18 23:19, Thomas Munro wrote:
> + PRIMARY KEY ( <replaceable class="parameter">column_name</replaceable> [, ... ] )
> <replaceable class="parameter">index_parameters</replaceable> <optional>INCLUDE
> (<replaceable class="parameter">column_name</replaceable> [, ...])</optional> |
>
> I hadn't seen that use of "<optional>" before. Almost everywhere else
> we use explicit [ and ] characters, but I see that there are other
> examples, and it is rendered as [ and ] in the output.

I think this will probably not come out right in the generated psql
help. Check that please.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:
> Here is the new version of the patch set.
> All patches are rebased to apply without conflicts.
>
> Besides, they contain following fixes:
> - pg_dump bug is fixed
> - index_truncate_tuple() now has 3rd argument new_indnatts.
> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
> - info about opclass implementors and included columns is now in sgml doc
This only changes the arguments given to index_truncate_tuple(), which
is a superficial change. It does not actually change anything about
the on-disk representation, which is what I sought. Why is that a
problem? I don't think it's very complicated.
The patch needs a rebase, as Erik mentioned:
1 out of 19 hunks FAILED -- saving rejects to file
src/backend/utils/cache/relcache.c.rej
(Stripping trailing CRs from patch; use --binary to disable.)
I also noticed that you still haven't done anything differently with
this code in _bt_checksplitloc(), which I mentioned in April of last
year:
/* Account for all the old tuples */
leftfree = state->leftspace - olddataitemstoleft;
rightfree = state->rightspace -
(state->olddataitemstotal - olddataitemstoleft);
/*
* The first item on the right page becomes the high key of the left page;
* therefore it counts against left space as well as right space.
*/
leftfree -= firstrightitemsz;
/* account for the new item */
if (newitemonleft)
leftfree -= (int) state->newitemsz;
else
rightfree -= (int) state->newitemsz;
With an extreme enough case, this could result in a failure to find a
split point. Or at least, if that isn't true then it's not clear why,
and I think it needs to be explained.

--
Peter Geoghegan
On Fri, Mar 30, 2018 at 2:33 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 28, 2018 at 7:59 AM, Anastasia Lubennikova
> <a.lubennikova@postgrespro.ru> wrote:
>> Here is the new version of the patch set.
>> All patches are rebased to apply without conflicts.
>>
>> Besides, they contain following fixes:
>> - pg_dump bug is fixed
>> - index_truncate_tuple() now has 3rd argument new_indnatts.
>> - new tests for amcheck, dblink and subscription/t/001_rep_changes.pl
>> - info about opclass implementors and included columns is now in sgml doc
>
> This only changes the arguments given to index_truncate_tuple(), which
> is a superficial change. It does not actually change anything about
> the on-disk representation, which is what I sought. Why is that a
> problem? I don't think it's very complicated.

I'll try it. But I'm afraid that it's not as easy as you expect.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Fri, Mar 30, 2018 at 10:39 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It's safe, although I admit that that's a bit hard to see.
> Specifically, look at this code in _bt_insert_parent():
>
> /*
>  * Find the parent buffer and get the parent page.
>  *
>  * Oops - if we were moved right then we need to change stack item! We
>  * want to find parent pointing to where we are, right ? - vadim
>  * 05/27/97
>  */
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
>
> Vadim doesn't seem too sure of why he did it that way. What's clear is
> that the offset on all internal pages is always P_HIKEY (that is, 1),
> because this is the one and only place where new IndexTuples get
> generated for internal pages. That's unambiguous. So how could
> BTEntrySame() truly need to care about offset? How could there ever be
> an internal page offset that wasn't just P_HIKEY? You can look
> yourself, using pg_hexedit or pageinspect.

Sorry, I meant this code, right before:

/* form an index tuple that points at the new right page */
new_item = CopyIndexTuple(ritem);
ItemPointerSet(&(new_item->t_tid), rbknum, P_HIKEY);

--
Peter Geoghegan
On Fri, Mar 30, 2018 at 6:24 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> With an extreme enough case, this could result in a failure to find a
>> split point. Or at least, if that isn't true then it's not clear why,
>> and I think it needs to be explained.
>
> I don't think this could result in a failure to find a split point.
> So, it finds a split point without taking into account that the hikey
> will be shorter. If such a split point exists, then a split point with
> a truncated hikey should also exist. If not, then it would be a
> failure even without covering indexes. I've updated the comment
> accordingly.

You're right. We're going to truncate the unneeded trailing attributes
from whatever tuple is to the immediate right of the final split point
that we choose (that's the tuple that we'll copy to make a new high
key for the left page). Truncation already has to result in a tuple
that is less than or equal to the original tuple.

I also agree that it isn't worth trying harder to make sure that space
is distributed evenly when truncation will go ahead. It will only
matter in very rare cases, but the computational overhead of having an
accurate high key size for every candidate split point would be
significant.

--
Peter Geoghegan
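In _bt_split() terms, the flow comes down to something like this when
forming the left page's new high key (a sketch of the call site, with
approximate variable names, not the patch's actual code):

/* the first post-split item on the right becomes the left page's high key */
itemid = PageGetItemId(origpage, firstright);
item = (IndexTuple) PageGetItem(origpage, itemid);

if (isleaf && indnkeyatts < indnatts)
	/* keep only the key attributes in the new high key */
	lefthikey = index_truncate_tuple(rel, item, indnkeyatts);
else
	lefthikey = item;	/* the whole tuple separates the halves, as before */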
On Fri, Mar 30, 2018 at 4:08 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> I'll try it. But I'm afraid that it's not as easy as you expect.
>
>
> So, I have some implementation of storage of number of attributes inside
> index tuple itself. I made it as additional patch on top of previous
> patchset.
> I attach the whole patchset in order to make commitfest.cputube.org happy.
Looks like 0004-* became mangled. Can you send a version that is not
mangled, please?
> I decided not to use 13th bit of IndexTuple flags. Instead I use only
> high bit of offset which is also always free on regular tuples. In
> fact, we already use assumption that there is at most 11 significant
> bits of index tuple offset in GIN (see ginpostinglist.c).
So? GIN doesn't have the same legacy at all. The GIN posting lists
*don't* have regular heap TID pointers at all. They started out
without them, and still don't have them.

> Anastasia also pointed that if we're going to do on-disk changes, they
> should be compatible not only with suffix truncation, but also with
> duplicate compression (which was already posted in thread [1]).

I definitely agree with that, and I think that Anastasia should push
for whatever will make future nbtree enhancements easier, especially
her own pending or planned enhancements.

> However, I think there is no problem. We can use one of 3 free bits in
> offset as flag that it's tuple with posting list. Duplicates
> compression needs to store number of posting list items and their
> offset in the tuple. Free bits left in item pointer after reserving 2
> bits (1 flag of alternative meaning of offset and 1 flag of posting
> list) is far enough for that.

The issue that I see is that we could easily make this unambiguous,
free of any context, with a tiny bit more work. Why not just do it that
way? Maybe it won't actually matter, but I see no reason not to do it,
since we can.
> However, I find following arguments against implementing this feature
> in covering indexes.
>
> * We write number of attributes present into tuple. But how to prove
> that it's correct? I add appropriate checks to amcheck. But I don't
> think all the users run amcheck frequently enough. Thus, in order to
> be sure that it's correct, we should check that the number of
> attributes is written correctly everywhere in the B-tree code.
Use an assertion. Problem solved.

I agree that people aren't using amcheck all that much, but give it
time. Oracle and SQL Server have had tools like amcheck for 30+ years.
We have had amcheck for one year.

> Without that we can face the situation that we've introduced a new
> on-disk representation better suited to further B-tree enhancements,
> but actually it's broken. And that would be much worse than nothing.
> In order to check the number of attributes everywhere in the B-tree
> code, we need to actually implement a significant part of suffix
> compression. And I really think we shouldn't do this as part of the
> covering indexes patch.

I don't think that you need to do that, actually. I'm not asking you
to go to those lengths. I have only asked that you make the on-disk
representation *compatible* with a future Postgres version that has
full suffix truncation (and other such enhancements, too). I care
about the on-disk representation more than the code.
> * Offset number is used now for parent refind (see BTEntrySame() macro).
> In the attached patch, this condition is relaxed. But I don't think I
> really like that. This should be thought out very carefully...
It's safe, although I admit that that's a bit hard to see.
Specifically, look at this code in _bt_insert_parent():
/*
* Find the parent buffer and get the parent page.
*
* Oops - if we were moved right then we need to change stack item! We
* want to find parent pointing to where we are, right ? - vadim
* 05/27/97
*/
ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
Vadim doesn't seem too sure of why he did it that way. What's clear is
that the offset on all internal pages is always P_HIKEY (that is, 1),
because this is the one and only place where new IndexTuples get
generated for internal pages. That's unambiguous. So how could
BTEntrySame() truly need to care about offset? How could there ever be
an internal page offset that wasn't just P_HIKEY? You can look
yourself, using pg_hexedit or pageinspect.
The comments above BTTidSame()/BTEntrySame() are actually wrong,
including "New Comments". Vadim wanted to make TIDs part of the
keyspace [1], beginning in around 1997. The idea was that we'd have
truly unique keys by including TID, as L&Y intended, but that never
happened. Instead, we got commit 9e85183bf in 2000, which among many
other things changed the L&Y invariant to deal with duplicates. I
think that Tom should have changed BTTidSame() to not care about
offset number in that same commit from 2000.
I actually think that Vadim was correct to want to make heap TID a
unique-ifier, and that that's the best long term solution [2].
Unfortunately, the code that he committed in the late 1990s didn't
really help -- how could it help without including the *entire* heap
TID? This BTTidSame() offset thing seems to be related to some weird
logic for duplicates that Tom killed in 9e85183bf, if it ever made
sense. Note that _bt_getstackbuf(), the only code that uses
BTEntrySame(), does not look at the offset directly -- because it's
always P_HIKEY.
Anyway...
> * Now, hikeys are copied together with original t_tid's. That makes it
> possible to find the origin of this hikey. If we override offset in
> t_tid, that becomes not always possible.
....that just leaves the original high key at the leaf level, as you
say here. You're right that there is theoretically a loss of forensic
information from actually storing something in the offset at the leaf
level, and storing something interesting in the offset during the
first phase of a page split (not the second, where the aforementioned
_bt_insert_parent() function gets called). I don't think it's worth
worrying about, though.
The fact is that that information can go out of date almost
immediately, whereas high keys usually last forever. The only reason
that there is a heap TID in the high key is because we'd have to add
special code to remove it; not because it has any real value. I find
it very hard to imagine it being used in a forensic situation. If you
actually wanted to do this, the key itself is probably enough -- you
probably wouldn't need the TID.
> * When index tuple is truncated, then pageinspect probably shouldn't
> show offset for it, because it is meaningless. Should it rather show
> number of attributes in a separate column? Anyway, that should be part
> of the suffix truncation patch, not part of the covering indexes
> patch, especially added at the last moment.
Nobody asked you to write a suffix truncation patch. That has
complexity above and beyond what the covering index patch needs. I
just expect it to be compatible with an eventual suffix truncation
patch, which you've now shown is quite possible. It is clearly a
complimentary technique.
> * I don't really see how covering indexes without storing the number
> of index tuple attributes in the tuple itself block future work on
> suffix truncation.
It makes it harder. Your new version gives amcheck a way of
determining the expected number of attributes. That's the main reason
to have it, more so than the suffix truncation issue.
Suffix truncation matters a lot too, though.
> So, taking into account the arguments above, I propose to give up on
> the idea of sticking the covering indexes and suffix truncation
> features together. That wouldn't accelerate the appearance of one
> feature after the other, but rather would likely RIP both of them...
I think that the thing that's more likely to kill this patch is the
fact that after the first year, it only ever got discussed in the
final CF. That's not something that happened because of my choices. I
made several offers of my time. I did not create this urgency.
[1] https://www.postgresql.org/message-id/18788.963953289@sss.pgh.pa.us
[2] https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute

--
Peter Geoghegan
Attachment
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN never stored heap TID pointers in t_tid of index tuple. But GIN
> assumes that heap TID pointer has at most 11 significant bits during
> posting list encoding.
I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I will admit to being paranoid here. But experience suggests that
paranoia is a good thing, if it isn't too expensive. Look at the thread
on XFS + fsync() for an example of things being wrong for a very long
time without anyone realizing, and despite the best efforts of many
smart people. As far as anyone can tell, PostgreSQL on Linux + XFS is
kinda, sorta broken, and has been forever. XFS was mature before ext4
was, and is a popular choice, and yet this is the first we're hearing
about it being kind of broken. After many years.

Look at this check that made it into my amcheck patch, that was
committed yesterday:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745

As it says, nbtree is surprisingly tolerant of corrupt lp_len fields.
You may find it an interesting exercise to use pg_hexedit to corrupt
many lp_len fields in an index page. What's really interesting about
this is that it doesn't appear to break anything at all! We don't get
the length from there in most cases, so reads won't break at all. I see
that we use ItemIdGetLength() in a couple of rare cases (though even
those could be avoided) during a page split. You'd be lucky to notice a
problem if lp_len fields were regularly corrupt. When you notice, it
will probably have already caused big problems.

On a similar note, I've noticed that many of my experimental B-Tree
patches (that I never find time to finish) tend to almost work quite
early on, sometimes without my really understanding why. The whole L&Y
approach of recovering from problems that were detected (detecting
concurrent page splits, and moving right) makes the code *very*
forgiving. I hope that I don't sound trite, but everyone should try to
be modest about what they *don't* know when writing complex system
software with concurrency. It is not a platitude, even though it
probably seems that way. A tiny mistake can have big consequences, so
it's very important that we have a way to easily detect them after the
fact.
> I don't think we should use assertions, because they are typically
> disabled on production PostgreSQL builds. But we can have some explicit
> check in some common path. In the attached patch I've added such a
> check to _bt_compare(). Probably, together with amcheck, that would be
> sufficient.
Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:
> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> + /*
> + * Check tuple has correct number of attributes.
> + */
> + if (!_bt_check_natts(rel, page, offnum))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("tuple has wrong number of attributes in index \"%s\"",
> + RelationGetRelationName(rel))));
> + }
> +
> itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work.
Also, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).
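That is, something like this (untested, and assuming the revised
_bt_check_natts() signature suggested above):

itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

if (unlikely(!_bt_check_natts(rel, itup, page, offnum)))
	ereport(ERROR,
			(errcode(ERRCODE_INTERNAL_ERROR),
			 errmsg("tuple has wrong number of attributes in index \"%s\"",
					RelationGetRelationName(rel))));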
Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.
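The shape I'd expect _bt_check_natts() to take is roughly this (a
sketch, reusing the invented BTreeTupleGetNAtts() accessor from
earlier; the actual patch may well differ):

static bool
_bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
{
	int16		natts = IndexRelationGetNumberOfAttributes(rel);
	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
	IndexTuple	itup;

	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

	if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
		return BTreeTupleGetNAtts(itup, rel) == natts;		/* ordinary leaf tuple */
	else if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
		return BTreeTupleGetNAtts(itup, rel) == 0;			/* "minus infinity" item */
	else
		return BTreeTupleGetNAtts(itup, rel) == nkeyatts;	/* pivot tuple */
}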
>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know. When I wrote my own implementation of B-tree and
> debugged it, I found saving hikeys "as is" to be very valuable for
> debugging.
I would like to see your implementation at some point. That sounds interesting.
> However, B-trees in PostgreSQL are quite mature, and probably
> don't need so much debug information.
Today, the highkey at the leaf level is an exact copy of the right
sibling's first item immediately after the split. The absence of a
usable heap TID offset (due to using it for number of attributes in
high keys) is unlikely to make it harder to locate that right
sibling's first item (to get a full heap TID), which could have moved
a lot further right after the split, or even have been removed
entirely. It could now be ambiguous where it wouldn't have been before
in the event of duplicates, but it's unlikely. And when it does
happen, it's unlikely to matter.
We can still include the heap block number, I suppose. I think of the
highkey as only having one simple job -- separating the keyspace
between siblings. We actually have a very neat choke point to check
that it does that one job -- when a high key is generated for a page
split at the leaf level. If we were doing generic suffix truncation,
we'd add a test that made sure that the high key was strictly greater
than the last item on the left, and strictly less than the first item
on the right. As I said yesterday, I don't like how we allow a highkey
to be equal to both sides of the split, which goes against L&Y, and I
think that we would at least be strict about < and > for suffix
truncation.
The highkey's actual value can be invented, provided it does this one
simple job, which needs to be assessed only once at our "neat choke
point". Everything else falls into place afterwards, since that's
where the downlink actually comes from. You can check it during a leaf
page split while debugging (that's the neat choke point). That's why
the high key doesn't seem very interesting from a debuggability
perspective.
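For example, with generic suffix truncation the check at that choke
point might look like this (a sketch only; _bt_mkscankey() builds an
insertion scankey from the new high key, and the variable names are
approximate):

ScanKey		hikey_skey = _bt_mkscankey(rel, newhikey);

/* new high key must be strictly greater than the last item on the left... */
Assert(_bt_compare(rel, nkeyatts, hikey_skey, leftpage, lastleftoff) > 0);
/* ...and strictly less than the first item on the right */
Assert(_bt_compare(rel, nkeyatts, hikey_skey, rightpage,
				   P_FIRSTDATAKEY(ropaque)) < 0);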
>> Nobody asked you to write a suffix truncation patch. That has
>> complexity above and beyond what the covering index patch needs. I
>> just expect it to be compatible with an eventual suffix truncation
>> patch, which you've now shown is quite possible. It is clearly a
>> complimentary technique.
>
>
> OK, but change of on-disk tuple format also changes what people see in
> pageinspect. Right now, they see "1" as offset for tuples in internal
> pages and hikeys. After patch, they would see some large values
> (assuming we set some of the high bits) in offset. I'm not sure it's OK.
> We probably should change display of index tuples in pageinspect.
This reminds me of a discussion I had with Robert Haas about
pageinspect + t_infomask bits. Robert thought that we should show the
composite bits as single constants, where we do that (with things like
HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
show "the bits that are on the page", while also documenting that this
situation exists in pageinspect directly.
I think something similar applies here. I think it's okay to just show
the offset, provided it is documented. We have a number of odd things
within nbtree that I made sure were documented, such as the "minus
infinity" item on internal pages, which looks odd and out of place. I
remember Tatsuo Ishii asked about it before this happened. It seems
helpful to show what's really there, and offer guidance on how to
interpret it. I actually thought carefully about many things like this
for pg_hexedit, which tries to be very consistent and logical, uses
color to suggest meaning, and so on.
Anyway, that's what I think about it, though I wouldn't really care if
I lost that particular argument and we did something special with
internal page offset in pageinspect. It seems like a matter of
opinion, or aesthetics.
> I'd like to note that I really appreciate your attention to this patch
> as well as other patches.
Thanks. I would like to thank Anastasia and you for your patience and
perseverance, despite what I see as mistakes in how this project was
managed. I really want it to be possible for there to be more
patches in the nbtree code, because they're really needed. That was a
big part of my motivation for writing amcheck, in fact. It's tedious
to link this patch to a bigger picture about what we need to do with
nbtree in the next 5 years, but I think that that's what it will take
to get this patch in. That's my opinion.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Mon, Apr 2, 2018 at 4:27 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I thought about that another time, and I decided that it would be safer
> to use the 13th bit in the index tuple flags. There is already an attempt
> to use the whole 6 bytes of the TID for non-heap-pointer information [1].
> Thus, it would be safe to use the 13th bit for indicating an alternative
> offset meaning in pivot tuples, because it wouldn't block further work.
> The revised patchset in the attachment implements it.
This is definitely not the only time someone has talked about this
13th bit -- it's quite coveted. It also came up with UPSERT, and with
WARM. That's just the cases that I can personally remember.
I'm glad that you found a way to make this work, that will keep things
flexible for future patches, and make testing easier. I think that we
can find a flexible representation that makes almost everyone happy.
> I don't know. We still need an offset number to check the expected
> number of attributes. Passing the index tuple as a separate attribute
> would be redundant, and would open the door to extra possible errors.
You're right. I must have been tired when I wrote that. :-)
>> Do you store an attribute number in the "minus infinity" item (the
>> leftmost one of internal pages)? I guess that that should be zero,
>> because it's totally truncated.
>
> Yes, I store zero as the number of attributes in the "minus infinity"
> item. See this part of the patch.
> However, note that I have to store (number_of_attributes + 1) in the
> offset in order to correctly store zero attributes. Otherwise, an
> assertion is raised in the ItemPointerIsValid() macro.
Makes sense.
> Yes. But that depends on how the difficulty of adapting the patch to the
> big picture correlates with the difficulty that a non-adapted patch
> creates for that big picture. My point was that the second difficulty
> isn't high. If we can be satisfied with the implementation in the
> attached patchset (probably some small enhancements are still required),
> then the first difficulty isn't high either.
I think it's possible.
I didn't have time to look at this properly today, but I will try to
do so tomorrow.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 3, 2018 at 7:02 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Great, I'm looking forward your feedback.
I took a look at V11 (0001-Covering-core-v11.patch,
0002-Covering-btree-v11.patch, 0003-Covering-amcheck-v11.patch,
0004-Covering-natts-v11.patch) today.
* What's a pivot tuple?
This is the same thing as what I call a "separator key", I think --
you're talking about the set of IndexTuples including all high keys
(including leaf level high keys), as well as internal items
(downlinks). I think that it's a good idea to have a standard word
that describes this set of keys, to formalize the two categories
(pivot tuples vs. tuples that point to the heap itself). Your word is
just as good as mine, so we can go with that.
Let's put this somewhere central. Maybe in the nbtree README, and/or
nbtree.h. Also, verify_nbtree.c should probably get some small
explanation of pivot tuples. offset_is_negative_infinity() is a nice
place to mention pivot tuples, since that already has a bit of
high-level commentary about them.
* Compiler warning:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:
In function ‘index_create’:
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:476:45:
warning: ‘opclassTup’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
if (keyType == ANYELEMENTOID && opclassTup->opcintype == ANYARRAYOID)
^
/home/pg/postgresql/root/build/../source/src/backend/catalog/index.c:332:19:
note: ‘opclassTup’ was declared here
Form_pg_opclass opclassTup;
^
* Your new amcheck tests should definitely use the new
"heapallindexed" option. There were a number of bugs I can remember
seeing in earlier versions of this patch that that would catch
(probably not during regression tests, but let's at least do that
much).
* The modified amcheck contrib regression tests don't actually pass. I
see these unexpected errors:
10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
10037/2018-04-03 16:31:12 PDT ERROR: wrong number of index tuple
attributes for index "bttest_multi_idx"
10037/2018-04-03 16:31:12 PDT DETAIL: Index tid=(290,2) points to
index tid=(289,2) page lsn=0/162407A8.
* I see that we use "- 1" with attribute number, like this:
> +/* Get number of attributes in B-tree index tuple */
> +#define BtreeTupGetNAtts(itup, index) \
> + ( \
> + (itup)->t_info & INDEX_ALT_TID_MASK ? \
> + ( \
> + AssertMacro((ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
> + ItemPointerGetOffsetNumber(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK - 1 \
> + ) \
> + : \
> + IndexRelationGetNumberOfAttributes(index) \
> + )
Is this left behind from before you decided to adopt
INDEX_ALT_TID_MASK? Is it your intention here to encode
InvalidOffsetNumber() without tripping up assertions? Or is it
something else?
Maybe we should follow the example of GinItemPointerGetOffsetNumber(),
and use ItemPointerGetOffsetNumberNoCheck() instead of
ItemPointerGetOffsetNumber(). What do you think? That would allow us
to get rid of the -1 thing, which might be nice. Just because we use
ItemPointerGetOffsetNumberNoCheck() in places that use an alternative
offset representation does not mean we need to use it in existing
places. If existing places had a regression tests failure because of
this, that would probably be due to a real bug. No?
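In other words, the macro might collapse to something like this sketch
(same mask names as the quoted patch; the point is just that
ItemPointerGetOffsetNumberNoCheck() tolerates a zero offset, so the
"+1" adjustment goes away):

/* Sketch: number of attributes stored directly in the offset field */
#define BtreeTupGetNAtts(itup, index) \
	( \
		((itup)->t_info & INDEX_ALT_TID_MASK) ? \
		(ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK) \
		: \
		IndexRelationGetNumberOfAttributes(index) \
	)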
* ISTM that the "points to index tid=(289,2)" part of the message just
shown would be a bit clearer if I didn't have to know that 2 actually
means 1 when we talk about the pointed-to offset (yeah, it will
probably become unclear in the future when we start using the reserved
offset status bits, but why not represent the low bits of the offset in
a simple/logical way?). Your new amcheck error message should spell it
out (it should say the number of attributes indicated by the offset,
if any) -- regardless of what we do about the "must apply - 1 to
offset" question.
* "Minus infinity" items do not have the new status bit
INDEX_ALT_TID_MASK set in at least some cases. They should.
* _bt_sortaddtup() should not do "trunctuple.t_info =
sizeof(IndexTupleData)", since that destroys useful information. Maybe
that's the reason for the last bug?
* Ditto for _bt_pgaddtup().
* Why expose _bt_pgaddtup() so that nbtsort.c/_bt_buildadd() can call
it? The only reason we have _bt_sortaddtup() is because we cannot
trust P_RIGHTMOST() within _bt_pgaddtup() when called in the context
of CREATE INDEX (from nbtsort.c/_bt_buildadd()). There is no real
change needed, because _bt_sortaddtup() knows that it's inserting on a
non-rightmost page both without this patch, and when this patch needs
to truncate and then add the high key back.
It's clear that you can just use _bt_sortaddtup() (and leave
_bt_pgaddtup() private) because _bt_sortaddtup() is only different to
_bt_pgaddtup() when !P_ISLEAF(), but we only call _bt_pgaddtup() when
P_ISLEAF(). Or have I missed something?
* For inserts, this patch performs an extra truncation step on the
same high key that we'd use with a plain (non-covering/include) index.
That's pretty clean. But it seems more complicated for
nbtsort.c/_bt_buildadd(). I think that a comment should say that we
cannot just rearrange item pointers for high key on the old page when
we also truncate, because overwriting the P_HIKEY position ItemId with
the old page's former final ItemId (whose tuple ended up becoming the
first tuple on new/right page) fails to actually save any space. We
need to truly shift around IndexTuples on the page in order to save
space (both PageIndexTupleDelete() and PageAddItem() end up shifting
both the ItemId array and some IndexTuple space).
Also, maybe say that the performance here really isn't so bad, because
we reclaim IndexTuple space close to the middle of the hole in the
page with our PageIndexTupleDelete(), and then use almost the *same*
space within PageAddItem(). There is not actually that much physical
shifting around for IndexTuples. It turns out that it's not that
different. (You can probably find a better, more succinct way of
putting this -- I'm tired now.)
* I suggest that you teach _bt_check_natts() to expect zero attributes
for "minus infinity" items. It looks like amcheck contrib regression
tests don't pass because you don't look for that (P_FIRSTDATAKEY() is
the "minus infinity" item on internal pages).
* bt_target_page_check() should also have a !P_ISLEAF() check, since
with a covering index every tuple will have INDEX_ALT_TID_MASK. This
should call _bt_check_natts() for each item, including the "minus
infinity" items.
* "minus infinity" items don't have the right number of attributes
set, in at least some cases that I saw. The number matched other
internal items, and wasn't 0 or whatever. Maybe the
ItemPointerGetOffsetNumberNoCheck() idea would leave things so that it
actually could be 0 safely, rather than natts + 1 as you said, which
would be nice.
* I would reorder the comment to match the order of the code:
> + /*
> + * Pivot tuples stored in non-leaf pages and hikeys of leaf pages should
> + * have nkeyatts number of attributes. While regular tuples of leaf pages
> + * should have natts number of attributes.
> + */
> + if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
> + return (BtreeTupGetNAtts(itup, index) == natts);
> + else
> + return (BtreeTupGetNAtts(itup, index) == nkeyatts);
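That is, something like this (a sketch of the same code with the
comment reordered to match):

/*
 * Regular tuples stored on leaf pages should have natts number of
 * attributes, while pivot tuples (tuples of non-leaf pages and high
 * keys of leaf pages) should have nkeyatts number of attributes.
 */
if (P_ISLEAF(opaque) && offnum >= P_FIRSTDATAKEY(opaque))
	return (BtreeTupGetNAtts(itup, index) == natts);
else
	return (BtreeTupGetNAtts(itup, index) == nkeyatts);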
* Please add BT_N_KEYS_OFFSET_MASK + INDEX_MAX_KEYS static assertion.
Maybe add it to _bt_check_natts().
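Something along these lines (sketch; the exact wording of the message
doesn't matter much):

/* Masked offset bits must be able to represent any legal attribute count */
StaticAssertStmt(BT_N_KEYS_OFFSET_MASK >= INDEX_MAX_KEYS,
				 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");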
* README-SSI says:
* The effects of page splits, overflows, consolidations, and
removals must be carefully reviewed to ensure that predicate locks
aren't "lost" during those operations, or kept with pages which could
get re-used for different parts of the index.
Do we need to worry about that here? I guess not, because this is just
like having many duplicates. But a note just above the _bt_doinsert()
call to CheckForSerializableConflictIn() might be a good idea.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On 2018-04-05 00:09, Alexander Korotkov wrote: > Hi! > > Thank you for review! Revised patchset is attached. > [0001-Covering-core-v12.patch] > [0002-Covering-btree-v12.patch] > [0003-Covering-amcheck-v12.patch] > [0004-Covering-natts-v12.patch] Really nice performance gains. I read through the docs and made some changes. I hope it can count as improvement. It would probably also be a good idea to add the term "covering index" somewhere, at least in the documentation's index; the term does now not occur anywhere. (This doc-patch does not add it) thanks, Erik Rijkers
Attachment
On 2018-04-05 00:09, Alexander Korotkov wrote:
> Thank you for review! Revised patchset is attached.
> [0001-Covering-core-v12.patch]
> [0002-Covering-btree-v12.patch]
> [0003-Covering-amcheck-v12.patch]
> [0004-Covering-natts-v12.patch]
Really nice performance gains.
I read through the docs and made some changes. I hope it counts as an improvement.
It would probably also be a good idea to add the term "covering index" somewhere, at least in the documentation's index; the term does not currently occur anywhere. (This doc-patch does not add it.)
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for review! Revised patchset is attached.
Cool.
* btree_xlog_split() still has this code:
/*
* On leaf level, the high key of the left page is equal to the first key
* on the right page.
*/
if (isleaf)
{
ItemId hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));
left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
left_hikeysz = ItemIdGetLength(hiItemId);
}
However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():
> -	/* Log left page */
> -	if (!isleaf)
> -	{
> -		/*
> -		 * We must also log the left page's high key, because the right
> -		 * page's leftmost key is suppressed on non-leaf levels. Show it
> -		 * as belonging to the left page buffer, so that it is not stored
> -		 * if XLogInsert decides it needs a full-page image of the left
> -		 * page.
> -		 */
> -		itemid = PageGetItemId(origpage, P_HIKEY);
> -		item = (IndexTuple) PageGetItem(origpage, itemid);
> -		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
> -	}
> +	/*
> +	 * We must also log the left page's high key. There are two reasons
> +	 * for that: right page's leftmost key is suppressed on non-leaf levels,
> +	 * in covering indexes, included columns are truncated from high keys.
> +	 * For simplicity, we don't distinguish these cases, but log the high
> +	 * key every time. Show it as belonging to the left page buffer, so
> +	 * that it is not stored if XLogInsert decides it needs a full-page
> +	 * image of the left page.
> +	 */
> +	itemid = PageGetItemId(origpage, P_HIKEY);
> +	item = (IndexTuple) PageGetItem(origpage, itemid);
> +	XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
So should we remove the first block of code?
Note also that this
existing comment has been made obsolete:
/* don't release the buffer yet; we touch right page's first item below */
/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)
Maybe we *should* release the right sibling buffer at the point of the
comment now?
* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.
I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().
This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.
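For example, the assertion could be as simple as this sketch (it has to
accept both tuple shapes, since _bt_mkscankey() is also called for
pivot tuples):

/* Sketch of an assertion at the top of _bt_mkscankey() */
Assert(BtreeTupGetNAtts(itup, rel) ==
	   IndexRelationGetNumberOfAttributes(rel) ||
	   BtreeTupGetNAtts(itup, rel) ==
	   IndexRelationGetNumberOfKeyAttributes(rel));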
* _bt_isequal() should assert that the IndexTuple was not truncated.
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
> if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
> return 1;
>
> + /*
> + * Check tuple has correct number of attributes.
> + */
> + if (unlikely(!_bt_check_natts(rel, page, offnum)))
> + {
> + ereport(ERROR,
> + (errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("tuple has wrong number of attributes in index \"%s\"",
> + RelationGetRelationName(rel))));
> + }
In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parentheses here.
* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.
I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.
Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.
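One minimal way to spell the strong assertion (sketch): truncation is
what marks a tuple as a pivot tuple, so a tuple that is passed in
should not be marked yet.

/* Sketch: forbid truncating an already-truncated (pivot) tuple */
Assert(((itup)->t_info & INDEX_ALT_TID_MASK) == 0);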
* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".
We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.
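Concretely, the suggestion amounts to something like this (sketch):

/* itup.h */
#define INDEX_AM_RESERVED_BIT	0x2000	/* reserved for index-AM specific usage */

/* nbtree.h */
#define INDEX_ALT_TID_MASK		INDEX_AM_RESERVED_BIT

/* hash.h */
#define INDEX_MOVED_BY_SPLIT_MASK	INDEX_AM_RESERVED_BIT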
* We should "use" one of the 4 new status bit that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.
I guess you could say that I want us to reserve one of our 4 reserve bits.
* I think that you could add to this:
> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
> scanned to decide whether to return the entry and whether the scan can
> stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation. Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.
I like what you came up with, and where you put it, but I would add
another few sentences: "Note that pivot index tuples are only used to
represent which part of the key space belongs on each page, and can
have attribute values copied from non-pivot tuples that were deleted
and killed by VACUUM some time ago. In principle, we could truncate
away attributes that are not needed for a page high key during a leaf
page split, provided that the remaining attributes distinguish the
last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as
belonging on the right page. This optimization is sometimes called
suffix truncation, and may appear in a future release. Since the high
key is subsequently reused as the downlink in the parent page for the
new right page, suffix truncation can increase index fan-out
considerably by keeping pivot tuples short. INCLUDE indexes similarly
truncate away non-key attributes at the time of a leaf page split,
increasing fan-out."
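(As an aside, for anyone following along: the truncation step itself
can be built from existing tuple utilities. The following is a hedged
sketch of the idea, not the patch's actual _bt_truncate_tuple() code;
the function name is hypothetical.)

#include "access/itup.h"

/* Sketch: form a pivot tuple that keeps only the leading key attributes */
static IndexTuple
hypo_truncate_tuple(TupleDesc itupdesc, IndexTuple source, int nkeyatts)
{
	TupleDesc	truncdesc;
	Datum		values[INDEX_MAX_KEYS];
	bool		isnull[INDEX_MAX_KEYS];
	IndexTuple	truncated;

	/* extract all attribute values from the source tuple */
	index_deform_tuple(source, itupdesc, values, isnull);

	/* scribble on a copy of the descriptor so it covers key columns only */
	truncdesc = CreateTupleDescCopy(itupdesc);
	truncdesc->natts = nkeyatts;

	/* re-form the tuple with just the leading key attributes */
	truncated = index_form_tuple(truncdesc, values, isnull);
	truncated->t_tid = source->t_tid;	/* preserve the item pointer */

	return truncated;
}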
> Good point. Tests with "heapallindexed" were added. I also find that it's
> useful to
> check both index built by sorting, and index built by insertions, because
> there are
> different ways of forming tuples.
Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.
* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.
> Ok. I've tried to remove both assertions and "+1" hack. That works
> for me. However, I had to touch a lot of places; not sure if that's a
> problem.
Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.
* You missed this comment, which is now not accurate:
> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page). And we have assertion that offset number is greater or equal
> + * to 1. This is why we store (number_of_attributes + 1) in offset number.
> + */
I can see that it is actually 0 for a minus infinity item, which is good.
> I wrote some comment there. Please, check it.
The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.
> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages. I think that is how the code now works, unless
> I'm missing something.
It does, and should. Thanks.
> Thanks for pointing. Since there are now three cases including handling of
> "minus infinity" items, comment is now split to three.
That looks good. Thanks.
Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.
> I don't see a relation between this patchset and SSI. We just
> change the representation of some index tuples in pages. However,
> we didn't change the order of page modification, the order
> of page lookup, and so on. Yes, we change the size of some tuples,
> but B-tree already worked with tuples of variable sizes. So, the fact
> that tuples now have different sizes shouldn't affect SSI. Right now,
> I'm not sure if CheckForSerializableConflictIn() just above the
> _bt_doinsert() is a good idea. But even if so, I think that should be
> the subject of a separate patch.
My point was that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: >> * btree_xlog_split() still has this code: > Right, I think there is absolutely no need in this code. It's removed in > the attached patchset. I'm now a bit nervous about always logging the high key, since that could impact performance. I think that there is a good way to only do it when needed. New plan: 1. Add these new fields to split record's set of xl_info fields (it should be placed directly after XLOG_BTREE_SPLIT_R): #define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */ #define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */ 2. Within _bt_split(), restore the old "leaf vs. internal" logic, so that the high key is only logged for internal (!isleaf) pages. However, only log it when needed for leaf pages -- only when the new highkey was *actually* truncated (or when its an internal page), since only then will it actually be different to the first item on the right page. Also, set XLOG_BTREE_SPLIT_L_HIGHKEY instead of XLOG_BTREE_SPLIT_L when we must log (or set XLOG_BTREE_SPLIT_R_HIGHKEY instead of XLOG_BTREE_SPLIT_R), so that recovery actually knows that it should restore the truncated highkey. (Sometimes I think it would be nice to be able to do more during recovery, but that's a much bigger issue.) 3. Restore all the master code within btree_xlog_split(), except instead of restoring the high key when !isleaf, do so when the record is XLOG_BTREE_SPLIT_L_HIGHKEY|XLOG_BTREE_SPLIT_R_HIGHKEY. 4. Add an assertion within btree_xlog_split(), that ensures that internal pages never fail to have their high key logged, since there is no reason why that should ever not happen with internal pages. 5. Fix this struct xl_btree_split comment, which commit 0c504a80 from 2017 missed when it reclaimed two xl_info status bits: * Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record. * The _L and _R variants indicate whether the inserted tuple went into the * left or right split page (and thus, whether newitemoff and the new item * are stored or not). The _ROOT variants indicate that we are splitting * the root page, and thus that a newroot record rather than an insert or * split record should follow. Note that a split record never carries a * metapage update --- we'll do that in the parent-level update. 6. Add your own xl_btree_split comment in its place, noting the new usage. Basically, the _ROOT sentence with a similar _HIGHKEY sentence. 7. Don't forget about btree_desc(). I'd say that there is a good change that Anastasia is correct to think that it isn't worth worrying about the extra WAL that her approach implied, and that it is in fact good enough to simply always log the left page's high key. However, it seems easier and lower risk all around to do it this way. It doesn't leave us with ambiguity. In my experience, *ambiguity* on design questions makes a patch miss a release much more frequently than bugs or regressions make that happen. Sorry that I didn't just say this the first time I brought up btree_xlog_split(). I didn't see the opportunity to avoid creating more WAL until now. > OK, I've added asserting that number of tuple attributes shoud be > either natts or nkeyatts, because we call _bt_mkscankey() for > pivot index tuples too. Makes sense. 
> If you're talking about these assertions > > Assert(IndexRelationGetNumberOfAttributes(rel) != 0); > indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); > Assert(indnkeyatts != 0); Actually, I was just talking about the first one, "Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear. Maybe it isn't worth getting rid of even the first one. > then I would rather leave both them. If we know that index tuple > length is either natts or nkeyatts, that doesn't make you sure, that > both natts and nkeyatts are non-zero. I suppose. > I've done so. Tests are passing for me. Great. I'm glad that worked out. One simple, broad rule. > Hmm, we have four bit reserved. But I'm not sure whether we would use > *all* of them for non-pivot tuples. Probably we would use some of them for > pivot tuples. I don't know that in advance. Thus, I propose to don't > rename this. But I've added comment that non-pivot tuples might also > use those bits. Okay. Good enough. > Sorry, I didn't get which particular further use of reserved bits do you > mean? > Did you mean key normalization? I was being unclear. I was just reiterating my point about having a non-pivot bit. It doesn't matter, though. > Thank you for writing that explanation. Looks good. I think that once you realize how INCLUDE indexes don't change pivot tuples, and actually understand what pivot tuples are, the patch seems a lot less scary. > This patchset also incorporates docs enhacements by Erik Rijkers and > sentence which states that indexes with included colums are also called > "covering indexes". Cool. * Use <quote><quote/> here: > + <para> > + Indexes with columns listed in the <literal>INCLUDE</literal> clause > + are also called "covering indexes". > + </para> * Use <literal><literal/> here: > + <para> > + In <literal>UNIQUE</literal> indexes, uniqueness is only enforced > + for key columns. Columns listed in the <literal>INCLUDE</literal> > + clause have no effect on uniqueness enforcement. Other constraints > + (PRIMARY KEY and EXCLUDE) work the same way. > + </para> * Do the regression tests pass with COPY_PARSE_PLAN_TREES? * Running pgindent would be nice. I see a bit of trailing whitespace, and things like that. * Please tweak the indentation here (perhaps a new line): > @@ -927,6 +963,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup) > last_off = P_FIRSTKEY; > } > > + pageop = (BTPageOpaque) PageGetSpecialPointer(npage); > /* * Does the optimizer README's PathKeys section need a sentence or two on this patch? I'm nervous about problems within the optimizer in general, since that is an area that I am not particularly qualified to review. I hope that someone with more experience in that area can take a look at it specifically. I see that there are very few changes in the optimizer, but in my experience that's often the problem when it comes to the optimizer -- it lacks subtle things that it actually needs, rather than having the wrong things. * Does this existing build_index_pathkeys() comment need to be updated? * The result is canonical, meaning that redundant pathkeys are removed; * it may therefore have fewer entries than there are index columns. * * Another reason for stopping early is that we may be able to tell that * an index column's sort order is uninteresting for this query. However, * that test is just based on the existence of an EquivalenceClass and not * on position in pathkey lists, so it's not complete. 
Caller should call * truncate_useless_pathkeys() to possibly remove more pathkeys. * I don't think that there is much point in having separate 0003 + 0004 patches. For the next revision, please squash those down into 0002. Actually, maybe there should be only one patch for the next revision. Up to you. * Please write commit messages for your patches. I like to make these part of the review process. That's all for now. -- Peter Geoghegan
On Thu, Apr 5, 2018 at 7:59 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> * btree_xlog_split() still has this code:
> Right, I think there is absolutely no need in this code. It's removed in
> the attached patchset.
I'm now a bit nervous about always logging the high key, since that
could impact performance. I think that there is a good way to only do
it when needed. New plan:
1. Add these new fields to split record's set of xl_info fields (it
should be placed directly after XLOG_BTREE_SPLIT_R):
#define XLOG_BTREE_SPLIT_L_HIGHKEY 0x50 /* as above, include truncated highkey */
#define XLOG_BTREE_SPLIT_R_HIGHKEY 0x60 /* as above, include truncated highkey */
2. Within _bt_split(), restore the old "leaf vs. internal" logic, so
that the high key is always logged for internal (!isleaf) pages, but
only logged for leaf pages when it's actually needed -- when the new
highkey was *actually* truncated, since only then will it be different
from the first item on the right page. Also, set
XLOG_BTREE_SPLIT_L_HIGHKEY instead of XLOG_BTREE_SPLIT_L when we must
log (or set XLOG_BTREE_SPLIT_R_HIGHKEY instead of XLOG_BTREE_SPLIT_R),
so that recovery actually knows that it should restore the truncated
highkey. (A sketch of steps 2 and 3 follows this list.)
(Sometimes I think it would be nice to be able to do more during
recovery, but that's a much bigger issue.)
3. Restore all the master code within btree_xlog_split(), except
instead of restoring the high key when !isleaf, do so when the record
is XLOG_BTREE_SPLIT_L_HIGHKEY|XLOG_BTREE_SPLIT_R_HIGHKEY.
4. Add an assertion within btree_xlog_split(), that ensures that
internal pages never fail to have their high key logged, since there
is no reason why that should ever not happen with internal pages.
5. Fix this struct xl_btree_split comment, which commit 0c504a80 from
2017 missed when it reclaimed two xl_info status bits:
* Note: the four XLOG_BTREE_SPLIT xl_info codes all use this data record.
* The _L and _R variants indicate whether the inserted tuple went into the
* left or right split page (and thus, whether newitemoff and the new item
* are stored or not). The _ROOT variants indicate that we are splitting
* the root page, and thus that a newroot record rather than an insert or
* split record should follow. Note that a split record never carries a
* metapage update --- we'll do that in the parent-level update.
6. Add your own xl_btree_split comment in its place, noting the new
usage. Basically, the _ROOT sentence with a similar _HIGHKEY sentence.
7. Don't forget about btree_desc().
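Here is the sketch promised in step 2, as it might look in _bt_split()
(variable names like newhighkey_truncated and loglhikey are
hypothetical):

/* Sketch for _bt_split(): only log the left page's high key when
 * recovery cannot reconstruct it from the right page's first item */
bool		loglhikey = false;

if (!isleaf || newhighkey_truncated)
{
	itemid = PageGetItemId(origpage, P_HIKEY);
	item = (IndexTuple) PageGetItem(origpage, itemid);
	XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
	loglhikey = true;
}

xlinfo = newitemonleft ?
	(loglhikey ? XLOG_BTREE_SPLIT_L_HIGHKEY : XLOG_BTREE_SPLIT_L) :
	(loglhikey ? XLOG_BTREE_SPLIT_R_HIGHKEY : XLOG_BTREE_SPLIT_R);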
I'd say that there is a good chance that Anastasia is correct to think
that it isn't worth worrying about the extra WAL that her approach
implied, and that it is in fact good enough to simply always log the
left page's high key. However, it seems easier and lower risk all
around to do it this way. It doesn't leave us with ambiguity. In my
experience, *ambiguity* on design questions makes a patch miss a
release much more frequently than bugs or regressions make that
happen.
Sorry that I didn't just say this the first time I brought up
btree_xlog_split(). I didn't see the opportunity to avoid creating
more WAL until now.
> OK, I've added an assertion that the number of tuple attributes should
> be either natts or nkeyatts, because we call _bt_mkscankey() for
> pivot index tuples too.
Makes sense.
> If you're talking about these assertions
>
> Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
> indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
> Assert(indnkeyatts != 0);
Actually, I was just talking about the first one,
"Assert(IndexRelationGetNumberOfAttributes(rel) != 0)". I was unclear.
Maybe it isn't worth getting rid of even the first one.
> then I would rather leave both of them. If we know that the index tuple
> length is either natts or nkeyatts, that doesn't make you sure that
> both natts and nkeyatts are non-zero.
I suppose.
> I've done so. Tests are passing for me.
Great. I'm glad that worked out. One simple, broad rule.
> Hmm, we have four bits reserved. But I'm not sure whether we would use
> *all* of them for non-pivot tuples. Probably we would use some of them
> for pivot tuples. I don't know that in advance. Thus, I propose not to
> rename this. But I've added a comment that non-pivot tuples might also
> use those bits.
Okay. Good enough.
> Sorry, I didn't get which particular further use of reserved bits you
> mean? Did you mean key normalization?
I was being unclear. I was just reiterating my point about having a
non-pivot bit. It doesn't matter, though.
> Thank you for writing that explanation.
Looks good. I think that once you realize how INCLUDE indexes don't
change pivot tuples, and actually understand what pivot tuples are,
the patch seems a lot less scary.
> This patchset also incorporates docs enhancements by Erik Rijkers and
> a sentence which states that indexes with included columns are also
> called "covering indexes".
Cool.
* Use <quote></quote> here:
> + <para>
> + Indexes with columns listed in the <literal>INCLUDE</literal> clause
> + are also called "covering indexes".
> + </para>
* Use <literal></literal> here:
> + <para>
> + In <literal>UNIQUE</literal> indexes, uniqueness is only enforced
> + for key columns. Columns listed in the <literal>INCLUDE</literal>
> + clause have no effect on uniqueness enforcement. Other constraints
> + (PRIMARY KEY and EXCLUDE) work the same way.
> + </para>
* Do the regression tests pass with COPY_PARSE_PLAN_TREES?
* Running pgindent would be nice. I see a bit of trailing whitespace,
and things like that.
* Please tweak the indentation here (perhaps a new line):
> @@ -927,6 +963,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
> last_off = P_FIRSTKEY;
> }
>
> + pageop = (BTPageOpaque) PageGetSpecialPointer(npage);
> /*
* Does the optimizer README's PathKeys section need a sentence or two
on this patch?
I'm nervous about problems within the optimizer in general, since that
is an area that I am not particularly qualified to review. I hope that
someone with more experience in that area can take a look at it
specifically. I see that there are very few changes in the optimizer,
but in my experience that's often the problem when it comes to the
optimizer -- it lacks subtle things that it actually needs, rather
than having the wrong things.
* Does this existing build_index_pathkeys() comment need to be updated?
* The result is canonical, meaning that redundant pathkeys are removed;
* it may therefore have fewer entries than there are index columns.
*
* Another reason for stopping early is that we may be able to tell that
* an index column's sort order is uninteresting for this query. However,
* that test is just based on the existence of an EquivalenceClass and not
* on position in pathkey lists, so it's not complete. Caller should call
* truncate_useless_pathkeys() to possibly remove more pathkeys.
* I don't think that there is much point in having separate 0003 +
0004 patches. For the next revision, please squash those down into
0002. Actually, maybe there should be only one patch for the next
revision. Up to you.
* Please write commit messages for your patches. I like to make these
part of the review process.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
As far as I can see, there is no on-disk representation difference for
*existing* indexes. So pg_upgrade is not needed here, and there isn't any
new code to support "on-the-fly" modification. Am I right?
--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, Apr 6, 2018 at 10:20 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> As far as I can see, there is no on-disk representation difference for
> *existing* indexes. So pg_upgrade is not needed here, and there isn't any
> new code to support "on-the-fly" modification. Am I right?
Yes.
I'm going to look at this again today, and will post something within
12 hours. Please hold off on committing until then.
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Fri, Apr 6, 2018 at 10:33 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thinking about that again, I found that we should relax our requirements
> for "minus infinity" items, because pg_upgraded indexes don't have any
> special bits set for those items.
>
> What do you think about applying following patch on the top of v14?
It's clearly necessary. Looks fine to me.
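For the record, the relaxation boils down to something like this sketch
within _bt_check_natts() (hypothetical shape; pg_upgrade'd indexes
simply have nothing to check for their "minus infinity" items):

/* Sketch: "minus infinity" items on internal pages */
if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
{
	/* pg_upgrade'd index: no alternative TID, so nothing to check */
	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
		return true;
	return BtreeTupGetNAtts(itup, index) == 0;
}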
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment
On Fri, Apr 6, 2018 at 11:08 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: > OK, incorporated into v15. I've also added sentence about pg_upgrade > to the commit message. I will summarize my feelings on this patch. I endorse committing the patch, because I think that the benefits of committing it now noticeably outweigh the costs. I have various caveats about pushing the patch, but these are manageable. Costs ===== First, there is the question of risks, or costs. I think that this patch has a negligible chance of being problematic in a way that will become memorable. That seems improbable because the patch only really changes the representation of what we're calling "pivot keys" (high keys and internal page downlinks), which is something that VACUUM doesn't care about. I see this patch as a special case of suffix truncation, a technique that has been around since the 1970s. Although you have to look carefully to see it, the amount of extra complexity is pretty small, and the only place where a critical change is made is during leaf page splits. As long as we get that right, everything else should fall into place. There are no risks that I can see that are related to concurrency, or that crop up when doing an anti-wraparound VACUUM. There may be problems, but at least they won't be *pernicious* problems that unravel over a long period of time. The latest amcheck enhancement, and Alexander's recent changes to the patch to make the on-disk representation explicit (not implicit) should change things. We now have the tools to detect any corruption problem that I can think of. For example, if there was some subtle reason why assessing HOT safety broke, then we'd have a way of mechanically detecting that without having to devise a custom test (like the test Pavan happened to be using when the bug fixed by 2aaec654 was originally discovered). The lessons that I applied to designing amcheck were in a few cases from actual experience with real world bugs, including that 2aaec654 bug. I hope that it goes without saying that I've also taken reasonable steps to address all of these risks directly, by auditing code. And, that this remains the first line of defense. Here are the other specific issues that I see with the patch: * It's possible that something was missed in the optimizer. I'm not sure. I share the intuition that very little code is actually needed there, but I'm far from the best person to judge whether or not some subtle detail was missed. * This seems out of date: > + * NOTE: It is not crucial for reliability in present, but maybe > + * it will be that in the future. Now the purpose is just to save > + * more space on inner pages of btree. * CheckIndexCompatible() didn't seem to get the memo about this patch. Maybe just a comment? * It's possible that there are some more bugs in places like relcache.c, or deparsing, or pg_dump, or indexcmds.c; perhaps simple omissions, like the one I just mentioned. If there are, I don't expect them to be particularly serious, or to make me reassess my basic position. But there could be. * I was wrong to suggest _bt_isequal() has an assertion against truncation. It is called for the highkey. Suggest you weaken the assertion, so it only applies when the offset isn't P_HIKEY on non-rightmost page. * Suggest adding a comment above BTStackData, about bts_btentry + offset. * Suggest breaking BTEntrySame() into 3 lines, not 2. 
* This comment needs to be updated:

    /* get high key from left page == lowest key on new right page */

Suggest "get high key from left page == lower bound for new right page".

* This comment needs to be updated:

    13th bit: unused

Suggest "13th bit: AM-defined meaning".

* Suggest adding a note that the use of P_HIKEY here is historical, since it isn't used to match downlinks:

    /*
     * Find the parent buffer and get the parent page.
     *
     * Oops - if we were moved right then we need to change stack item! We
     * want to find parent pointing to where we are, right ? - vadim
     * 05/27/97
     */
    ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
    pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

* I'm slightly concerned that this patch subtly breaks an optimization within _bt_preprocess_keys(), or _bt_checkkeys(). I cannot find any evidence of that, though, and I consider it unlikely, based on the intuition that the simple pathkey changes in the optimizer don't provide the executor with a truly new set of constraints for index scans. Also, if there was a problem here, it would be in the less serious category of problems -- those that can't really affect anyone not using the user-visible feature.

* The docs need some more polishing. I didn't spend very much time on this at all.

Benefits
========

There is also the matter of the benefits of this patch, which I think are considerable, and far greater than they appear. This feature is a great way to begin adding a broad variety of enhancements to nbtree that we really need.

* The patch makes index-only scans a lot more compelling. There are a couple of reasons why it's better to create indexes that index perhaps as many as 4 or 7 columns to target index-only scans in other database systems, and I think that fan-out may be the main one. The disadvantage that we have around HOT safety compared to other systems seems less likely to be the problem when that many columns are involved, and yet this is something that Oracle/SQL Server people do frequently, and Postgres people don't really do at all. This is one thing that suffix truncation improves automatically, but INCLUDE indexes can make that general situation a lot better than truncation alone ever could. If you have an index where most columns are INCLUDE columns, and compare that to an index with the same attributes indexed in the conventional way, then I believe that you will have far fewer problems with index bloat in some important cases. Apart from everything else, this provides us with the opportunity to learn how to mitigate index bloat problems in real-world conditions, even without INCLUDE indexes. We need to get smarter about problems with index bloat.

* Suffix truncation works on the same principle, and is enabled by this work. It's a prerequisite to making nbtree use the classic L&Y approach, which assumes that all items in the index are unique. We could just add the heap TID to pivot tuples today, as an "extra" column, while sorting on TID at the leaf level. This would make TID a first-class part of the key space -- a "unique-ifier", as L&Y intended. But doing so naively would add enormous overhead, which would simply be unacceptable. However, once we have suffix truncation, the overhead is eliminated in virtually all cases. We get to move to the classic L&Y invariant, simplifying the code, and we have a solid basis for adding "retail index tuple deletion", which I believe is almost essential for zheap.
There is a good chance that Postgres B-Trees are the only implementation in the world that doesn't have truly unique keys. The design of nbtree would become a whole lot more elegant if we could restore the classic "Ki < v <= Ki+1" invariant, as Vadim intended over 20 years ago. Somebody has to bite the bullet and start changing the representation of pivot tuples to get these benefits (and many more). This seems like an ideal place to start that process. I think that what we have here addresses concerns from Tom [1], in particular.

The patch has been marked "Ready for Committer". While this patch is primarily the responsibility of the committer, presumably Teodor in this case, I will take some of the responsibility for the patch after commit. Certainly, because I see the patch as strategically important, I am willing to spend quite a lot of time after feature freeze to make sure that it is in good shape. I have a general interest in making sure that amcheck gains acceptance as a way of validating a complicated patch like this one after commit.

[1] https://www.postgresql.org/message-id/15195.1490988897%40sss.pgh.pa.us

--
Peter Geoghegan
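For reference, the amcheck verification alluded to above can be exercised directly from SQL. A minimal sketch, assuming a covering index named 'newidx' (the name is an assumption for illustration):

CREATE EXTENSION IF NOT EXISTS amcheck;

-- Checks invariants within each page of the B-Tree.
SELECT bt_index_check('newidx');

-- Also checks parent/child relationships; takes stronger locks.
SELECT bt_index_parent_check('newidx');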
On 2018-04-06 20:08, Alexander Korotkov wrote:
> [0001-Covering-v15.patch]

After some more testing I notice there is also a down-side/slow-down to this patch that is not so bad, but more than negligible, and I don't think it has been mentioned (though I may have missed something in this thread that's now been running for 1.5 years, not to mention the tangential btree thread(s)).

I attach my test program, which compares master (this morning) with covered_indexes (warning: it takes a while to generate the used tables).

The test tables are created as:

create table $t (c1 int, c2 int, c3 int, c4 int);
insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);

or for HEAD, just:

create unique index ${t}unique_idx on $t using btree (c1, c2);

Here is typical output (edited a bit to prevent email-mangling):

test1: -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 -- 250x
unpatched 6511: 100M rows Execution Time: (normal/normal)   98 %   exec avg: 2.44
patched   6976: 100M rows Execution Time: (covered/normal) 108 %   exec avg: 2.67
test1 patched / unpatched: 109.49 %

test4: -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 and c3 < 20
unpatched 6511: 100M rows Execution Time: (normal/normal)   95 %   exec avg: 1.56
patched   6976: 100M rows Execution Time: (covered/normal)  60 %   exec avg: 0.95
test4 patched / unpatched: 60.83 %

So the main good thing is that 60%, a good improvement -- but that ~109% (a slow-down) is also quite repeatable. (There are more goodies from the patch, like improved insert speed, but I just wanted to draw attention to this particular slow-down too.)

I took all timings from explain analyze versions of the statements, on the assumption that that would be quite comparable to 'normal' querying. (Please let me know if that introduces error.)

# \dti+ nt0___1*
                               List of relations
 Schema | Name                             | Type  | Owner    | Table           | Size
--------+----------------------------------+-------+----------+-----------------+--------
 public | nt0___100000000                  | table | aardvark |                 | 4224 MB
 public | nt0___100000000uniqueinclude_idx | index | aardvark | nt0___100000000 | 3004 MB

(For what it's worth, I'm in favor of getting this patch into v11, although I can't say I followed the technical details too much.)

thanks,

Erik Rijkers
Attachment
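For context, test4 above is the case the INCLUDE index is designed for: the filter on c3 can be evaluated from the index alone. A sketch of how one might confirm the plan shape, using the table from the test script:

EXPLAIN (ANALYZE, BUFFERS)
SELECT c1, c2
FROM nt0___100000000
WHERE c1 < 10000 AND c3 < 20;

-- With the (c1, c2) INCLUDE (c3, c4) index this can run as an Index Only
-- Scan; with only a (c1, c2) unique index, heap fetches are needed to
-- evaluate c3 < 20.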
Thank you!

> create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2)
> include (c3, c4);
>
> or for HEAD, just:
>
> create unique index ${t}unique_idx on $t using btree (c1, c2);

> -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000
> -- explain analyze select c1, c2 from nt0___100000000 where c1 < 10000 and c3 < 20

That's not a fair comparison: the INCLUDE index is twice as big because of the included columns. Try comparing with a covering-emulated index:

create unique index ${t}unique_idx on $t using btree (c1, c2, c3, c4)

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
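A sketch of the three-way comparison being suggested here; the table and index names are hypothetical, and on a patched build all three indexes can coexist on one table (note that the last one enforces uniqueness over all four columns, so it only emulates the covering index's size and shape, not its constraint):

CREATE UNIQUE INDEX t_unique_idx  ON t USING btree (c1, c2);
CREATE UNIQUE INDEX t_include_idx ON t USING btree (c1, c2) INCLUDE (c3, c4);
CREATE UNIQUE INDEX t_wide_idx    ON t USING btree (c1, c2, c3, c4);

-- Compare on-disk sizes of the three variants.
SELECT c.relname, pg_size_pretty(pg_relation_size(c.oid)) AS size
FROM pg_class c
WHERE c.relname IN ('t_unique_idx', 't_include_idx', 't_wide_idx');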
On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers <er@xs4all.nl> wrote:
> After some more testing I notice there is also a down-side/slow-down to
> this patch that is not so bad, but more than negligible [...]
>
> The test tables are created as:
>
> create table $t (c1 int, c2 int, c3 int, c4 int);
> insert into $t (select x, 2*x, 3*x, 4 from generate_series(1, $rowcount) as x);
> create unique index ${t}uniqueinclude_idx on $t using btree (c1, c2) include (c3, c4);
>
> or for HEAD, just:
>
> create unique index ${t}unique_idx on $t using btree (c1, c2);

Do I understand correctly that you compare a unique index on (c1, c2) with master to a unique index on (c1, c2) INCLUDE (c3, c4) with the patched version? If so, then I think it's wrong to speak of a down-side/slow-down of this patch based on that comparison.

The patch *does not* cause a slowdown in this case. The patch provides the user a *new option*, which has its advantages and disadvantages. What you are comparing are the advantages and disadvantages of this option, not a slow-down of the patch.

Only if you compare *the same* index on master and on the patched version is it possible to speak of a slow-down of the patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
> First, there is the question of risks, or costs. I think that this

I hope that's an acceptable risk.

> * It's possible that something was missed in the optimizer. I'm not sure.
>
> I share the intuition that very little code is actually needed there,
> but I'm far from the best person to judge whether or not some subtle
> detail was missed.

Of course it's possible, but some variant of this patch is already used in a production environment and we didn't run into planner issues. Of course there could be some, but if so then they are so deep that I doubt they can be found easily.

> * This seems out of date:
>
>> + * NOTE: It is not crucial for reliability in present, but maybe
>> + * it will be that in the future. Now the purpose is just to save
>> + * more space on inner pages of btree.

Removed.

> * CheckIndexCompatible() didn't seem to get the memo about this patch.
> Maybe just a comment?

Improved.

> * I was wrong to suggest that _bt_isequal() has an assertion against
> truncation. It is called for the high key. Suggest you weaken the
> assertion, so it only applies when the offset isn't P_HIKEY on a
> non-rightmost page.

Fixed.

> * Suggest adding a comment above BTStackData, about bts_btentry + offset.

See below.

> * Suggest breaking BTEntrySame() into 3 lines, not 2.

See below.

> * This comment needs to be updated:
> /* get high key from left page == lowest key on new right page */
> Suggest "get high key from left page == lower bound for new right page".

Fixed.

> * This comment needs to be updated:
> 13th bit: unused
>
> Suggest "13th bit: AM-defined meaning"

Done.

> * Suggest adding a note that the use of P_HIKEY here is historical,
> since it isn't used to match downlinks:
>
> ItemPointerSet(&(stack->bts_btentry.t_tid), bknum, P_HIKEY);
> pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);

On closer look, bts_btentry.ip_posid is not used anymore, so I changed bts_btentry's type to BlockNumber. As a result, BTEntrySame() is removed.

> * The docs need some more polishing. Didn't spend very much time on this at all.

I suppose that should be some native English speaker; definitely not me.

I'm not very happy with the massive usage of ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); I suggest wrapping it in a macro, something like this:

#define BTreeInnerTupleGetDownLink(itup) \
        ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))

It would be nice to add an assertion in this macro checking whether the tuple is an inner tuple, but, as far as I can see, that's impossible - inner and leaf tuples are indistinguishable. So I added the pair BTreeInnerTupleGetDownLink/BTreeInnerTupleSetDownLink everywhere except a few places.

If there isn't a strong objection, I intend to push it this evening.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
On 2018-04-07 14:27, Alexander Korotkov wrote:
> On Sat, Apr 7, 2018 at 2:57 PM, Erik Rijkers <er@xs4all.nl> wrote:
>> [test setup and results snipped]
>
> Do I understand correctly that you compare a unique index on (c1, c2)
> with master to a unique index on (c1, c2) INCLUDE (c3, c4) with the
> patched version? If so, then I think it's wrong to speak of a
> down-side/slow-down of this patch based on that comparison.
> The patch *does not* cause a slowdown in this case. The patch provides
> the user a *new option*, which has its advantages and disadvantages.
> What you are comparing are the advantages and disadvantages of this
> option, not a slow-down of the patch.
> Only if you compare *the same* index on master and on the patched
> version is it possible to speak of a slow-down of the patch.

OK, I take your point -- you are right. Although my measurement was (I think) correct, my comparison was not (as Teodor wrote, not quite 'fair'). Sorry, I should have thought that message through better.

The somewhat longer time is indeed just a disadvantage of this new option, to be balanced against the advantages, which are pretty clear too.

Erik Rijkers
I didn't like rel.h being included in itup.h. Do you really need a Relation as argument to index_truncate_tuple? It looks to me like you could pass the tupledesc instead; indnatts could be passed as a separate argument instead of IndexRelationGetNumberOfAttributes.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> I didn't like rel.h being included in itup.h. Do you really need a
> Relation as argument to index_truncate_tuple? It looks to me like you
> could pass the tupledesc instead; indnatts could be passed as a separate
> argument instead of IndexRelationGetNumberOfAttributes.

Hm, okay, I understand why; will fix per your suggestion.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> I didn't like rel.h being included in itup.h. Do you really need a
> Relation as argument to index_truncate_tuple? It looks to me like you
> could pass the tupledesc instead.

Fixed.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> On closer look, bts_btentry.ip_posid is not used anymore, so I changed
> bts_btentry's type to BlockNumber. As a result, BTEntrySame() is removed.

That seems like a good idea.

> I'm not very happy with the massive usage of
> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); I suggest wrapping it
> in a macro, something like this:
>
> #define BTreeInnerTupleGetDownLink(itup) \
>         ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))

Agreed. We do that with GIN.

--
Peter Geoghegan
Thanks to everyone, pushed.

Peter Geoghegan wrote:
> On Sat, Apr 7, 2018 at 5:48 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> On closer look, bts_btentry.ip_posid is not used anymore, so I changed
>> bts_btentry's type to BlockNumber. As a result, BTEntrySame() is removed.
>
> That seems like a good idea.
>
>> I'm not very happy with the massive usage of
>> ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)); I suggest wrapping it
>> in a macro, something like this:
>>
>> #define BTreeInnerTupleGetDownLink(itup) \
>>         ItemPointerGetBlockNumberNoCheck(&(itup->t_tid))
>
> Agreed. We do that with GIN.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Sat, Apr 7, 2018 at 1:02 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thanks to everyone, pushed.

I'll keep an eye on the buildfarm, since it's late in Russia.

--
Peter Geoghegan
> I'll keep an eye on the buildfarm, since it's late in Russia.

Thank you very much! It's 23:10 MSK now, and I'll be able to follow along for approximately an hour.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:
> Thanks to everyone, pushed.

Marked CF entry as committed.

Greetings,

Andres Freund
>>>>> "Teodor" == Teodor Sigaev <teodor@sigaev.ru> writes:

>> I'll keep an eye on the buildfarm, since it's late in Russia.

Teodor> Thank you very much! It's 23:10 MSK now, and I'll be able to
Teodor> follow along for approximately an hour.

Support for testing amcaninclude via pg_indexam_has_property(oid, 'can_include') seems to be missing?

Also, the return values of pg_index_column_has_property for included columns seem a bit dubious... they should probably be returning NULL for most properties except 'returnable'.

I can look at fixing these for you if you like?

--
Andrew (irc:RhodiumToad)
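A sketch of the kind of probes being described here, using the existing index-property functions ('can_include' is the property under discussion; the index name 'newidx' and the INCLUDE-column number 3 are assumptions, following a (c1, c2) INCLUDE (c3, c4) layout):

-- AM level: does btree advertise INCLUDE support?
SELECT pg_indexam_has_property(a.oid, 'can_include')
FROM pg_am a
WHERE a.amname = 'btree';

-- Column level: an INCLUDE column should be returnable, but little else.
SELECT prop,
       pg_index_column_has_property('newidx'::regclass, 3, prop)
FROM unnest(ARRAY['returnable', 'orderable', 'distance_orderable']) AS prop;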
Thank you! I looked at the buildfarm and completely forgot about the commitfest site.

Andres Freund wrote:
> On 2018-04-07 23:02:08 +0300, Teodor Sigaev wrote:
>> Thanks to everyone, pushed.
>
> Marked CF entry as committed.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Sat, Apr 7, 2018 at 1:52 PM, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote:
> Support for testing amcaninclude via pg_indexam_has_property(oid,
> 'can_include') seems to be missing?
>
> Also, the return values of pg_index_column_has_property for included
> columns seem a bit dubious... they should probably be returning NULL for
> most properties except 'returnable'.
>
> I can look at fixing these for you if you like?

I'm happy to accept your help with it, for one.

--
Peter Geoghegan
> Support for testing amcaninclude via pg_indexam_has_property(oid,
> 'can_include') seems to be missing?
>
> Also, the return values of pg_index_column_has_property for included
> columns seem a bit dubious... they should probably be returning NULL for
> most properties except 'returnable'.

Damn, you're right, it's missing.

> I can look at fixing these for you if you like?

If you do that, I will be very grateful.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
>>>>> "Teodor" == Teodor Sigaev <teodor@sigaev.ru> writes:

>> Support for testing amcaninclude via pg_indexam_has_property(oid,
>> 'can_include') seems to be missing?
>>
>> Also, the return values of pg_index_column_has_property for included
>> columns seem a bit dubious... they should probably be returning NULL
>> for most properties except 'returnable'.

Teodor> Damn, you're right, it's missing.

>> I can look at fixing these for you if you like?

Teodor> If you do that, I will be very grateful.

OK, I will deal with it.

--
Andrew (irc:RhodiumToad)
On Sat, Apr 7, 2018 at 4:02 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thanks to everyone, pushed.

Indeed thanks, this will be a nice feature.

It is giving me a compiler warning on non-cassert builds using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:

indextuple.c: In function 'index_truncate_tuple':
indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
  int indnatts = tupleDescriptor->natts;

Cheers,

Jeff

Jeff Janes wrote:
> It is giving me a compiler warning on non-cassert builds using gcc
> (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609:
>
> indextuple.c: In function 'index_truncate_tuple':
> indextuple.c:462:6: warning: unused variable 'indnatts' [-Wunused-variable]
>   int indnatts = tupleDescriptor->natts;

Thank you, fixed.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, fixed.

I suggest that we remove some unneeded amcheck tests, as in the attached patch. They don't seem to add anything.

--
Peter Geoghegan
Attachment
Thank you, pushed.

Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thank you, fixed.
>
> I suggest that we remove some unneeded amcheck tests, as in the
> attached patch. They don't seem to add anything.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hi,

I tested this feature and found a document shortage in the columns added to the pg_constraint catalog. The attached patch will add the description of the 'conincluding' column to the manual of the pg_constraint catalog.

Regards,

Noriyoshi Shinoda

-----Original Message-----
From: Teodor Sigaev [mailto:teodor@sigaev.ru]
Sent: Monday, April 9, 2018 3:20 PM
To: Peter Geoghegan <pg@bowt.ie>
Cc: Jeff Janes <jeff.janes@gmail.com>; Alexander Korotkov <a.korotkov@postgrespro.ru>; Anastasia Lubennikova <a.lubennikova@postgrespro.ru>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: WIP: Covering + unique indexes.

Thank you, pushed.

Peter Geoghegan wrote:
> On Sun, Apr 8, 2018 at 11:18 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> Thank you, fixed.
>
> I suggest that we remove some unneeded amcheck tests, as in the
> attached patch. They don't seem to add anything.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com> wrote:
> I tested this feature and found a document shortage in the columns added
> to the pg_constraint catalog.
> The attached patch will add the description of the 'conincluding' column
> to the manual of the pg_constraint catalog.

Thank you for pointing this!

I think we need more wordy explanation here. My proposal is attached.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment
Hi!
Thank you for your response.
I think your proposal is good.
Regards,
Noriyoshi Shinoda
From: Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
Sent: Monday, April 9, 2018 11:22 PM
To: Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev <teodor@sigaev.ru>; Peter Geoghegan <pg@bowt.ie>; Jeff Janes <jeff.janes@gmail.com>; Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
Subject: Re: WIP: Covering + unique indexes.
Hi!
On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com> wrote:
I tested this feature and found a document shortage in the columns added to the pg_constraint catalog.
The attached patch will add the description of the 'conincluding' column to the manual of the pg_constraint catalog.
Thank you for pointing this!
I think we need more wordy explanation here. My proposal is attached.
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Thanks to both of you, pushed.

Shinoda, Noriyoshi wrote:
> Hi!
>
> Thank you for your response.
>
> I think your proposal is good.
>
> Regards,
> Noriyoshi Shinoda
>
> *From:* Alexander Korotkov [mailto:a.korotkov@postgrespro.ru]
> *Sent:* Monday, April 9, 2018 11:22 PM
> *To:* Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com>
> *Cc:* PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Teodor Sigaev
> <teodor@sigaev.ru>; Peter Geoghegan <pg@bowt.ie>; Jeff Janes
> <jeff.janes@gmail.com>; Anastasia Lubennikova <a.lubennikova@postgrespro.ru>
> *Subject:* Re: WIP: Covering + unique indexes.
>
> Hi!
>
> On Mon, Apr 9, 2018 at 5:07 PM, Shinoda, Noriyoshi <noriyoshi.shinoda@hpe.com> wrote:
>
>     I tested this feature and found a document shortage in the columns
>     added to the pg_constraint catalog.
>     The attached patch will add the description of the 'conincluding'
>     column to the manual of the pg_constraint catalog.
>
> Thank you for pointing this!
>
> I think we need more wordy explanation here. My proposal is attached.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Sun, Apr 8, 2018 at 11:19 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, pushed.

I noticed a few more issues following another pass-through of the patch:

* There is no pfree() within _bt_buildadd() for truncated tuples, even though that's a context where it's clearly not okay.

* It might be a good idea to also pfree() the truncated tuple for most other _bt_buildadd() callers. Even though it's arguably okay in other cases, it seems worth being consistent about it (consistent with old nbtree code).

* There should probably be some documentation around why it's okay that we call index_truncate_tuple() with an exclusive buffer lock held (during a page split). For example, there should probably be a comment on the VARATT_IS_EXTERNAL() situation.

* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited to inner tuples, which might be worth doing something about (perhaps just renaming the macro).

I do not have the time to write a patch right away, but I should be able to post one in a few days. I want to avoid sending several small patches.

--
Peter Geoghegan
> * There is no pfree() within _bt_buildadd() for truncated tuples, even
> though that's a context where it's clearly not okay.

Agree.

> * It might be a good idea to also pfree() the truncated tuple for most
> other _bt_buildadd() callers. Even though it's arguably okay in other
> cases, it seems worth being consistent about it (consistent with old
> nbtree code).

Seems so; I don't see other pfree() calls afterwards.

> * There should probably be some documentation around why it's okay
> that we call index_truncate_tuple() with an exclusive buffer lock held
> (during a page split). For example, there should probably be a comment
> on the VARATT_IS_EXTERNAL() situation.

I have no objection to improving docs/comments.

> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
> to inner tuples, which might be worth doing something about (perhaps
> just renaming the macro).

Which place looks suspicious in your opinion?

> I do not have the time to write a patch right away, but I should be
> able to post one in a few days. I want to avoid sending several small
> patches.

No problem, we can wait.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, Apr 10, 2018 at 9:03 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
>> to inner tuples, which might be worth doing something about (perhaps
>> just renaming the macro).
>
> Which place looks suspicious in your opinion?

_bt_mark_page_halfdead() looked like it had a problem, but it now looks like I was wrong. I also verified every other BTreeInnerTupleGetDownLink() caller. It now looks like everything is good here.

--
Peter Geoghegan
On Tue, Apr 10, 2018 at 1:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong.

I did find another problem, though. Looks like the idea to explicitly represent the number of attributes directly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() * 100000 from generate_series(0,100000) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"

Note that amcheck can detect the issue with the index after the fact, too:

pg@~[3711]=# select bt_index_check('cov_idx');
ERROR:  wrong number of index tuple attributes for index "cov_idx"
DETAIL:  Index tid=(3,2) natts=2 points to index tid=(2,92) page lsn=0/170DC88.

I don't think that the issue is complicated. Looks like we missed a place where we have to truncate within _bt_split(), located directly after this comment block:

/*
 * If the page we're splitting is not the rightmost page at its level in
 * the tree, then the first entry on the page is the high key for the
 * page.  We need to copy that to the right half.  Otherwise (meaning the
 * rightmost page case), all the items on the right half will be user
 * data.
 */

I believe that the reason we didn't find this bug prior to commit is that we only have a single index tuple with the wrong number of attributes after an initial root page split through insertions, and the next root page split masks the problem. Not 100% sure that that's why we missed it just yet, though.

This bug shouldn't be hard to fix. I'll take care of it as part of that post-commit review patch I'm working on.

--
Peter Geoghegan
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong. I also verified every other
> BTreeInnerTupleGetDownLink() caller. It now looks like everything is
> good here.

Right - it finds the right page by consulting the parent page, taking the next key.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Tue, Apr 10, 2018 at 5:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I did find another problem, though. Looks like the idea to explicitly
> represent the number of attributes directly has paid off already:
>
> pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
> create unique index cov_idx on covering_bug (f1) include(f2);
> insert into covering_bug select i, i * random() * 1000, i * random() *
> 100000 from generate_series(0,100000) i;
> [...]
> ERROR:  tuple has wrong number of attributes in index "cov_idx"

Actually, this was an error on my part (though I'd still maintain that the check paid off here!). I'll still add defensive assertions inside _bt_newroot(), and anywhere else that they're needed. There is no reason not to add defensive assertions to all code that handles page splits and needs to fetch a high key from some other page. We missed a few of those.

I'll add an item to the "Decisions to Recheck Mid-Beta" section of the open items page for this patch. We should review the decision to make a call to _bt_check_natts() within _bt_compare(). It might work just as well as an assertion, and it would be unfortunate if workloads that don't use covering indexes had to pay a price for the _bt_check_natts() call, even if it was a small price. I've seen _bt_compare() appear prominently in profiles quite a few times.

--
Peter Geoghegan
Peter Geoghegan wrote:
> Actually, this was an error on my part (though I'd still maintain that
> the check paid off here!). I'll still add defensive assertions inside
> _bt_newroot(), and anywhere else that they're needed. There is no
> reason not to add defensive assertions to all code that handles page
> splits and needs to fetch a high key from some other page. We missed a
> few of those.

Agree, I prefer to add more Asserts, maybe even more than actually needed. Assert-documented code :)

> I'll add an item to the "Decisions to Recheck Mid-Beta" section of the
> open items page for this patch. We should review the decision to make
> a call to _bt_check_natts() within _bt_compare(). It might work just
> as well as an assertion, and it would be unfortunate if workloads that
> don't use covering indexes had to pay a price for the
> _bt_check_natts() call, even if it was a small price. I've seen
> _bt_compare() appear prominently in profiles quite a few times.

Could you show a patch? I think we need to move _bt_check_natts() and its call under USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't pay for an unused feature.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Hi!

> On 12 Apr 2018, at 21:21, Teodor Sigaev <teodor@sigaev.ru> wrote:

I was adapting the tests for a GiST covering index and found out that the REINDEX test is not really a REINDEX test... I propose the following micropatch.

Best regards, Andrey Borodin.
Attachment
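A minimal sketch of what an actual REINDEX exercise for a covering index might look like in a regression test (all names hypothetical):

CREATE TABLE tbl_include (c1 int, c2 int, c3 int);
CREATE UNIQUE INDEX tbl_include_idx ON tbl_include (c1) INCLUDE (c2, c3);

REINDEX INDEX tbl_include_idx;

-- Verify the INCLUDE clause survived the rebuild.
SELECT pg_get_indexdef('tbl_include_idx'::regclass);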
On Thu, Apr 12, 2018 at 9:21 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Agree, I prefer to add more Asserts, maybe even more than actually
> needed. Assert-documented code :)

Absolutely. The danger with a feature like this is that we'll miss one place. I suppose that you could say that I am in the Poul-Henning Kamp camp on assertions [1].

> Could you show a patch?

Attached patch makes the changes that I talked about, and a few others. The commit message has full details. The general direction of the patch is that it documents our assumptions, and verifies them in more cases. Most of the changes I've made are clear improvements, though in a few cases I've made changes that are perhaps more debatable.

These other, more debatable cases are:

* The comments added to _bt_isequal() about suffix truncation may not be to your taste. The same is true of the way that I restored the previous _bt_isequal() function signature. (Yes, I want to change it back despite the fact that I was the person that originally suggested we change _bt_isequal().)

* I added BTreeTupSetNAtts() calls to a few places that don't truly need them, such as the point where we generate a dummy 0-attribute high key within _bt_mark_page_halfdead(). I think that we should try to be as consistent as possible about using BTreeTupSetNAtts(), to set a good example. I don't think it's necessary to use BTreeTupSetNAtts() for pivot tuples when the number of key attributes matches indnatts (it seems inconvenient to have to palloc() our own scratch buffer to do this when we don't have to), but that doesn't apply to these now-covered cases.

I imagine that you'll have no problem with the other changes in the patch, which is why I haven't mentioned them here. Let me know what you think.

> I think we need to move _bt_check_natts() and its call under
> USE_ASSERT_CHECKING to prevent performance degradation. Users shouldn't
> pay for an unused feature.

I eventually decided that you were right about this, and made the _bt_compare() call to _bt_check_natts() a simple assertion without waiting to hear more opinions on the matter. Concurrency isn't a factor here, so adding a check to standard release builds isn't particularly likely to detect bugs. Besides, there is really only a small number of places that need to do truncation for themselves. And, if you want to be sure that the structure is consistent in the field, there is always amcheck, which can check _bt_check_natts() (while also checking other things that we care about just as much).

Note that I removed some dead code from _bt_insertonpg() that wasn't added by the INCLUDE patch. It confused matters for this patch, since we don't want to consider what's supposed to happen when there is a retail insertion of a new, second negative infinity item -- clearly, that should simply never happen (I thought about adding a BTreeTupSetNAtts() call, but then decided to just remove the dead code and add a new "can't happen" elog error).

Finally, I made sure that we don't drop all tables in the regression tests, so that we have some pg_dump coverage for INCLUDE indexes, per a request from Tom.

[1] https://queue.acm.org/detail.cfm?id=2220317

--
Peter Geoghegan
Attachment
Peter Geoghegan <pg@bowt.ie> wrote:
> Attached patch makes the changes that I talked about, and a few others.
> The commit message has full details. The general direction of the patch
> is that it documents our assumptions, and verifies them in more cases.

Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc argument, not a relation? It doesn't need the number of key attributes anyway, only the total number of attributes. Then _bt_isequal() would be able to use BTreeTupGetNAtts().

> Note that I removed some dead code from _bt_insertonpg() that wasn't
> added by the INCLUDE patch.

I think it's completely OK to fix broken things when you've to touch them. Probably, Teodor would decide to make that by a separate commit. So, it's up to him.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Tue, Apr 17, 2018 at 3:12 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Hmm, what do you think about making BTreeTupGetNAtts() take a tupledesc
> argument, not a relation? It doesn't need the number of key attributes
> anyway, only the total number of attributes. Then _bt_isequal() would be
> able to use BTreeTupGetNAtts().

That would make the BTreeTupGetNAtts() assertions quite a bit more verbose, since there is usually no existing tuple descriptor variable, but there is almost always a "rel" variable. The coverage within _bt_isequal() does not seem important, because we only use it with the page high key in rare cases, where _bt_moveright() will already have tested the high key.

> I think it's completely OK to fix broken things when you've to touch
> them. Probably, Teodor would decide to make that by a separate commit.
> So, it's up to him.

You're right to say that this old negative infinity tuple code within _bt_insertonpg() is broken code, and not just dead code. The code doesn't just confuse things (e.g., see recent commit 2a67d644). It also seems like it could actually be harmful: this is code that could only ever corrupt your database.

I'm fine if Teodor wants to commit that change separately, of course.

--
Peter Geoghegan
I mostly agree with your patch - nice work - but I have some notices:

1) bt_target_page_check():

if (!P_RIGHTMOST(topaque) &&
    !_bt_check_natts(state->rel, state->target, P_HIKEY))

Seems not very obvious: it looks like we don't need to check nattrs on the rightmost page. Okay, I remember that on the rightmost page there is no hikey at all, but at least a comment should be added. Implicitly, bt_target_page_check() already takes into account 'is the page rightmost or not?' by using P_FIRSTDATAKEY, so it may be better to move the rightmost check into bt_target_page_check() with some refactoring of the if-logic:

if (offset > maxoff)
    return true; // nothing to check, also covers empty rightmost page

if (P_ISLEAF)
{
    if (offnum >= P_FIRSTDATAKEY)
        ...
    else /* if (offnum == P_HIKEY) */
        ...
}
else // !P_ISLEAF
{
    if (offnum == P_FIRSTDATAKEY)
        ...
    else if (offnum > P_FIRSTDATAKEY)
        ...
    else /* if (offnum == P_HIKEY) */
        ...
}

I see that only 3 nattrs values are possible: 0, nkey, and nkey+nincluded, but collapsing the if-clause to three branches causes difficulties for code readers. Let the compiler optimize that. Sorry for the late notice, but it only caught my attention when I noticed the (!P_RIGHTMOST && !_bt_check_natts) condition.

2) Style notice:

    ItemPointerSetInvalid(&trunctuple.t_tid);
+   BTreeTupSetNAtts(&trunctuple, 0);
    if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,

It's better to have a blank line between BTreeTupSetNAtts() and the if clause.

3) Naming: BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use the full word Tuple in the downlink macros; here we use just Tup. It seems better to have Tuple in both cases. Or Tup, but still in both cases.

4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits the BT_N_KEYS_OFFSET_MASK mask, and it should not touch the BT_RESERVED_OFFSET_MASK bits; right now it will overwrite those bits.

Attached patch is rebased to current head and contains some comment improvements in index_truncate_tuple() - you save some amount of memory with the TupleDescCopy() call but didn't explain why pfree() is enough to free all allocated memory.

Peter Geoghegan wrote:
> That would make the BTreeTupGetNAtts() assertions quite a bit more
> verbose, since there is usually no existing tuple descriptor variable,
> but there is almost always a "rel" variable. [...]

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
On Wed, Apr 18, 2018 at 10:10 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> I mostly agree with your patch - nice work - but I have some notices:

Thanks.

> 1) bt_target_page_check():
>
> if (!P_RIGHTMOST(topaque) &&
>     !_bt_check_natts(state->rel, state->target, P_HIKEY))
>
> Seems not very obvious: it looks like we don't need to check nattrs on
> the rightmost page. [...]

I don't understand. We do check the number of attributes on rightmost pages, but we do so separately, in the main loop, for every item that isn't the high key. This code appears before the main bt_target_page_check() loop because we're checking the high key itself, on its own, which is a new thing. The high key is also involved in the loop (on non-rightmost pages), but that's only because we check real items *against* the high key (we assume the high key is good and that the item might be bad). The high key is involved in every iteration of the main loop (on non-rightmost pages), rather than getting its own loop.

That said, I am quite happy if you want to put a comment about this being the rightmost page at the beginning of the check.

> 2) Style notice:
>
>     ItemPointerSetInvalid(&trunctuple.t_tid);
> +   BTreeTupSetNAtts(&trunctuple, 0);
>     if (PageAddItem(page, (Item) &trunctuple, sizeof(IndexTupleData), P_HIKEY,
>
> It's better to have a blank line between BTreeTupSetNAtts() and the if
> clause.

Sure.

> 3) Naming: BTreeTupGetNAtts/BTreeTupSetNAtts - several lines above we use
> the full word Tuple in the downlink macros; here we use just Tup. It seems
> better to have Tuple in both cases. Or Tup, but still in both cases.

+1

> 4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits the
> BT_N_KEYS_OFFSET_MASK mask, and it should not touch the
> BT_RESERVED_OFFSET_MASK bits; right now it will overwrite those bits.

An assertion sounds like it would be an improvement, though I don't see that in the patch you posted.

> Attached patch is rebased to current head and contains some comment
> improvements in index_truncate_tuple() - you save some amount of memory
> with the TupleDescCopy() call but didn't explain why pfree() is enough to
> free all allocated memory.

Makes sense.

--
Peter Geoghegan
> I don't understand. We do check the number of attributes on rightmost
> pages, but we do so separately, in the main loop, for every item that
> isn't the high key.

Comment added, please verify. I also refactored _bt_check_natts(); I hope it's a bit more readable now.

>> 4) BTreeTupSetNAtts - it seems better to add a check that nattrs fits the
>> BT_N_KEYS_OFFSET_MASK mask, and it should not touch the
>> BT_RESERVED_OFFSET_MASK bits; right now it will overwrite those bits.
>
> An assertion sounds like it would be an improvement, though I don't
> see that in the patch you posted.

I didn't do that in v1; sorry, I was unclear. The attached patch contains all the changes suggested in my previous email.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Attachment
On Wed, Apr 18, 2018 at 1:32 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Comment added, please verify. I also refactored _bt_check_natts(); I hope
> it's a bit more readable now.

The new comment looks good. Now I understand what you meant about _bt_check_natts(). And I agree that this is an improvement -- the extra verbosity is worth it.

> I didn't do that in v1; sorry, I was unclear. The attached patch contains
> all the changes suggested in my previous email.

The new BTreeTupSetNAtts() assertion looks good to me. I suggest committing this patch as-is.

Thank you.

--
Peter Geoghegan
On Wed, Apr 18, 2018 at 1:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I suggest committing this patch as-is.

Actually, I see one tiny issue with extra '*' characters here:

> + * The number of attributes won't be explicitly represented if the
> + * negative infinity tuple was generated during a page split that
> + * occurred with a version of Postgres before v11. There must be a
> + * problem when there is an explicit representation that is
> + * non-zero, * or when there is no explicit representation and the
> + * tuple is * evidently not a pre-pg_upgrade tuple.

I also suggest fixing this indentation before commit:

> + /*
> + *Cannot leak memory here, TupleDescCopy() doesn't allocate any
> + * inner structure, so, plain pfree() should clean all allocated memory
> + */

--
Peter Geoghegan
Thank you, pushed.

> Actually, I see one tiny issue with extra '*' characters here:
> [...]
>
> I also suggest fixing this indentation before commit:
> [...]

Fixed.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Wed, Apr 18, 2018 at 10:47 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
> Thank you, pushed.

Thanks.

I saw another preexisting issue, this time one that has been around since 2007. Commit bc292937 forgot to remove a comment above _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the bottom of _bt_findinsertloc(), where it remains today). The attached patch fixes this, and in passing mentions the fact that _bt_insertonpg() only performs retail insertions, and specifically never inserts high key items.

I don't think it's necessary to add something about negative infinity items to the same comment block. While it's true that _bt_insertonpg() cannot truncate downlinks to make new minus infinity items, I see that as a narrower issue.

--
Peter Geoghegan
Attachment
Thank you, pushed.

Peter Geoghegan wrote:
> I saw another preexisting issue, this time one that has been around
> since 2007. Commit bc292937 forgot to remove a comment above
> _bt_insertonpg() (the 'afteritem' stuff ended up being moved to the
> bottom of _bt_findinsertloc(), where it remains today). [...]

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
I'm wondering what the genesis of this conincluding column is, actually. As far as I can tell, the only reason this column is there is to be able to print the INCLUDE clause in a UNIQUE/PK constraint in ruleutils ... but surely the same list can be obtained from pg_index.indkey instead?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
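For what it's worth, a sketch of deriving the INCLUDE column list from pg_index alone, along the lines suggested here (the index name 'newidx' is an assumption; as of this patch pg_index carries indnkeyatts, so the included columns are simply the index attributes beyond that count):

SELECT a.attname
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indexrelid
WHERE i.indexrelid = 'newidx'::regclass
  AND a.attnum > i.indnkeyatts
ORDER BY a.attnum;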