Thread: pgsql: Make heap TID a tiebreaker nbtree index column.

pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
Make heap TID a tiebreaker nbtree index column.

Make nbtree treat all index tuples as having a heap TID attribute.
Index searches can distinguish duplicates by heap TID, since heap TID is
always guaranteed to be unique.  This general approach has numerous
benefits for performance, and is prerequisite to teaching VACUUM to
perform "retail index tuple deletion".

Naively adding a new attribute to every pivot tuple has unacceptable
overhead (it bloats internal pages), so suffix truncation of pivot
tuples is added.  This will usually truncate away the "extra" heap TID
attribute from pivot tuples during a leaf page split, and may also
truncate away additional user attributes.  This can increase fan-out,
especially in a multi-column index.  Truncation can only occur at the
attribute granularity, which isn't particularly effective, but works
well enough for now.  A future patch may add support for truncating
"within" text attributes by generating truncated key values using new
opclass infrastructure.

Only new indexes (BTREE_VERSION 4 indexes) will have insertions that
treat heap TID as a tiebreaker attribute, or will have pivot tuples
undergo suffix truncation during a leaf page split (on-disk
compatibility with versions 2 and 3 is preserved).  Upgrades to version
4 cannot be performed on-the-fly, unlike upgrades from version 2 to
version 3.  contrib/amcheck continues to work with version 2 and 3
indexes, while also enforcing stricter invariants when verifying version
4 indexes.  These stricter invariants are the same invariants described
by "3.1.12 Sequencing" from the Lehman and Yao paper.

A later patch will enhance the logic used by nbtree to pick a split
point.  This patch is likely to negatively impact performance without
smarter choices around the precise point to split leaf pages at.  Making
these two mostly-distinct sets of enhancements into distinct commits
seems like it might clarify their design, even though neither commit is
particularly useful on its own.

The maximum allowed size of new tuples is reduced by an amount equal to
the space required to store an extra MAXALIGN()'d TID in a new high key
during leaf page splits.  The user-facing definition of the "1/3 of a
page" restriction is already imprecise, and so does not need to be
revised.  However, there should be a compatibility note in the v12
release notes.

Author: Peter Geoghegan
Reviewed-By: Heikki Linnakangas, Alexander Korotkov
Discussion: https://postgr.es/m/CAH2-WzkVb0Kom=R+88fDFb=JSxZMFvbHVC6Mn9LJ2n=X=kS-Uw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/dd299df8189bd00fbe54b72c64f43b6af2ffeccd

Modified Files
--------------
contrib/amcheck/expected/check_btree.out     |   5 +-
contrib/amcheck/sql/check_btree.sql          |   5 +-
contrib/amcheck/verify_nbtree.c              | 341 +++++++++++++++++++---
contrib/pageinspect/btreefuncs.c             |   2 +-
contrib/pageinspect/expected/btree.out       |   2 +-
contrib/pgstattuple/expected/pgstattuple.out |  10 +-
doc/src/sgml/indices.sgml                    |  24 +-
src/backend/access/common/indextuple.c       |   6 +-
src/backend/access/nbtree/README             | 124 ++++----
src/backend/access/nbtree/nbtinsert.c        | 407 ++++++++++++++++----------
src/backend/access/nbtree/nbtpage.c          | 206 +++++++++-----
src/backend/access/nbtree/nbtree.c           |   2 +-
src/backend/access/nbtree/nbtsearch.c        | 106 ++++++-
src/backend/access/nbtree/nbtsort.c          |  91 +++---
src/backend/access/nbtree/nbtutils.c         | 410 ++++++++++++++++++++++++---
src/backend/access/nbtree/nbtxlog.c          |  47 +--
src/backend/access/rmgrdesc/nbtdesc.c        |   8 -
src/backend/utils/sort/tuplesort.c           |  13 +-
src/include/access/nbtree.h                  | 213 +++++++++++---
src/include/access/nbtxlog.h                 |  35 +--
src/test/regress/expected/btree_index.out    |  34 +--
src/test/regress/expected/create_index.out   |  13 +-
src/test/regress/expected/dependency.out     |   4 +-
src/test/regress/expected/event_trigger.out  |   4 +-
src/test/regress/expected/foreign_data.out   |   9 +-
src/test/regress/expected/rowsecurity.out    |   4 +-
src/test/regress/sql/btree_index.sql         |  37 +--
src/test/regress/sql/create_index.sql        |  14 +-
src/test/regress/sql/foreign_data.sql        |   2 +-
29 files changed, 1619 insertions(+), 559 deletions(-)


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Make heap TID a tiebreaker nbtree index column.

I see that this has caused SELinux test failures on rhinoceros (the
ddl test fails). It looks like the output order is affected by the
implementation details of nbtree, likely for some system catalog
index. This is something that I've had to deal with in other places
(with a lot of help from Tom).

It's not easy for me to set up SELinux to fix the issue. Is there
somebody comfortable with that area that could confirm I have it
right, and provide me with the actual test output? Or push a fix
themselves?

Thanks
-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Mar 20, 2019 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:
>> Make heap TID a tiebreaker nbtree index column.

> I see that this has caused SELinux test failures on rhinoceros (the
> ddl test fails). It looks like the output order is affected by the
> implementation details of nbtree, likely for some system catalog
> index. This is something that I've had to deal with in other places
> (with a lot of help from Tom).

The diffs all are related to the order of operations in a DROP OWNED BY
command, so I think blaming SELinux is just blaming the messenger.
This comes down to the point that we didn't do anything to ensure
drop order stability in shdepend-driven drops.  Maybe we need to be
honest about that.  Or do you have reason to think that your changes
will result in stability in that drop order anyway?  If so, why?

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Andres Freund
Date:
Hi,

On 2019-03-20 11:00:06 -0700, Peter Geoghegan wrote:
> On Wed, Mar 20, 2019 at 10:05 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > Make heap TID a tiebreaker nbtree index column.
> 
> I see that this has caused SELinux test failures on rhinoceros (the
> ddl test fails). It looks like the output order is affected by the
> implementation details of nbtree, likely for some system catalog
> index. This is something that I've had to deal with in other places
> (with a lot of help from Tom).
> 
> It's not easy for me to set up SELinux to fix the issue. Is there
> somebody comfortable with that area that could confirm I have it
> right, and provide me with the actual test output? Or push a fix
> themselves?

FWIW, I just fix up the tests using the regression output from
rhinoceros when that happens. Sometimes takes more than a single round,
but it builds frequently enough...

Greetings,

Andres Freund


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 11:08 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The diffs all are related to the order of operations in a DROP OWNED BY
> command, so I think blaming SELinux is just blaming the messenger.
> This comes down to the point that we didn't do anything to ensure
> drop order stability in shdepend-driven drops.  Maybe we need to be
> honest about that.

You can certainly make that argument. However, we still wouldn't make
broad guarantees about test stability that ensure that this general
category of flappiness goes away forever (IIRC the role stuff still
has the odd problem). I suppose that you could scope the guarantees as
being about pg_depend and pg_shdepend, but that seems arbitrary to me.

Your work on test stability probably eliminated 98% of the problems.
It's still early, but the buildfarm is mostly fine.

> Or do you have reason to think that your changes
> will result in stability in that drop order anyway?  If so, why?

I strongly suspect that my changes will leave the output about as
stable as before, or will slightly improve matters. Duplicates are now
in ascending order, as opposed to very close to descending order. I
think that it would be fine to rely on the new output ordering, even
though it theoretically isn't any more stable.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I just fix up the tests using the regression output from
> rhinoceros when that happens. Sometimes takes more than a single round,
> but it builds frequently enough...

I'll give that a go, provided Tom is okay with it.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 11:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Your work on test stability probably eliminated 98% of the problems.
> It's still early, but the buildfarm is mostly fine.

batfish just had a similar failure, this time in foreign_data -- two
lines of "DETAIL" output appear in opposite-of-expected order.
Obviously that's unstable in a way that it wasn't before now, since
every other animal doesn't have that problem.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Mar 20, 2019 at 11:30 AM Peter Geoghegan <pg@bowt.ie> wrote:
>> Your work on test stability probably eliminated 98% of the problems.
>> It's still early, but the buildfarm is mostly fine.

> batfish just had a similar failure, this time in foreign_data -- two
> lines of "DETAIL" output appear in opposite-of-expected order.
> Obviously that's unstable in a way that it wasn't before now, since
> every other animal doesn't have that problem.

Yeah.  My opinion is that we should just qsort the list of targets
during DROP OWNED and be done with this.  I'll post a patch shortly.

In the meantime, would you mind cleaning this up:

nbtxlog.c: In function 'btree_xlog_split':
nbtxlog.c:269: warning: 'newitem' may be used uninitialized in this function
nbtxlog.c:269: note: 'newitem' was declared here
nbtxlog.c:271: warning: 'newitemsz' may be used uninitialized in this function
nbtxlog.c:271: note: 'newitemsz' was declared here


            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 1:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In the meantime, would you mind cleaning this up:
>
> nbtxlog.c: In function 'btree_xlog_split':
> nbtxlog.c:269: warning: 'newitem' may be used uninitialized in this function
> nbtxlog.c:269: note: 'newitem' was declared here
> nbtxlog.c:271: warning: 'newitemsz' may be used uninitialized in this function
> nbtxlog.c:271: note: 'newitemsz' was declared here

Sure. Will do that one next.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 1:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  My opinion is that we should just qsort the list of targets
> during DROP OWNED and be done with this.  I'll post a patch shortly.

Sounds good.

Barring any objections, I will get the buildfarm completely green
again shortly by adjusting the output, rather than waiting for your
patch. Whether or not the verbosity hack can be ripped out can be
considered later, in a separate pass.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
I wrote:
> Yeah.  My opinion is that we should just qsort the list of targets
> during DROP OWNED and be done with this.  I'll post a patch shortly.

Here's one way we could do this: add a flag to performMultipleDeletions
to invoke the sort.  A small problem with it is that if we sort the
ObjectAddresses in-place as this does, we really oughta remove the
"const" from that argument.  (I believe compilers would probably not
realize we were cheating if we didn't, but it'd still be cheating.)
That doesn't affect any existing callers, but in future it might
prevent const-ifying something or other.  Another way we could handle
this is to make dependency.c expose a separate function
"void sortObjectAddresses(ObjectAddresses *objects)", and then have
callers call that as a separate action.  Not sure which I like better
... any opinions?

A preliminary check-world run finds no places where the output changes,
but I've not tried this with SELinux yet.

            regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 2048d71..7a969d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -365,9 +365,17 @@ performDeletion(const ObjectAddress *object,
  * list of objects that would be implicitly dropped, for each object to be
  * dropped, is the union of the implicit-object list for all objects.  This
  * makes each check be more relaxed.
+ *
+ * Also, this honors one additional flags bit:
+ *
+ * PERFORM_DELETION_SORTEDLY: sort the objects before beginning deletion.
+ * The major sort key is OID-descending, so that newer objects will be deleted
+ * first in most cases.  This is primarily useful for ensuring stable outputs
+ * from regression tests; it's not recommended if the order of the objects is
+ * determined by user input, such as the order of targets in a DROP command.
  */
 void
-performMultipleDeletions(const ObjectAddresses *objects,
+performMultipleDeletions(ObjectAddresses *objects,
                          DropBehavior behavior, int flags)
 {
     Relation    depRel;
@@ -378,6 +386,12 @@ performMultipleDeletions(const ObjectAddresses *objects,
     if (objects->numrefs <= 0)
         return;

+    /* Presort objects if requested */
+    if ((flags & PERFORM_DELETION_SORTEDLY) && objects->numrefs > 1)
+        qsort((void *) objects->refs, objects->numrefs,
+              sizeof(ObjectAddress),
+              object_address_comparator);
+
     /*
      * We save some cycles by opening pg_depend just once and passing the
      * Relation pointer down to all the recursive deletion steps.
diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 1619c1c..dbd1d4b 100644
--- a/src/backend/catalog/pg_shdepend.c
+++ b/src/backend/catalog/pg_shdepend.c
@@ -1266,8 +1266,12 @@ shdepDropOwned(List *roleids, DropBehavior behavior)
         systable_endscan(scan);
     }

-    /* the dependency mechanism does the actual work */
-    performMultipleDeletions(deleteobjs, behavior, 0);
+    /*
+     * The dependency mechanism does the actual work.  For stability of
+     * output, sort the objects into approximate reverse creation order
+     * beforehand.
+     */
+    performMultipleDeletions(deleteobjs, behavior, PERFORM_DELETION_SORTEDLY);

     table_close(sdepRel, RowExclusiveLock);

diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index b235a23..47a4f73 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -136,6 +136,7 @@ typedef enum ObjectClass
 #define PERFORM_DELETION_QUIETLY            0x0004    /* suppress notices */
 #define PERFORM_DELETION_SKIP_ORIGINAL        0x0008    /* keep original obj */
 #define PERFORM_DELETION_SKIP_EXTENSIONS    0x0010    /* keep extensions */
+#define PERFORM_DELETION_SORTEDLY            0x0020    /* presort objects */


 /* in dependency.c */
@@ -143,7 +144,7 @@ typedef enum ObjectClass
 extern void performDeletion(const ObjectAddress *object,
                 DropBehavior behavior, int flags);

-extern void performMultipleDeletions(const ObjectAddresses *objects,
+extern void performMultipleDeletions(ObjectAddresses *objects,
                          DropBehavior behavior, int flags);

 extern void recordDependencyOnExpr(const ObjectAddress *depender,

Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
I wrote:
> A preliminary check-world run finds no places where the output changes,
> but I've not tried this with SELinux yet.

After trying it (against yesterday's sources) on my SELinux-capable
machine, I see no evidence that we need any output ordering changes
at all if we go this route.  This is probably unsurprising considering
that the old btree code used to provide mostly-reverse-insertion-order
scan order.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After trying it (against yesterday's sources) on my SELinux-capable
> machine, I see no evidence that we need any output ordering changes
> at all if we go this route.  This is probably unsurprising considering
> that the old btree code used to provide mostly-reverse-insertion-order
> scan order.

That's good. I'm trying to fix it by hand right now, in the way that
Andres suggested. It is both tedious and error-prone.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Mar 20, 2019 at 2:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After trying it (against yesterday's sources) on my SELinux-capable
>> machine, I see no evidence that we need any output ordering changes
>> at all if we go this route.  This is probably unsurprising considering
>> that the old btree code used to provide mostly-reverse-insertion-order
>> scan order.

> That's good. I'm trying to fix it by hand right now, in the way that
> Andres suggested. It is both tedious and error-prone.

Yeah.  Don't do that.

After further thought I think I'll go with the alternate solution
(separate sortObjectAddresses function) as that could possibly have
other uses, and removing the "const" from performMultipleDeletions
seems a bit bletcherous.  Will push a fix in a few minutes.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
I wrote:
> Will push a fix in a few minutes.

And done.  Should be possible to revert 7d3bf73ac, if you wish.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And done.  Should be possible to revert 7d3bf73ac, if you wish.

Will do.

Thanks!
-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Mar 20, 2019 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > And done.  Should be possible to revert 7d3bf73ac, if you wish.
>
> Will do.

Actually, I'm not sure why it would be fine to revert 7d3bf73ac now.
Might the problem actually be the order in which OIDs are originally
assigned, or something like that?

I neglected to test SEPostgres myself, but if I did I think that I
would have included the changes, avoiding the problem on rhinoceros.
IOW, I believe that it wasn't a test flappiness issue, whereas the
problem on batfish clearly is.


--
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Mar 20, 2019 at 3:11 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> On Wed, Mar 20, 2019 at 3:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> And done.  Should be possible to revert 7d3bf73ac, if you wish.

> Actually, I'm not sure why it would be fine to revert 7d3bf73ac now.
> Might the problem actually be the order in which OIDs are originally
> assigned, or something like that?

No, because then things would have been unstable before, no?

I actually think that we should remove most or all of the
cascade-drop-hiding hacks that are in the regression tests now,
not only that one.  They should not be necessary any more, and
they might be hiding things we need to know about, now or in
the future.  But I haven't got round to it.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Wed, Mar 20, 2019 at 8:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Actually, I'm not sure why it would be fine to revert 7d3bf73ac now.
> > Might the problem actually be the order in which OIDs are originally
> > assigned, or something like that?
>
> No, because then things would have been unstable before, no?

Perhaps. The cost of being wrong here is trivial anyway. I have
reverted 7d3bf73ac based on the assumption that it's now unnecessary.

> I actually think that we should remove most or all of the
> cascade-drop-hiding hacks that are in the regression tests now,
> not only that one.  They should not be necessary any more, and
> they might be hiding things we need to know about, now or in
> the future.  But I haven't got round to it.

Seems like a useful goal.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Mar 20, 2019 at 8:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Actually, I'm not sure why it would be fine to revert 7d3bf73ac now.
>>> Might the problem actually be the order in which OIDs are originally
>>> assigned, or something like that?

>> No, because then things would have been unstable before, no?

> Perhaps. The cost of being wrong here is trivial anyway. I have
> reverted 7d3bf73ac based on the assumption that it's now unnecessary.

Apparently, that case is indeed unstable, cf

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fulmar&dt=2019-03-22%2016%3A15%3A14

I picked up on that because I've also seen it happen on my own
devel machine today, but just once --- I then tried to reproduce it,
but couldn't in several dozen tries.

I'm fairly baffled as to why the output order would be unstable
given the sort, and even more as to why the instability didn't
emerge before.  Any thoughts?

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
I wrote:
> I'm fairly baffled as to why the output order would be unstable
> given the sort, and even more as to why the instability didn't
> emerge before.  Any thoughts?

Well, after actually looking at the code, the answer to the first
part of that is obvious: the printed dependency list comes from
checkSharedDependencies(), which is not where I put the sort.
(That is, I fixed DROP OWNED BY, but this error is coming from
DROP ROLE which is not the same code path.)

One might reasonably ask why it's not the same code path, perhaps,
but I'm not really excited about trying to change that right now.

Anyway, looking through our regression tests, there are a total
of four cases where a DROP ROLE command fails with a DETAIL that
mentions more than one object and so might be vulnerable to this
type of problem.  (There are several more where we already used
verbosity-reduction to suppress the DETAIL.)

Options:

1. Adjust checkSharedDependencies to sort before constructing the
message.  This might have issues if there are lots and lots of
dependencies, but then again we already would have issues due to
the size of the error message.

1a. As above, but add code to limit to limit the number of
dependencies that are stored/sorted/described, even in the
verbose string sent to the server log.

2. De-revert 7d3bf73ac and hope the other cases aren't problems.

3. Change all of those cases to suppress the DETAIL.

1a seems a bit ambitious but maybe it's worth doing, considering
that right now there's a non-negligible chance of OOM if you
try to drop a role that owns a huge number of objects, or just
plain failure due to the stringinfo buffer for the message getting
past MaxAllocSize.  Sending a gigabyte-sized message to the server
log could be pretty unfriendly in some contexts, too.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Fri, Mar 22, 2019 at 9:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Apparently, that case is indeed unstable, cf
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fulmar&dt=2019-03-22%2016%3A15%3A14

I had a feeling that that would happen.

> I'm fairly baffled as to why the output order would be unstable
> given the sort, and even more as to why the instability didn't
> emerge before.  Any thoughts?

 DROP ROLE seems to be where all the remaining problems were, and the
order in the regression tests didn't change back to the old order when
you pushed the change to DROP OWNED BY. Something seemed amiss to me
for these reasons.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Fri, Mar 22, 2019 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Options:

Obviously if we were going to de-revert 7d3bf73ac then it wouldn't
quite be a straight de-revert. There would also be a few new comments
about the issue in relevant test files.

> 1a seems a bit ambitious but maybe it's worth doing, considering
> that right now there's a non-negligible chance of OOM if you
> try to drop a role that owns a huge number of objects, or just
> plain failure due to the stringinfo buffer for the message getting
> past MaxAllocSize.  Sending a gigabyte-sized message to the server
> log could be pretty unfriendly in some contexts, too.

All of these options seem acceptable. However, the problem is unlikely
to get any worse, so going to the trouble of option 1 or 1a might not
be the best use of time.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Mar 22, 2019 at 10:13 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 1a seems a bit ambitious but maybe it's worth doing, considering
>> that right now there's a non-negligible chance of OOM if you
>> try to drop a role that owns a huge number of objects, or just
>> plain failure due to the stringinfo buffer for the message getting
>> past MaxAllocSize.  Sending a gigabyte-sized message to the server
>> log could be pretty unfriendly in some contexts, too.

> All of these options seem acceptable. However, the problem is unlikely
> to get any worse, so going to the trouble of option 1 or 1a might not
> be the best use of time.

Yeah.  After thinking a bit more, the OOM hazard is probably pretty
far-fetched --- even a role owning hundreds of thousands of objects
wouldn't accumulate more than a small number of megabytes of DETAIL.
You could still argue that sending such a message to the server log
is a bad idea, perhaps, but I'm willing to let it go until we see
actual field complaints about it.

So that means that the de-revert is probably the best option for
now.  Will you do the honors?

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Fri, Mar 22, 2019 at 10:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So that means that the de-revert is probably the best option for
> now.  Will you do the honors?

I'll take care of that shortly.

Thanks
-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
So here's another failure of the same ilk:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urocryon&dt=2019-03-24%2006%3A12%3A35

How come these results seem to be less stable than they used to be?

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Sun, Mar 24, 2019 at 8:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So here's another failure of the same ilk:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urocryon&dt=2019-03-24%2006%3A12%3A35

I can fix this one the same way as I fixed the first two. That will
mean that three out of the four tests with DROP ROLE statements whose
output needed to be changed as part of commit dd299df81 will have had
their DETAIL output suppressed. It's still possible that the last
instance of such a change (rowsecurity.out) will have a failure like
this too. That will be the end of it, barring some new manifestation
of the problem that we haven't seen already.

Clearly I underestimated the likelihood of problems like this, but at
least it's limited to DROP ROLE's DETAIL output. There just isn't that
many tests that could be destabilized.

> How come these results seem to be less stable than they used to be?

Now that heap TID is just another column internally, the stability of
results like this is dependent on whatever the process is through
which TIDs are generated. They're not going to be placed at the first
offset >= insertion scankey, which was what almost always happened
before dd299df81. I think that pruning could affect the results, for
example. Previously, it was mostly a matter of index insertion order,
which looked like it made heap TIDs among duplicates be in descending
order, though that certainly didn't happen reliably.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Sun, Mar 24, 2019 at 9:47 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 24, 2019 at 8:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So here's another failure of the same ilk:
> >
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urocryon&dt=2019-03-24%2006%3A12%3A35
>
> I can fix this one the same way as I fixed the first two.

I see that locust just had precisely the same dependency.out failure
as urocryon just now:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-03-24%2015%3A28%3A48

Looks like the actual output is exactly the same on each animal.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Mar 24, 2019 at 8:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So here's another failure of the same ilk:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urocryon&dt=2019-03-24%2006%3A12%3A35

> I can fix this one the same way as I fixed the first two. That will
> mean that three out of the four tests with DROP ROLE statements whose
> output needed to be changed as part of commit dd299df81 will have had
> their DETAIL output suppressed. It's still possible that the last
> instance of such a change (rowsecurity.out) will have a failure like
> this too.

At this point I no longer have any faith in the approach of "suppress
DETAIL only where we've actually been burnt".  I think we should
either go ahead and suppress DETAIL in all four places, or bite the
bullet and change the DROP ROLE code as I sketched upthread.

I'm not sure how much test coverage we're really losing if we
suppress DETAIL in all these places.  We would still have test output
from assorted places where DROP ROLE cascades to just one object,
so from a pure code-coverage standpoint doing that probably isn't
going to be too awful.

However, I don't really like the fact that we're setting up a booby
trap for all future tests involving DROP ROLE.  I think if we leave
it like this, people are going to add new test cases that have
slightly unstable output and are going to learn about it the hard
way, just as we're doing now.  When you take the long view of it,
there's definitely something to be said for expending the effort
to make the DROP ROLE results stable.

When I was looking at this on Friday, I thought it wouldn't be that
hard to make the results stable, at least up to whatever cutoff we
want to set on how many objects to sort.  But per previous discussion,
maybe we don't need an explicit limit?  The stringinfo describing the
objects is going to consume a good bit more memory than an ObjectAddress
array in any case, so if we're not going to sweat about OOM for the
message then I'm not sure we need to be paranoid about the sort memory.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Sun, Mar 24, 2019 at 10:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I can fix this one the same way as I fixed the first two. That will
> > mean that three out of the four tests with DROP ROLE statements whose
> > output needed to be changed as part of commit dd299df81 will have had
> > their DETAIL output suppressed. It's still possible that the last
> > instance of such a change (rowsecurity.out) will have a failure like
> > this too.
>
> At this point I no longer have any faith in the approach of "suppress
> DETAIL only where we've actually been burnt".  I think we should
> either go ahead and suppress DETAIL in all four places, or bite the
> bullet and change the DROP ROLE code as I sketched upthread.

All that it will take to fix dependency.sql is to move an existing
\set VERBOSITY terse a few lines earlier.

I would like to suppress the dependency.sql stability issue right
away. I can also suppress the rowsecurity.out output in the same
commit. I want to fix the problem on the buildfarm first, unless your
well-principled approach won't take very long. Which seems unlikely.
You can revert these commits if and when they become unnecessary.

> However, I don't really like the fact that we're setting up a booby
> trap for all future tests involving DROP ROLE.  I think if we leave
> it like this, people are going to add new test cases that have
> slightly unstable output and are going to learn about it the hard
> way, just as we're doing now.  When you take the long view of it,
> there's definitely something to be said for expending the effort
> to make the DROP ROLE results stable.

That seems like a good argument to me.

> When I was looking at this on Friday, I thought it wouldn't be that
> hard to make the results stable, at least up to whatever cutoff we
> want to set on how many objects to sort.  But per previous discussion,
> maybe we don't need an explicit limit?  The stringinfo describing the
> objects is going to consume a good bit more memory than an ObjectAddress
> array in any case, so if we're not going to sweat about OOM for the
> message then I'm not sure we need to be paranoid about the sort memory.

I agree that it isn't worth worrying about an OOM for the sort here.

-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Mar 24, 2019 at 10:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> At this point I no longer have any faith in the approach of "suppress
>> DETAIL only where we've actually been burnt".  I think we should
>> either go ahead and suppress DETAIL in all four places, or bite the
>> bullet and change the DROP ROLE code as I sketched upthread.

> I would like to suppress the dependency.sql stability issue right
> away. I can also suppress the rowsecurity.out output in the same
> commit. I want to fix the problem on the buildfarm first, unless your
> well-principled approach won't take very long. Which seems unlikely.

I think I can probably get that done today, but if you don't want to
wait, feel free to put in the detail-suppression for now.

>> When I was looking at this on Friday, I thought it wouldn't be that
>> hard to make the results stable, at least up to whatever cutoff we
>> want to set on how many objects to sort.  But per previous discussion,
>> maybe we don't need an explicit limit?  The stringinfo describing the
>> objects is going to consume a good bit more memory than an ObjectAddress
>> array in any case, so if we're not going to sweat about OOM for the
>> message then I'm not sure we need to be paranoid about the sort memory.

> I agree that it isn't worth worrying about an OOM for the sort here.

If we aren't going to worry about that then I think it probably becomes
a pretty simple change.  Will look in a little bit.

            regards, tom lane


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Sun, Mar 24, 2019 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I would like to suppress the dependency.sql stability issue right
> > away. I can also suppress the rowsecurity.out output in the same
> > commit. I want to fix the problem on the buildfarm first, unless your
> > well-principled approach won't take very long. Which seems unlikely.
>
> I think I can probably get that done today, but if you don't want to
> wait, feel free to put in the detail-suppression for now.

I'll monitor the situation, and proceed with a stopgap
detail-suppression fix this evening if, for whatever reason, it seems
necessary.

Thanks
-- 
Peter Geoghegan


Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Sun, Mar 24, 2019 at 11:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think I can probably get that done today, but if you don't want to
>> wait, feel free to put in the detail-suppression for now.

> I'll monitor the situation, and proceed with a stopgap
> detail-suppression fix this evening if, for whatever reason, it seems
> necessary.

It turns out we can't use the ObjectAddresses infrastructure because
it doesn't store enough information.  So that means an extra qsort
comparator function, which is slightly annoying, but it's still not
a huge amount of code.

This flips one expected result to another order from what it was.
(I experimented with OID-descending sort, but that flips two
other expected results.)

If no objections, I'll push shortly.

            regards, tom lane

diff --git a/src/backend/catalog/pg_shdepend.c b/src/backend/catalog/pg_shdepend.c
index 63a7f48..5b74040 100644
*** a/src/backend/catalog/pg_shdepend.c
--- b/src/backend/catalog/pg_shdepend.c
*************** typedef enum
*** 74,79 ****
--- 74,86 ----
      REMOTE_OBJECT
  } SharedDependencyObjectType;

+ typedef struct
+ {
+     ObjectAddress object;
+     char        deptype;
+     SharedDependencyObjectType objtype;
+ } ShDepObjectInfo;
+
  static void getOidListDiff(Oid *list1, int *nlist1, Oid *list2, int *nlist2);
  static Oid    classIdGetDbId(Oid classId);
  static void shdepChangeDep(Relation sdepRel,
*************** typedef struct
*** 497,502 ****
--- 504,548 ----
  } remoteDep;

  /*
+  * qsort comparator for ShDepObjectInfo items
+  */
+ static int
+ shared_dependency_comparator(const void *a, const void *b)
+ {
+     const ShDepObjectInfo *obja = (const ShDepObjectInfo *) a;
+     const ShDepObjectInfo *objb = (const ShDepObjectInfo *) b;
+
+     /*
+      * Primary sort key is OID ascending.
+      */
+     if (obja->object.objectId < objb->object.objectId)
+         return -1;
+     if (obja->object.objectId > objb->object.objectId)
+         return 1;
+
+     /*
+      * Next sort on catalog ID, in case identical OIDs appear in different
+      * catalogs.  Sort direction is pretty arbitrary here.
+      */
+     if (obja->object.classId < objb->object.classId)
+         return -1;
+     if (obja->object.classId > objb->object.classId)
+         return 1;
+
+     /*
+      * Last, sort on object subId.
+      *
+      * We sort the subId as an unsigned int so that 0 (the whole object) will
+      * come first.
+      */
+     if ((unsigned int) obja->object.objectSubId < (unsigned int) objb->object.objectSubId)
+         return -1;
+     if ((unsigned int) obja->object.objectSubId > (unsigned int) objb->object.objectSubId)
+         return 1;
+     return 0;
+ }
+
+ /*
   * checkSharedDependencies
   *
   * Check whether there are shared dependency entries for a given shared
*************** checkSharedDependencies(Oid classId, Oid
*** 531,536 ****
--- 577,585 ----
      List       *remDeps = NIL;
      ListCell   *cell;
      ObjectAddress object;
+     ShDepObjectInfo *objects;
+     int            numobjects;
+     int            allocedobjects;
      StringInfoData descs;
      StringInfoData alldescs;

*************** checkSharedDependencies(Oid classId, Oid
*** 538,546 ****
--- 587,603 ----
       * We limit the number of dependencies reported to the client to
       * MAX_REPORTED_DEPS, since client software may not deal well with
       * enormous error strings.  The server log always gets a full report.
+      *
+      * For stability of regression test results, we sort local and shared
+      * objects by OID before reporting them.  We don't worry about the order
+      * in which other databases are reported, though.
       */
  #define MAX_REPORTED_DEPS 100

+     allocedobjects = 128;        /* arbitrary initial array size */
+     objects = (ShDepObjectInfo *)
+         palloc(allocedobjects * sizeof(ShDepObjectInfo));
+     numobjects = 0;
      initStringInfo(&descs);
      initStringInfo(&alldescs);

*************** checkSharedDependencies(Oid classId, Oid
*** 580,615 ****

          /*
           * If it's a dependency local to this database or it's a shared
!          * object, describe it.
           *
           * If it's a remote dependency, keep track of it so we can report the
           * number of them later.
           */
!         if (sdepForm->dbid == MyDatabaseId)
!         {
!             if (numReportedDeps < MAX_REPORTED_DEPS)
!             {
!                 numReportedDeps++;
!                 storeObjectDescription(&descs, LOCAL_OBJECT, &object,
!                                        sdepForm->deptype, 0);
!             }
!             else
!                 numNotReportedDeps++;
!             storeObjectDescription(&alldescs, LOCAL_OBJECT, &object,
!                                    sdepForm->deptype, 0);
!         }
!         else if (sdepForm->dbid == InvalidOid)
          {
!             if (numReportedDeps < MAX_REPORTED_DEPS)
              {
!                 numReportedDeps++;
!                 storeObjectDescription(&descs, SHARED_OBJECT, &object,
!                                        sdepForm->deptype, 0);
              }
!             else
!                 numNotReportedDeps++;
!             storeObjectDescription(&alldescs, SHARED_OBJECT, &object,
!                                    sdepForm->deptype, 0);
          }
          else
          {
--- 637,662 ----

          /*
           * If it's a dependency local to this database or it's a shared
!          * object, add it to the objects array.
           *
           * If it's a remote dependency, keep track of it so we can report the
           * number of them later.
           */
!         if (sdepForm->dbid == MyDatabaseId ||
!             sdepForm->dbid == InvalidOid)
          {
!             if (numobjects >= allocedobjects)
              {
!                 allocedobjects *= 2;
!                 objects = (ShDepObjectInfo *)
!                     repalloc(objects,
!                              allocedobjects * sizeof(ShDepObjectInfo));
              }
!             objects[numobjects].object = object;
!             objects[numobjects].deptype = sdepForm->deptype;
!             objects[numobjects].objtype = (sdepForm->dbid == MyDatabaseId) ?
!                 LOCAL_OBJECT : SHARED_OBJECT;
!             numobjects++;
          }
          else
          {
*************** checkSharedDependencies(Oid classId, Oid
*** 648,653 ****
--- 695,727 ----
      table_close(sdepRel, AccessShareLock);

      /*
+      * Sort and report local and shared objects.
+      */
+     if (numobjects > 1)
+         qsort((void *) objects, numobjects,
+               sizeof(ShDepObjectInfo), shared_dependency_comparator);
+
+     for (int i = 0; i < numobjects; i++)
+     {
+         if (numReportedDeps < MAX_REPORTED_DEPS)
+         {
+             numReportedDeps++;
+             storeObjectDescription(&descs,
+                                    objects[i].objtype,
+                                    &objects[i].object,
+                                    objects[i].deptype,
+                                    0);
+         }
+         else
+             numNotReportedDeps++;
+         storeObjectDescription(&alldescs,
+                                objects[i].objtype,
+                                &objects[i].object,
+                                objects[i].deptype,
+                                0);
+     }
+
+     /*
       * Summarize dependencies in remote databases.
       */
      foreach(cell, remDeps)
*************** checkSharedDependencies(Oid classId, Oid
*** 670,675 ****
--- 744,750 ----
                                 SHARED_DEPENDENCY_INVALID, dep->count);
      }

+     pfree(objects);
      list_free_deep(remDeps);

      if (descs.len == 0)
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8d31110..2f04b71 100644
*** a/src/test/regress/expected/dependency.out
--- b/src/test/regress/expected/dependency.out
*************** FROM pg_type JOIN pg_class c ON typrelid
*** 128,135 ****
  -- doesn't work: grant still exists
  DROP USER regress_dep_user1;
  ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
! DETAIL:  privileges for table deptest1
! privileges for database regression
  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
  DROP OWNED BY regress_dep_user1;
  DROP USER regress_dep_user1;
--- 128,135 ----
  -- doesn't work: grant still exists
  DROP USER regress_dep_user1;
  ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
! DETAIL:  privileges for database regression
! privileges for table deptest1
  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
  DROP OWNED BY regress_dep_user1;
  DROP USER regress_dep_user1;

Re: pgsql: Make heap TID a tiebreaker nbtree index column.

From
Peter Geoghegan
Date:
On Sun, Mar 24, 2019 at 1:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This flips one expected result to another order from what it was.
> (I experimented with OID-descending sort, but that flips two
> other expected results.)

I think that that's fine, provided that the order is consistent going forward.

> If no objections, I'll push shortly.

Sounds good.

-- 
Peter Geoghegan