Thread: ERROR: found unexpected null value in index

ERROR: found unexpected null value in index

From
Manuel Rigger
Date:
Hi everyone,

Consider the following statement sequence:

CREATE TABLE t0(c0 TEXT);
INSERT INTO t0(c0) VALUES('b'), ('a');
ANALYZE;
INSERT INTO t0(c0) VALUES (NULL);
UPDATE t0 SET c0 = 'a';
CREATE INDEX i0 ON t0(c0);
SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
> t0.c0; -- unexpected: ERROR: found unexpected null value in index
"i0"

The SELECT can result in "ERROR: found unexpected null value in index
"i0"". I could reproduce this error only when some actions on other
databases are performed. The error is rather difficult to reproduce,
and small changes to the above statements cause it to no longer be
reproducible on my machine.

I've attached a Java program that runs the above statement sequence on
distinct databases using 32 threads, which, on my machine, reproduces
the error instantly. It is also possible to reproduce the error using
2 threads, which takes multiple minutes on my machine.

My Postgres version is Ubuntu 11.4-1.pgdg19.04+1.

Best,
Manuel

Attachment

Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Tue, Jul 9, 2019 at 4:52 PM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> CREATE TABLE t0(c0 TEXT);
> INSERT INTO t0(c0) VALUES('b'), ('a');
> ANALYZE;
> INSERT INTO t0(c0) VALUES (NULL);
> UPDATE t0 SET c0 = 'a';
> CREATE INDEX i0 ON t0(c0);
> SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
> > t0.c0; -- unexpected: ERROR: found unexpected null value in index
> "i0"
>
> The SELECT can result in "ERROR: found unexpected null value in index
> "i0"". I could reproduce this error only when some actions on other
> databases are performed. The error is rather difficult to reproduce,
> and small changes to the above statements cause it to no longer be
> reproducible on my machine.

The error comes from a point at which the planner accesses the index
before execution proper, within get_actual_variable_range(). Perhaps
commit 3ca930fc39c is to blame.

-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Manuel Rigger <rigger.manuel@gmail.com> writes:
> CREATE TABLE t0(c0 TEXT);
> INSERT INTO t0(c0) VALUES('b'), ('a');
> ANALYZE t0;
> INSERT INTO t0(c0) VALUES (NULL);
> UPDATE t0 SET c0 = 'a';
> CREATE INDEX i0 ON t0(c0);
> SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
> -- unexpected: ERROR: found unexpected null value in index "i0"

Nifty.  As before, the way to make this reproducible is to do it with
another open transaction holding a snapshot, so that the NULL entry
has to be reflected in the index.  (I'm not sure why the HOT-update
exception doesn't apply here, but apparently it doesn't.)  That is,
it's sufficient to set up one session with

begin transaction isolation level serializable;
select * from some_table;

and then run the above script in another session.

The error is coming from the planner's get_actual_variable_range:

                    /* Shouldn't have got a null, but be careful */
                    if (isnull[0])
                        elog(ERROR, "found unexpected null value in index \"%s\"",
                             RelationGetRelationName(indexRel));

and I think it's entirely within its rights to complain, because it
set up the scan key to reject nulls.  In short, somebody seems to
have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys,
and they broke it in v11, because prior versions don't show this failure.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Tue, Jul 9, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The error is coming from the planner's get_actual_variable_range:
>
>                     /* Shouldn't have got a null, but be careful */
>                     if (isnull[0])
>                         elog(ERROR, "found unexpected null value in index \"%s\"",
>                              RelationGetRelationName(indexRel));
>
> and I think it's entirely within its rights to complain, because it
> set up the scan key to reject nulls.  In short, somebody seems to
> have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys,
> and they broke it in v11, because prior versions don't show this failure.

It's not obvious to me why that might be. I'll run a "git bisect" to
track down the offending commit.

-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jul 9, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The error is coming from the planner's get_actual_variable_range:
>> and I think it's entirely within its rights to complain, because it
>> set up the scan key to reject nulls.  In short, somebody seems to
>> have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys,
>> and they broke it in v11, because prior versions don't show this failure.

> It's not obvious to me why that might be. I'll run a "git bisect" to
> track down the offending commit.

I just finished doing that, and indeed it fingers 3ca930fc3 as the
first bad commit.  Seems like it must have exposed a pre-existing
problem though?  Too tired to look further tonight.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Tue, Jul 9, 2019 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just finished doing that, and indeed it fingers 3ca930fc3 as the
> first bad commit.  Seems like it must have exposed a pre-existing
> problem though?

I think that the issue is related to a broken HOT chain -- the index
doesn't even have any NULL key values, since the CREATE INDEX came
after the INSERT that added a NULL value. However, it does have a
tuple with the key value 'a' that points to the root of a HOT chain
whose first value for the indexed attribute is NULL. The successor
tuple's value for the indexed attribute is 'a', as expected (of
course, this is a normal state that
IndexBuildHeapScan()/heapam_index_build_range_scan() expect other code
to deal with).

Back when get_actual_variable_range() used a dirty snapshot, it would
have not seen any NULL value with this test case, because the root of
the HOT chain would be considered recently dead.

-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I think that the issue is related to a broken HOT chain -- the index
> doesn't even have any NULL key values, since the CREATE INDEX came
> after the INSERT that added a NULL value. However, it does have a
> tuple with the key value 'a' that points to the root of a HOT chain
> whose first value for the indexed attribute is NULL. The successor
> tuple's value for the indexed attribute is 'a', as expected (of
> course, this is a normal state that
> IndexBuildHeapScan()/heapam_index_build_range_scan() expect other code
> to deal with).

> Back when get_actual_variable_range() used a dirty snapshot, it would
> have not seen any NULL value with this test case, because the root of
> the HOT chain would be considered recently dead.

Hm.  So maybe we need to teach it to ignore tuples that are not the tips
of their HOT chains?

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Tue, Jul 9, 2019 at 7:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Back when get_actual_variable_range() used a dirty snapshot, it would
> > have not seen any NULL value with this test case, because the root of
> > the HOT chain would be considered recently dead.
>
> Hm.  So maybe we need to teach it to ignore tuples that are not the tips
> of their HOT chains?

An approach like that was the first thing that I thought of, but I'll
have to study the problem some more before coming up with a firm
opinion.


-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jul 9, 2019 at 7:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm.  So maybe we need to teach it to ignore tuples that are not the tips
>> of their HOT chains?

> An approach like that was the first thing that I thought of, but I'll
> have to study the problem some more before coming up with a firm
> opinion.

I experimented with the attached.  It solves the reported problem and
passes check-world.  I made it just ignore any tuple for which
HeapTupleHeaderIsHotUpdated is true.  It might seem like there's a risk
of ignoring valid data, if the end tuple of a HOT chain is dead due to
a transaction abort, but since HeapTupleHeaderIsHotUpdated checks for
xmax validity I judge that the risk of that is small enough to be
acceptable.

A bigger problem with this is that in the tableam world, this seems
like a complete disaster modularity-wise.  I think it won't actually
fail --- non-heap AMs are probably not returning
BufferHeapTupleTableSlots, and even if they are, they shouldn't be
marking tuples HOT_UPDATED unless that concept applies to them.
But it sure seems like this leaves get_actual_variable_range() knowing
way more than it ought to about storage-level concerns.

Should we try to transpose some of this logic to below the AM API,
and if so, what should that look like?

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d7e3f09..b51cbed 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5232,8 +5232,6 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
                                    InvalidOid,    /* no reg proc for this */
                                    (Datum) 0);    /* constant */

-            have_data = true;
-
             /* If min is requested ... */
             if (min)
             {
@@ -5262,15 +5260,33 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
                  * get_actual_variable_range() call will not have to visit
                  * that heap entry.  In this way we avoid repetitive work when
                  * this function is used a lot during planning.
+                 *
+                 * However, all is not perfect, because if the index was
+                 * recently created it may have entries pointing to broken HOT
+                 * chains that contain tuples that don't match the index entry
+                 * but aren't yet known-dead either.  We need to ignore such
+                 * tuples, since they might not be representative of the index
+                 * extremal values, indeed could even contain NULLs.  We
+                 * approximate this by ignoring any hot-updated tuples we see
+                 * and continuing to scan.
                  */
                 index_scan = index_beginscan(heapRel, indexRel,
                                              &SnapshotNonVacuumable,
                                              1, 0);
                 index_rescan(index_scan, scankeys, 1, NULL, 0);

-                /* Fetch first tuple in sortop's direction */
-                if (index_getnext_slot(index_scan, indexscandir, slot))
+                /* Fetch first/next tuple in sortop's direction */
+                while (index_getnext_slot(index_scan, indexscandir, slot))
                 {
+                    /* Ignore hot-updated tuples, per comment above */
+                    if (TTS_IS_BUFFERTUPLE(slot))
+                    {
+                        BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+
+                        if (HeapTupleHeaderIsHotUpdated(bslot->base.tupdata.t_data))
+                            continue;
+                    }
+
                     /* Extract the index column values from the slot */
                     FormIndexDatum(indexInfo, slot, estate,
                                    values, isnull);
@@ -5284,24 +5300,40 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
                     MemoryContextSwitchTo(oldcontext);
                     *min = datumCopy(values[0], typByVal, typLen);
                     MemoryContextSwitchTo(tmpcontext);
+                    have_data = true;
+                    break;
                 }
-                else
-                    have_data = false;

                 index_endscan(index_scan);
             }
+            else
+            {
+                /* If min not requested, assume index has data */
+                have_data = true;
+            }

             /* If max is requested, and we didn't find the index is empty */
             if (max && have_data)
             {
+                have_data = false;
+
                 index_scan = index_beginscan(heapRel, indexRel,
                                              &SnapshotNonVacuumable,
                                              1, 0);
                 index_rescan(index_scan, scankeys, 1, NULL, 0);

-                /* Fetch first tuple in reverse direction */
-                if (index_getnext_slot(index_scan, -indexscandir, slot))
+                /* Fetch first/next tuple in reverse direction */
+                while (index_getnext_slot(index_scan, -indexscandir, slot))
                 {
+                    /* Ignore hot-updated tuples, per comment above */
+                    if (TTS_IS_BUFFERTUPLE(slot))
+                    {
+                        BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+
+                        if (HeapTupleHeaderIsHotUpdated(bslot->base.tupdata.t_data))
+                            continue;
+                    }
+
                     /* Extract the index column values from the slot */
                     FormIndexDatum(indexInfo, slot, estate,
                                    values, isnull);
@@ -5315,9 +5347,9 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
                     MemoryContextSwitchTo(oldcontext);
                     *max = datumCopy(values[0], typByVal, typLen);
                     MemoryContextSwitchTo(tmpcontext);
+                    have_data = true;
+                    break;
                 }
-                else
-                    have_data = false;

                 index_endscan(index_scan);
             }

Re: ERROR: found unexpected null value in index

From
Andres Freund
Date:
Hi,

On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
> A bigger problem with this is that in the tableam world, this seems
> like a complete disaster modularity-wise.  I think it won't actually
> fail --- non-heap AMs are probably not returning
> BufferHeapTupleTableSlots, and even if they are, they shouldn't be
> marking tuples HOT_UPDATED unless that concept applies to them.
> But it sure seems like this leaves get_actual_variable_range() knowing
> way more than it ought to about storage-level concerns.

Yea, I think that's not pretty. I'm not concerned about wrong results
due to BufferHeapTupleTableSlots being returned, and misinterpreted, but
layering wise this seems not good. Even leaving tableam aside, it seems
not nice to have this concern leak into selfuncs.c


> Should we try to transpose some of this logic to below the AM API,
> and if so, what should that look like?

I wonder if we could push this down into a separate type of visibility
(or maybe redefine NonVacuumable)?  So that the scan wouldn't ever
return them in the first place? It's a bit annoying to have a visibility
type that's perhaps best called "what selfuncs.c needs", but still seams
cleaner than having a separate callback? And the explanation for why
that's possibly needed also seems to better belong inside the AMs.

- Andres



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
>> Should we try to transpose some of this logic to below the AM API,
>> and if so, what should that look like?

> I wonder if we could push this down into a separate type of visibility
> (or maybe redefine NonVacuumable)?  So that the scan wouldn't ever
> return them in the first place? It's a bit annoying to have a visibility
> type that's perhaps best called "what selfuncs.c needs", but still seams
> cleaner than having a separate callback? And the explanation for why
> that's possibly needed also seems to better belong inside the AMs.

SnapshotNonVacuumable is already basically "what selfuncs.c needs" ---
nothing else is using it.  So I wouldn't hesitate to redefine it if
that'd fix the problem.  But it doesn't sound like a promising approach.
The problem here isn't whether the heap tuple is live or dead, it's
whether it can be trusted to match the index entry.  Also, snapshot
checking isn't really below the AM API anyway is it?

I wonder if we'd be better off to switch over to using data directly
from the index entry, rather than trying to recover it from the heap.
We'd then need to make sure that the logic exploits btree's "killed
index entry" ability, so that dead extremal values are ignored as
much as possible.  But that'd get us out of the broken-HOT-chain
problem.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Andres Freund
Date:
Hi,

On July 10, 2019 1:25:49 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
>>> Should we try to transpose some of this logic to below the AM API,
>>> and if so, what should that look like?
>
>> I wonder if we could push this down into a separate type of
>visibility
>> (or maybe redefine NonVacuumable)?  So that the scan wouldn't ever
>> return them in the first place? It's a bit annoying to have a
>visibility
>> type that's perhaps best called "what selfuncs.c needs", but still
>seams
>> cleaner than having a separate callback? And the explanation for why
>> that's possibly needed also seems to better belong inside the AMs.
>
>SnapshotNonVacuumable is already basically "what selfuncs.c needs" ---
>nothing else is using it.  So I wouldn't hesitate to redefine it if
>that'd fix the problem.  But it doesn't sound like a promising
>approach.
>The problem here isn't whether the heap tuple is live or dead, it's
>whether it can be trusted to match the index entry.

Well, if we defined i the snapshot semantics as "return rows that are not vacuumable and safe for selfuncs.c
considerations"it seems like a visibility concern if you squint a tiny bit. It's a mostly checking for certain kinds of
deadrows, after all. 


> Also, snapshot
>checking isn't really below the AM API anyway is it?

It is, imo. Scans callbacks are expected to only return visible rows. And if a tuple in a slot needs to be checked for
visibilitylater, that goes through a callback. Don't see how you could otherwise make different non-pgheap storages
work.


>I wonder if we'd be better off to switch over to using data directly
>from the index entry, rather than trying to recover it from the heap.
>We'd then need to make sure that the logic exploits btree's "killed
>index entry" ability, so that dead extremal values are ignored as
>much as possible.  But that'd get us out of the broken-HOT-chain
>problem.

Hm, that doesn't sound like a trivial change. Not at my computer for 20min, so can't check rn, but does the index entry
alwayshave all the information we need? 

Andres

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On July 10, 2019 1:25:49 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here isn't whether the heap tuple is live or dead, it's
>> whether it can be trusted to match the index entry.

> Well, if we defined i the snapshot semantics as "return rows that are not vacuumable and safe for selfuncs.c
considerations"it seems like a visibility concern if you squint a tiny bit. It's a mostly checking for certain kinds of
deadrows, after all. 

Hm, maybe.  I was considering making it work like "apply
SnapshotNonVacuumable if the tuple isn't HOT_UPDATED, but some stricter
liveness check if it is".  It didn't seem like there was much to be gained
that way given how HeapTupleHeaderIsHotUpdated works ... or, perhaps,
you could say that HeapTupleHeaderIsHotUpdated is already applying a very
strict liveness check.

>> Also, snapshot
>> checking isn't really below the AM API anyway is it?

> It is, imo.

Ah, I see HeapTupleSatisfiesVisibility has already been pushed down
below that horizon.  So I agree, we might be able to finesse this by
pushing the hot-update consideration into HeapTupleSatisfiesNonVacuumable,
or making another snapshot_type that does what we need.

>> I wonder if we'd be better off to switch over to using data directly
>> from the index entry, rather than trying to recover it from the heap.

> Hm, that doesn't sound like a trivial change.

Not sure.  We know we're dealing with btree, and it can always do
index-only scans, so it's certainly possible.  But the relevant
APIs might not be conducive to getting what we want.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Wed, Jul 10, 2019 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder if we'd be better off to switch over to using data directly
> from the index entry, rather than trying to recover it from the heap.
> We'd then need to make sure that the logic exploits btree's "killed
> index entry" ability, so that dead extremal values are ignored as
> much as possible.  But that'd get us out of the broken-HOT-chain
> problem.

I continue to have concerns about the effectiveness of the
kill_prior_tuple mechanism on 9.5+:


https://www.postgresql.org/message-id/flat/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d

Maybe that problem has nothing to do with what you said, but I was
reminded of the fact that it's far from clear how effective
kill_prior_tuple actually is in the real world (i.e. with
concurrency). I guess that your suggestion would make it even less
likely that LP_DEAD hint bits would be set by
get_actual_variable_range() scans, because there would be no
opportunity to check the heap. Wasn't one of the goals of commit
3ca930fc39c to make it more likely that extrema values would be killed
by get_actual_variable_range() scans, for the benefit of future
get_actual_variable_range() scans?

--
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 10, 2019 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if we'd be better off to switch over to using data directly
>> from the index entry, rather than trying to recover it from the heap.

> Maybe that problem has nothing to do with what you said, but I was
> reminded of the fact that it's far from clear how effective
> kill_prior_tuple actually is in the real world (i.e. with
> concurrency). I guess that your suggestion would make it even less
> likely that LP_DEAD hint bits would be set by
> get_actual_variable_range() scans, because there would be no
> opportunity to check the heap.

I was imagining it would still check the heap, if necessary, to verify
that it'd found a tuple passing the given snapshot.

> Wasn't one of the goals of commit
> 3ca930fc39c to make it more likely that extrema values would be killed
> by get_actual_variable_range() scans, for the benefit of future
> get_actual_variable_range() scans?

Yes, and my point was that we still need that effect in some form.  But
once we've found that there's a tuple that's "live enough" (for some
definition of that) we could pull the actual data from the index not heap.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Wed, Jul 10, 2019 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was imagining it would still check the heap, if necessary, to verify
> that it'd found a tuple passing the given snapshot.

Not sure I follow. When or how would it not be necessary? Are you
merely referring to the simple case where the LP_DEAD bit is already
set for the item on the B-Tree leaf page?

> > Wasn't one of the goals of commit
> > 3ca930fc39c to make it more likely that extrema values would be killed
> > by get_actual_variable_range() scans, for the benefit of future
> > get_actual_variable_range() scans?
>
> Yes, and my point was that we still need that effect in some form.  But
> once we've found that there's a tuple that's "live enough" (for some
> definition of that) we could pull the actual data from the index not heap.

I think I follow -- that would do the right thing, without requiring
selfuncs.c to know about HOT, or some similar mechanism in an
alternative table AM.

-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 10, 2019 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I was imagining it would still check the heap, if necessary, to verify
>> that it'd found a tuple passing the given snapshot.

> Not sure I follow. When or how would it not be necessary? Are you
> merely referring to the simple case where the LP_DEAD bit is already
> set for the item on the B-Tree leaf page?

Index-only scans already have the LP_DEAD fast path (don't return value)
and the all_visible fast path (always return value), and otherwise they
do a heap visit.  If we can use a custom visibility test in the heap
visit and not muck up the opportunity to set LP_DEAD when relevant, then
it seems like using the IOS code path will do exactly what we want.
Otherwise some finagling might be necessary.  But it still might be
cleaner than directly looking at HOT-update status.

I'll take a look at that tomorrow if nobody beats me to it.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
I wrote:
> Index-only scans already have the LP_DEAD fast path (don't return value)
> and the all_visible fast path (always return value), and otherwise they
> do a heap visit.  If we can use a custom visibility test in the heap
> visit and not muck up the opportunity to set LP_DEAD when relevant, then
> it seems like using the IOS code path will do exactly what we want.
> Otherwise some finagling might be necessary.  But it still might be
> cleaner than directly looking at HOT-update status.

> I'll take a look at that tomorrow if nobody beats me to it.

As far as I can tell, no special finagling is needed: if we just use
the regular index-only-scan logic then this all works the way we want,
and it's actually better than before because we get to skip heap visits
altogether when dealing with unchanging data.  Attached is a patch
against HEAD that seems to do all the right things.

I'm a little dissatisfied with the fact that I had to duplicate the
all-visible checking logic out of nodeIndexonlyscan.c.  Maybe we should
think about refactoring to avoid multiple copies of that?  But that's
probably a task for a separate patch, if it's practical at all.

            regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d7e3f09..f6859b1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -107,6 +107,7 @@
 #include "access/sysattr.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/visibilitymap.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "catalog/pg_collation.h"
@@ -127,6 +128,7 @@
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
 #include "statistics/statistics.h"
+#include "storage/bufmgr.h"
 #include "utils/builtins.h"
 #include "utils/date.h"
 #include "utils/datum.h"
@@ -198,6 +200,15 @@ static bool get_actual_variable_range(PlannerInfo *root,
                                       VariableStatData *vardata,
                                       Oid sortop,
                                       Datum *min, Datum *max);
+static bool get_actual_variable_endpoint(Relation heapRel,
+                                         Relation indexRel,
+                                         ScanDirection indexscandir,
+                                         ScanKey scankeys,
+                                         int16 typLen,
+                                         bool typByVal,
+                                         TupleTableSlot *tableslot,
+                                         MemoryContext outercontext,
+                                         Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);


@@ -5186,25 +5197,18 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
         {
             EState       *estate;
             ExprContext *econtext;
-            MemoryContext tmpcontext;
             MemoryContext oldcontext;
             Relation    heapRel;
             Relation    indexRel;
-            IndexInfo  *indexInfo;
             TupleTableSlot *slot;
             int16        typLen;
             bool        typByVal;
             ScanKeyData scankeys[1];
-            IndexScanDesc index_scan;
-            Datum        values[INDEX_MAX_KEYS];
-            bool        isnull[INDEX_MAX_KEYS];
-            SnapshotData SnapshotNonVacuumable;

             estate = CreateExecutorState();
             econtext = GetPerTupleExprContext(estate);
             /* Make sure any cruft is generated in the econtext's memory */
-            tmpcontext = econtext->ecxt_per_tuple_memory;
-            oldcontext = MemoryContextSwitchTo(tmpcontext);
+            oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);

             /*
              * Open the table and index so we can read from them.  We should
@@ -5213,14 +5217,9 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
             heapRel = table_open(rte->relid, NoLock);
             indexRel = index_open(index->indexoid, NoLock);

-            /* extract index key information from the index's pg_index info */
-            indexInfo = BuildIndexInfo(indexRel);
-
-            /* some other stuff */
+            /* some other stuff needed for execution */
             slot = table_slot_create(heapRel, NULL);
-            econtext->ecxt_scantuple = slot;
             get_typlenbyval(vardata->atttype, &typLen, &typByVal);
-            InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);

             /* set up an IS NOT NULL scan key so that we ignore nulls */
             ScanKeyEntryInitialize(&scankeys[0],
@@ -5232,94 +5231,38 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
                                    InvalidOid,    /* no reg proc for this */
                                    (Datum) 0);    /* constant */

-            have_data = true;
-
             /* If min is requested ... */
             if (min)
             {
-                /*
-                 * In principle, we should scan the index with our current
-                 * active snapshot, which is the best approximation we've got
-                 * to what the query will see when executed.  But that won't
-                 * be exact if a new snap is taken before running the query,
-                 * and it can be very expensive if a lot of recently-dead or
-                 * uncommitted rows exist at the beginning or end of the index
-                 * (because we'll laboriously fetch each one and reject it).
-                 * Instead, we use SnapshotNonVacuumable.  That will accept
-                 * recently-dead and uncommitted rows as well as normal
-                 * visible rows.  On the other hand, it will reject known-dead
-                 * rows, and thus not give a bogus answer when the extreme
-                 * value has been deleted (unless the deletion was quite
-                 * recent); that case motivates not using SnapshotAny here.
-                 *
-                 * A crucial point here is that SnapshotNonVacuumable, with
-                 * RecentGlobalXmin as horizon, yields the inverse of the
-                 * condition that the indexscan will use to decide that index
-                 * entries are killable (see heap_hot_search_buffer()).
-                 * Therefore, if the snapshot rejects a tuple and we have to
-                 * continue scanning past it, we know that the indexscan will
-                 * mark that index entry killed.  That means that the next
-                 * get_actual_variable_range() call will not have to visit
-                 * that heap entry.  In this way we avoid repetitive work when
-                 * this function is used a lot during planning.
-                 */
-                index_scan = index_beginscan(heapRel, indexRel,
-                                             &SnapshotNonVacuumable,
-                                             1, 0);
-                index_rescan(index_scan, scankeys, 1, NULL, 0);
-
-                /* Fetch first tuple in sortop's direction */
-                if (index_getnext_slot(index_scan, indexscandir, slot))
-                {
-                    /* Extract the index column values from the slot */
-                    FormIndexDatum(indexInfo, slot, estate,
-                                   values, isnull);
-
-                    /* Shouldn't have got a null, but be careful */
-                    if (isnull[0])
-                        elog(ERROR, "found unexpected null value in index \"%s\"",
-                             RelationGetRelationName(indexRel));
-
-                    /* Copy the index column value out to caller's context */
-                    MemoryContextSwitchTo(oldcontext);
-                    *min = datumCopy(values[0], typByVal, typLen);
-                    MemoryContextSwitchTo(tmpcontext);
-                }
-                else
-                    have_data = false;
-
-                index_endscan(index_scan);
+                have_data = get_actual_variable_endpoint(heapRel,
+                                                         indexRel,
+                                                         indexscandir,
+                                                         scankeys,
+                                                         typLen,
+                                                         typByVal,
+                                                         slot,
+                                                         oldcontext,
+                                                         min);
+            }
+            else
+            {
+                /* If min not requested, assume index is nonempty */
+                have_data = true;
             }

             /* If max is requested, and we didn't find the index is empty */
             if (max && have_data)
             {
-                index_scan = index_beginscan(heapRel, indexRel,
-                                             &SnapshotNonVacuumable,
-                                             1, 0);
-                index_rescan(index_scan, scankeys, 1, NULL, 0);
-
-                /* Fetch first tuple in reverse direction */
-                if (index_getnext_slot(index_scan, -indexscandir, slot))
-                {
-                    /* Extract the index column values from the slot */
-                    FormIndexDatum(indexInfo, slot, estate,
-                                   values, isnull);
-
-                    /* Shouldn't have got a null, but be careful */
-                    if (isnull[0])
-                        elog(ERROR, "found unexpected null value in index \"%s\"",
-                             RelationGetRelationName(indexRel));
-
-                    /* Copy the index column value out to caller's context */
-                    MemoryContextSwitchTo(oldcontext);
-                    *max = datumCopy(values[0], typByVal, typLen);
-                    MemoryContextSwitchTo(tmpcontext);
-                }
-                else
-                    have_data = false;
-
-                index_endscan(index_scan);
+                /* scan in the opposite direction; all else is the same */
+                have_data = get_actual_variable_endpoint(heapRel,
+                                                         indexRel,
+                                                         -indexscandir,
+                                                         scankeys,
+                                                         typLen,
+                                                         typByVal,
+                                                         slot,
+                                                         oldcontext,
+                                                         max);
             }

             /* Clean everything up */
@@ -5340,6 +5283,139 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 }

 /*
+ * Get one endpoint datum (min or max depending on indexscandir) from the
+ * specified index.  Return true if successful, false if index is empty.
+ * On success, endpoint value is stored to *endpointDatum (and copied into
+ * outercontext).
+ *
+ * scankeys is a 1-element scankey array set up to reject nulls.
+ * typLen/typByVal describe the datatype of the index's first column.
+ * tableslot is a slot suitable to hold table tuples, in case we need
+ * to probe the heap.
+ * (We could compute these values locally, but that would mean computing them
+ * twice when get_actual_variable_range needs both the min and the max.)
+ */
+static bool
+get_actual_variable_endpoint(Relation heapRel,
+                             Relation indexRel,
+                             ScanDirection indexscandir,
+                             ScanKey scankeys,
+                             int16 typLen,
+                             bool typByVal,
+                             TupleTableSlot *tableslot,
+                             MemoryContext outercontext,
+                             Datum *endpointDatum)
+{
+    bool        have_data = false;
+    SnapshotData SnapshotNonVacuumable;
+    IndexScanDesc index_scan;
+    Buffer        vmbuffer = InvalidBuffer;
+    ItemPointer tid;
+    Datum        values[INDEX_MAX_KEYS];
+    bool        isnull[INDEX_MAX_KEYS];
+    MemoryContext oldcontext;
+
+    /*
+     * We use the index-only-scan machinery for this.  With mostly-static
+     * tables that's a win because it avoids a heap visit.  It's also a win
+     * for dynamic data, but the reason is less obvious; read on for details.
+     *
+     * In principle, we should scan the index with our current active
+     * snapshot, which is the best approximation we've got to what the query
+     * will see when executed.  But that won't be exact if a new snap is taken
+     * before running the query, and it can be very expensive if a lot of
+     * recently-dead or uncommitted rows exist at the beginning or end of the
+     * index (because we'll laboriously fetch each one and reject it).
+     * Instead, we use SnapshotNonVacuumable.  That will accept recently-dead
+     * and uncommitted rows as well as normal visible rows.  On the other
+     * hand, it will reject known-dead rows, and thus not give a bogus answer
+     * when the extreme value has been deleted (unless the deletion was quite
+     * recent); that case motivates not using SnapshotAny here.
+     *
+     * A crucial point here is that SnapshotNonVacuumable, with
+     * RecentGlobalXmin as horizon, yields the inverse of the condition that
+     * the indexscan will use to decide that index entries are killable (see
+     * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
+     * and we have to continue scanning past it, we know that the indexscan
+     * will mark that index entry killed.  That means that the next
+     * get_actual_variable_endpoint() call will not have to visit that heap
+     * entry.  In this way we avoid repetitive work when this function is used
+     * a lot during planning.
+     *
+     * But using SnapshotNonVacuumable creates a hazard of its own.  In a
+     * recently-created index, some index entries may point at "broken" HOT
+     * chains in which not all the tuple versions contain data matching the
+     * index entry.  The live tuple version(s) certainly do match the index,
+     * but SnapshotNonVacuumable can accept tuple versions that don't.  Hence,
+     * if we took data from the selected heap tuple, we might get a bogus
+     * answer that's not close to the index extremal value, or could even be
+     * NULL.  We avoid this hazard because we take the data from the index
+     * entry not the heap.
+     */
+    InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+
+    index_scan = index_beginscan(heapRel, indexRel,
+                                 &SnapshotNonVacuumable,
+                                 1, 0);
+    /* Set it up for index-only scan */
+    index_scan->xs_want_itup = true;
+    index_rescan(index_scan, scankeys, 1, NULL, 0);
+
+    /* Fetch first/next tuple in specified direction */
+    while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+    {
+        if (!VM_ALL_VISIBLE(heapRel,
+                            ItemPointerGetBlockNumber(tid),
+                            &vmbuffer))
+        {
+            /* Rats, we have to visit the heap to check visibility */
+            if (!index_fetch_heap(index_scan, tableslot))
+                continue;        /* no visible tuple, try next index entry */
+
+            /* We don't actually need the heap tuple for anything */
+            ExecClearTuple(tableslot);
+
+            /*
+             * We don't care whether there's more than one visible tuple in
+             * the HOT chain; if any are visible, that's good enough.
+             */
+        }
+
+        /*
+         * We expect that btree will return data in IndexTuple not HeapTuple
+         * format.  It's not lossy either.
+         */
+        if (!index_scan->xs_itup)
+            elog(ERROR, "no data returned for index-only scan");
+        if (index_scan->xs_recheck)
+            elog(ERROR, "unexpected recheck indication from btree");
+
+        /* OK to deform the index tuple */
+        index_deform_tuple(index_scan->xs_itup,
+                           index_scan->xs_itupdesc,
+                           values, isnull);
+
+        /* Shouldn't have got a null, but be careful */
+        if (isnull[0])
+            elog(ERROR, "found unexpected null value in index \"%s\"",
+                 RelationGetRelationName(indexRel));
+
+        /* Copy the index column value out to caller's context */
+        oldcontext = MemoryContextSwitchTo(outercontext);
+        *endpointDatum = datumCopy(values[0], typByVal, typLen);
+        MemoryContextSwitchTo(oldcontext);
+        have_data = true;
+        break;
+    }
+
+    if (vmbuffer != InvalidBuffer)
+        ReleaseBuffer(vmbuffer);
+    index_endscan(index_scan);
+
+    return have_data;
+}
+
+/*
  * find_join_input_rel
  *        Look up the input relation for a join.
  *

Re: ERROR: found unexpected null value in index

From
Peter Geoghegan
Date:
On Thu, Jul 11, 2019 at 2:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As far as I can tell, no special finagling is needed: if we just use
> the regular index-only-scan logic then this all works the way we want,
> and it's actually better than before because we get to skip heap visits
> altogether when dealing with unchanging data.  Attached is a patch
> against HEAD that seems to do all the right things.

Interesting approach. I certainly prefer it to the alternative
approach of framing the problem as a visibility concern.

> I'm a little dissatisfied with the fact that I had to duplicate the
> all-visible checking logic out of nodeIndexonlyscan.c.  Maybe we should
> think about refactoring to avoid multiple copies of that?  But that's
> probably a task for a separate patch, if it's practical at all.

I suspect that you'd end up writing more code than you'd save if you
generalized what we already have. Not clear that that's worth it.

-- 
Peter Geoghegan



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jul 11, 2019 at 2:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As far as I can tell, no special finagling is needed: if we just use
>> the regular index-only-scan logic then this all works the way we want,
>> and it's actually better than before because we get to skip heap visits
>> altogether when dealing with unchanging data.  Attached is a patch
>> against HEAD that seems to do all the right things.

> Interesting approach. I certainly prefer it to the alternative
> approach of framing the problem as a visibility concern.

Yes, I certainly like this better than my previous attempt.

I still feel like we're going to want to push (most of?) this logic
below the tableam API at some point, because its implementation was
and remains tied to heap+btree.  But other table AMs are likely to
support ordered scan accesses of some sort, and they're going to
want to be able to adjust the planner's extremal-value estimates
too.  It's a job for later though.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Tom Lane
Date:
I wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> Interesting approach. I certainly prefer it to the alternative
>> approach of framing the problem as a visibility concern.

> Yes, I certainly like this better than my previous attempt.

Re-reading the patch, I realized that it wasn't using the ExecutorState
infrastructure anymore, except for a short-lived memory context.  So we
can get an additional small savings by dropping the executor dependency
and just making a temp context for ourselves.

Pushed with that improvement.

            regards, tom lane



Re: ERROR: found unexpected null value in index

From
Manuel Rigger
Date:
Thanks for fixing this!

Best,
Manuel

On Fri, Jul 12, 2019 at 10:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Peter Geoghegan <pg@bowt.ie> writes:
> >> Interesting approach. I certainly prefer it to the alternative
> >> approach of framing the problem as a visibility concern.
>
> > Yes, I certainly like this better than my previous attempt.
>
> Re-reading the patch, I realized that it wasn't using the ExecutorState
> infrastructure anymore, except for a short-lived memory context.  So we
> can get an additional small savings by dropping the executor dependency
> and just making a temp context for ourselves.
>
> Pushed with that improvement.
>
>                         regards, tom lane



Re: ERROR: found unexpected null value in index

From
Andres Freund
Date:
Hi,

On 2019-07-11 17:20:07 -0400, Tom Lane wrote:
> I'm a little dissatisfied with the fact that I had to duplicate the
> all-visible checking logic out of nodeIndexonlyscan.c.  Maybe we should
> think about refactoring to avoid multiple copies of that?  But that's
> probably a task for a separate patch, if it's practical at all.

It's maybe worthwhile to note that having the visibilitymap logic in
nodeIndexonlyscan.c. is one of the larger remaining limiting issues from
the tableam work (called out in [1]). We're going to have to move the VM
accesses into the AM at some point not that far away - but it's not
entirely trivial to do so without creating additional VM buffer lookups
/ pins.

Greetings,

Andres Freund