Thread: ERROR: found unexpected null value in index
Hi everyone,

Consider the following statement sequence:

    CREATE TABLE t0(c0 TEXT);
    INSERT INTO t0(c0) VALUES('b'), ('a');
    ANALYZE;
    INSERT INTO t0(c0) VALUES (NULL);
    UPDATE t0 SET c0 = 'a';
    CREATE INDEX i0 ON t0(c0);
    SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
    -- unexpected: ERROR: found unexpected null value in index "i0"

The SELECT can result in "ERROR: found unexpected null value in index "i0"". I could reproduce this error only when some actions on other databases are performed. The error is rather difficult to reproduce, and small changes to the above statements cause it to no longer be reproducible on my machine. I've attached a Java program that runs the above statement sequence on distinct databases using 32 threads, which, on my machine, reproduces the error instantly. It is also possible to reproduce the error using 2 threads, which takes multiple minutes on my machine.

My Postgres version is Ubuntu 11.4-1.pgdg19.04+1.

Best,
Manuel
On Tue, Jul 9, 2019 at 4:52 PM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> CREATE TABLE t0(c0 TEXT);
> INSERT INTO t0(c0) VALUES('b'), ('a');
> ANALYZE;
> INSERT INTO t0(c0) VALUES (NULL);
> UPDATE t0 SET c0 = 'a';
> CREATE INDEX i0 ON t0(c0);
> SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
> -- unexpected: ERROR: found unexpected null value in index "i0"
>
> The SELECT can result in "ERROR: found unexpected null value in index
> "i0"". I could reproduce this error only when some actions on other
> databases are performed. The error is rather difficult to reproduce,
> and small changes to the above statements cause it to no longer be
> reproducible on my machine.

The error comes from a point at which the planner accesses the index before execution proper, within get_actual_variable_range(). Perhaps commit 3ca930fc39c is to blame.

--
Peter Geoghegan
Manuel Rigger <rigger.manuel@gmail.com> writes:
> CREATE TABLE t0(c0 TEXT);
> INSERT INTO t0(c0) VALUES('b'), ('a');
> ANALYZE t0;
> INSERT INTO t0(c0) VALUES (NULL);
> UPDATE t0 SET c0 = 'a';
> CREATE INDEX i0 ON t0(c0);
> SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
> -- unexpected: ERROR: found unexpected null value in index "i0"

Nifty.  As before, the way to make this reproducible is to do it with another open transaction holding a snapshot, so that the NULL entry has to be reflected in the index.  (I'm not sure why the HOT-update exception doesn't apply here, but apparently it doesn't.)  That is, it's sufficient to set up one session with

    begin transaction isolation level serializable;
    select * from some_table;

and then run the above script in another session.

The error is coming from the planner's get_actual_variable_range:

    /* Shouldn't have got a null, but be careful */
    if (isnull[0])
        elog(ERROR, "found unexpected null value in index \"%s\"",
             RelationGetRelationName(indexRel));

and I think it's entirely within its rights to complain, because it set up the scan key to reject nulls.  In short, somebody seems to have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys, and they broke it in v11, because prior versions don't show this failure.

			regards, tom lane
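For convenience, the recipe above condensed into a two-session script — a sketch, in which other_table stands in for any pre-existing table the first session can read:

    -- session 1: hold an open snapshot
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT * FROM other_table;
    -- leave this transaction open

    -- session 2: the reported statement sequence
    CREATE TABLE t0(c0 TEXT);
    INSERT INTO t0(c0) VALUES ('b'), ('a');
    ANALYZE t0;
    INSERT INTO t0(c0) VALUES (NULL);
    UPDATE t0 SET c0 = 'a';
    CREATE INDEX i0 ON t0(c0);
    SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
    -- on affected builds: ERROR: found unexpected null value in index "i0"

The open snapshot in session 1 keeps the NULL heap tuple merely "recently dead" rather than vacuumable, which is what forces the broken HOT chain to stay visible to the planner's index probe.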
On Tue, Jul 9, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The error is coming from the planner's get_actual_variable_range:
>
>     /* Shouldn't have got a null, but be careful */
>     if (isnull[0])
>         elog(ERROR, "found unexpected null value in index \"%s\"",
>              RelationGetRelationName(indexRel));
>
> and I think it's entirely within its rights to complain, because it
> set up the scan key to reject nulls. In short, somebody seems to
> have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys,
> and they broke it in v11, because prior versions don't show this failure.

It's not obvious to me why that might be. I'll run a "git bisect" to track down the offending commit.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jul 9, 2019 at 6:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The error is coming from the planner's get_actual_variable_range:
>> and I think it's entirely within its rights to complain, because it
>> set up the scan key to reject nulls.  In short, somebody seems to
>> have broken btrees' processing of SK_ISNULL | SK_SEARCHNOTNULL scankeys,
>> and they broke it in v11, because prior versions don't show this failure.

> It's not obvious to me why that might be. I'll run a "git bisect" to
> track down the offending commit.

I just finished doing that, and indeed it fingers 3ca930fc3 as the first bad commit.  Seems like it must have exposed a pre-existing problem though?  Too tired to look further tonight.

			regards, tom lane
On Tue, Jul 9, 2019 at 6:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I just finished doing that, and indeed it fingers 3ca930fc3 as the
> first bad commit. Seems like it must have exposed a pre-existing
> problem though?

I think that the issue is related to a broken HOT chain -- the index doesn't even have any NULL key values, since the CREATE INDEX came after the INSERT that added a NULL value. However, it does have a tuple with the key value 'a' that points to the root of a HOT chain whose first value for the indexed attribute is NULL. The successor tuple's value for the indexed attribute is 'a', as expected (of course, this is a normal state that IndexBuildHeapScan()/heapam_index_build_range_scan() expect other code to deal with).

Back when get_actual_variable_range() used a dirty snapshot, it would have not seen any NULL value with this test case, because the root of the HOT chain would be considered recently dead.

--
Peter Geoghegan
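The chain state described here can be observed directly with the pageinspect extension — a sketch (requires superuser; exact line pointer numbers will vary). In t_infomask2, bit 0x4000 is HEAP_HOT_UPDATED and bit 0x8000 is HEAP_ONLY_TUPLE:

    CREATE EXTENSION IF NOT EXISTS pageinspect;

    SELECT lp, t_ctid,
           (t_infomask2 & 16384) <> 0 AS hot_updated,  -- HEAP_HOT_UPDATED (0x4000)
           (t_infomask2 & 32768) <> 0 AS heap_only     -- HEAP_ONLY_TUPLE (0x8000)
    FROM heap_page_items(get_raw_page('t0', 0));

The NULL tuple should show up as hot_updated, with its t_ctid pointing at a heap-only successor holding 'a'; the index entry for 'a' points at the root of that chain.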
Peter Geoghegan <pg@bowt.ie> writes:
> I think that the issue is related to a broken HOT chain -- the index
> doesn't even have any NULL key values, since the CREATE INDEX came
> after the INSERT that added a NULL value. However, it does have a
> tuple with the key value 'a' that points to the root of a HOT chain
> whose first value for the indexed attribute is NULL. The successor
> tuple's value for the indexed attribute is 'a', as expected (of
> course, this is a normal state that
> IndexBuildHeapScan()/heapam_index_build_range_scan() expect other code
> to deal with).

> Back when get_actual_variable_range() used a dirty snapshot, it would
> have not seen any NULL value with this test case, because the root of
> the HOT chain would be considered recently dead.

Hm.  So maybe we need to teach it to ignore tuples that are not the tips of their HOT chains?

			regards, tom lane
On Tue, Jul 9, 2019 at 7:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Back when get_actual_variable_range() used a dirty snapshot, it would
> > have not seen any NULL value with this test case, because the root of
> > the HOT chain would be considered recently dead.
>
> Hm. So maybe we need to teach it to ignore tuples that are not the tips
> of their HOT chains?

An approach like that was the first thing that I thought of, but I'll have to study the problem some more before coming up with a firm opinion.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jul 9, 2019 at 7:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm. So maybe we need to teach it to ignore tuples that are not the tips
>> of their HOT chains?

> An approach like that was the first thing that I thought of, but I'll
> have to study the problem some more before coming up with a firm
> opinion.

I experimented with the attached.  It solves the reported problem and passes check-world.  I made it just ignore any tuple for which HeapTupleHeaderIsHotUpdated is true.  It might seem like there's a risk of ignoring valid data, if the end tuple of a HOT chain is dead due to a transaction abort, but since HeapTupleHeaderIsHotUpdated checks for xmax validity I judge that the risk of that is small enough to be acceptable.

A bigger problem with this is that in the tableam world, this seems like a complete disaster modularity-wise.  I think it won't actually fail --- non-heap AMs are probably not returning BufferHeapTupleTableSlots, and even if they are, they shouldn't be marking tuples HOT_UPDATED unless that concept applies to them.  But it sure seems like this leaves get_actual_variable_range() knowing way more than it ought to about storage-level concerns.  Should we try to transpose some of this logic to below the AM API, and if so, what should that look like?

			regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d7e3f09..b51cbed 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5232,8 +5232,6 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 								   InvalidOid,	/* no reg proc for this */
 								   (Datum) 0);	/* constant */
 
-		have_data = true;
-
 		/* If min is requested ... */
 		if (min)
 		{
@@ -5262,15 +5260,33 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			 * get_actual_variable_range() call will not have to visit
 			 * that heap entry.  In this way we avoid repetitive work when
 			 * this function is used a lot during planning.
+			 *
+			 * However, all is not perfect, because if the index was
+			 * recently created it may have entries pointing to broken HOT
+			 * chains that contain tuples that don't match the index entry
+			 * but aren't yet known-dead either.  We need to ignore such
+			 * tuples, since they might not be representative of the index
+			 * extremal values, indeed could even contain NULLs.  We
+			 * approximate this by ignoring any hot-updated tuples we see
+			 * and continuing to scan.
 			 */
 			index_scan = index_beginscan(heapRel, indexRel,
 										 &SnapshotNonVacuumable,
 										 1, 0);
 			index_rescan(index_scan, scankeys, 1, NULL, 0);
 
-			/* Fetch first tuple in sortop's direction */
-			if (index_getnext_slot(index_scan, indexscandir, slot))
+			/* Fetch first/next tuple in sortop's direction */
+			while (index_getnext_slot(index_scan, indexscandir, slot))
 			{
+				/* Ignore hot-updated tuples, per comment above */
+				if (TTS_IS_BUFFERTUPLE(slot))
+				{
+					BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+
+					if (HeapTupleHeaderIsHotUpdated(bslot->base.tupdata.t_data))
+						continue;
+				}
+
 				/* Extract the index column values from the slot */
 				FormIndexDatum(indexInfo, slot, estate,
 							   values, isnull);
@@ -5284,24 +5300,40 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				MemoryContextSwitchTo(oldcontext);
 				*min = datumCopy(values[0], typByVal, typLen);
 				MemoryContextSwitchTo(tmpcontext);
+				have_data = true;
+				break;
 			}
-			else
-				have_data = false;
 
 			index_endscan(index_scan);
 		}
+		else
+		{
+			/* If min not requested, assume index has data */
+			have_data = true;
+		}
 
 		/* If max is requested, and we didn't find the index is empty */
 		if (max && have_data)
 		{
+			have_data = false;
+
 			index_scan = index_beginscan(heapRel, indexRel,
 										 &SnapshotNonVacuumable,
 										 1, 0);
 			index_rescan(index_scan, scankeys, 1, NULL, 0);
 
-			/* Fetch first tuple in reverse direction */
-			if (index_getnext_slot(index_scan, -indexscandir, slot))
+			/* Fetch first/next tuple in reverse direction */
+			while (index_getnext_slot(index_scan, -indexscandir, slot))
 			{
+				/* Ignore hot-updated tuples, per comment above */
+				if (TTS_IS_BUFFERTUPLE(slot))
+				{
+					BufferHeapTupleTableSlot *bslot = (BufferHeapTupleTableSlot *) slot;
+
+					if (HeapTupleHeaderIsHotUpdated(bslot->base.tupdata.t_data))
+						continue;
+				}
+
 				/* Extract the index column values from the slot */
 				FormIndexDatum(indexInfo, slot, estate,
 							   values, isnull);
@@ -5315,9 +5347,9 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				MemoryContextSwitchTo(oldcontext);
 				*max = datumCopy(values[0], typByVal, typLen);
 				MemoryContextSwitchTo(tmpcontext);
+				have_data = true;
+				break;
 			}
-			else
-				have_data = false;
 
 			index_endscan(index_scan);
 		}
Hi,

On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
> A bigger problem with this is that in the tableam world, this seems
> like a complete disaster modularity-wise.  I think it won't actually
> fail --- non-heap AMs are probably not returning
> BufferHeapTupleTableSlots, and even if they are, they shouldn't be
> marking tuples HOT_UPDATED unless that concept applies to them.
> But it sure seems like this leaves get_actual_variable_range() knowing
> way more than it ought to about storage-level concerns.

Yea, I think that's not pretty. I'm not concerned about wrong results due to BufferHeapTupleTableSlots being returned and misinterpreted, but layering-wise this seems not good. Even leaving tableam aside, it seems not nice to have this concern leak into selfuncs.c.

> Should we try to transpose some of this logic to below the AM API,
> and if so, what should that look like?

I wonder if we could push this down into a separate type of visibility (or maybe redefine NonVacuumable)? So that the scan wouldn't ever return them in the first place? It's a bit annoying to have a visibility type that's perhaps best called "what selfuncs.c needs", but that still seems cleaner than having a separate callback. And the explanation for why that's possibly needed also seems to better belong inside the AMs.

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
>> Should we try to transpose some of this logic to below the AM API,
>> and if so, what should that look like?

> I wonder if we could push this down into a separate type of visibility
> (or maybe redefine NonVacuumable)? So that the scan wouldn't ever
> return them in the first place? It's a bit annoying to have a visibility
> type that's perhaps best called "what selfuncs.c needs", but that still
> seems cleaner than having a separate callback. And the explanation for
> why that's possibly needed also seems to better belong inside the AMs.

SnapshotNonVacuumable is already basically "what selfuncs.c needs" --- nothing else is using it.  So I wouldn't hesitate to redefine it if that'd fix the problem.  But it doesn't sound like a promising approach.  The problem here isn't whether the heap tuple is live or dead, it's whether it can be trusted to match the index entry.  Also, snapshot checking isn't really below the AM API anyway, is it?

I wonder if we'd be better off to switch over to using data directly from the index entry, rather than trying to recover it from the heap.  We'd then need to make sure that the logic exploits btree's "killed index entry" ability, so that dead extremal values are ignored as much as possible.  But that'd get us out of the broken-HOT-chain problem.

			regards, tom lane
Hi,

On July 10, 2019 1:25:49 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2019-07-10 16:02:50 -0400, Tom Lane wrote:
>>> Should we try to transpose some of this logic to below the AM API,
>>> and if so, what should that look like?
>
>> I wonder if we could push this down into a separate type of visibility
>> (or maybe redefine NonVacuumable)? So that the scan wouldn't ever
>> return them in the first place? It's a bit annoying to have a visibility
>> type that's perhaps best called "what selfuncs.c needs", but that still
>> seems cleaner than having a separate callback. And the explanation for
>> why that's possibly needed also seems to better belong inside the AMs.
>
> SnapshotNonVacuumable is already basically "what selfuncs.c needs" ---
> nothing else is using it.  So I wouldn't hesitate to redefine it if
> that'd fix the problem.  But it doesn't sound like a promising approach.
> The problem here isn't whether the heap tuple is live or dead, it's
> whether it can be trusted to match the index entry.

Well, if we defined the snapshot semantics as "return rows that are not vacuumable and are safe for selfuncs.c considerations", it seems like a visibility concern if you squint a tiny bit. It's mostly checking for certain kinds of dead rows, after all.

> Also, snapshot checking isn't really below the AM API anyway, is it?

It is, imo. Scan callbacks are expected to only return visible rows. And if a tuple in a slot needs to be checked for visibility later, that goes through a callback. I don't see how you could otherwise make different non-pg-heap storages work.

> I wonder if we'd be better off to switch over to using data directly
> from the index entry, rather than trying to recover it from the heap.
> We'd then need to make sure that the logic exploits btree's "killed
> index entry" ability, so that dead extremal values are ignored as
> much as possible.  But that'd get us out of the broken-HOT-chain
> problem.

Hm, that doesn't sound like a trivial change. Not at my computer for 20min, so can't check right now, but does the index entry always have all the information we need?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On July 10, 2019 1:25:49 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The problem here isn't whether the heap tuple is live or dead, it's
>> whether it can be trusted to match the index entry.

> Well, if we defined the snapshot semantics as "return rows that are not
> vacuumable and are safe for selfuncs.c considerations", it seems like a
> visibility concern if you squint a tiny bit. It's mostly checking for
> certain kinds of dead rows, after all.

Hm, maybe.  I was considering making it work like "apply SnapshotNonVacuumable if the tuple isn't HOT_UPDATED, but some stricter liveness check if it is".  It didn't seem like there was much to be gained that way given how HeapTupleHeaderIsHotUpdated works ... or, perhaps, you could say that HeapTupleHeaderIsHotUpdated is already applying a very strict liveness check.

>> Also, snapshot checking isn't really below the AM API anyway, is it?

> It is, imo.

Ah, I see HeapTupleSatisfiesVisibility has already been pushed down below that horizon.  So I agree, we might be able to finesse this by pushing the hot-update consideration into HeapTupleSatisfiesNonVacuumable, or making another snapshot_type that does what we need.

>> I wonder if we'd be better off to switch over to using data directly
>> from the index entry, rather than trying to recover it from the heap.

> Hm, that doesn't sound like a trivial change.

Not sure.  We know we're dealing with btree, and it can always do index-only scans, so it's certainly possible.  But the relevant APIs might not be conducive to getting what we want.

			regards, tom lane
On Wed, Jul 10, 2019 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder if we'd be better off to switch over to using data directly
> from the index entry, rather than trying to recover it from the heap.
> We'd then need to make sure that the logic exploits btree's "killed
> index entry" ability, so that dead extremal values are ignored as
> much as possible. But that'd get us out of the broken-HOT-chain
> problem.

I continue to have concerns about the effectiveness of the kill_prior_tuple mechanism on 9.5+:

https://www.postgresql.org/message-id/flat/CAH2-Wz=SfAKVMv1x9Jh19EJ8am8TZn9f-yECipS9HrrRqSswnA@mail.gmail.com#b20ead9675225f12b6a80e53e19eed9d

Maybe that problem has nothing to do with what you said, but I was reminded of the fact that it's far from clear how effective kill_prior_tuple actually is in the real world (i.e. with concurrency). I guess that your suggestion would make it even less likely that LP_DEAD hint bits would be set by get_actual_variable_range() scans, because there would be no opportunity to check the heap.

Wasn't one of the goals of commit 3ca930fc39c to make it more likely that extrema values would be killed by get_actual_variable_range() scans, for the benefit of future get_actual_variable_range() scans?

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 10, 2019 at 1:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder if we'd be better off to switch over to using data directly
>> from the index entry, rather than trying to recover it from the heap.

> Maybe that problem has nothing to do with what you said, but I was
> reminded of the fact that it's far from clear how effective
> kill_prior_tuple actually is in the real world (i.e. with
> concurrency). I guess that your suggestion would make it even less
> likely that LP_DEAD hint bits would be set by
> get_actual_variable_range() scans, because there would be no
> opportunity to check the heap.

I was imagining it would still check the heap, if necessary, to verify that it'd found a tuple passing the given snapshot.

> Wasn't one of the goals of commit
> 3ca930fc39c to make it more likely that extrema values would be killed
> by get_actual_variable_range() scans, for the benefit of future
> get_actual_variable_range() scans?

Yes, and my point was that we still need that effect in some form.  But once we've found that there's a tuple that's "live enough" (for some definition of that) we could pull the actual data from the index not heap.

			regards, tom lane
On Wed, Jul 10, 2019 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I was imagining it would still check the heap, if necessary, to verify
> that it'd found a tuple passing the given snapshot.

Not sure I follow. When or how would it not be necessary? Are you merely referring to the simple case where the LP_DEAD bit is already set for the item on the B-Tree leaf page?

> > Wasn't one of the goals of commit
> > 3ca930fc39c to make it more likely that extrema values would be killed
> > by get_actual_variable_range() scans, for the benefit of future
> > get_actual_variable_range() scans?
>
> Yes, and my point was that we still need that effect in some form. But
> once we've found that there's a tuple that's "live enough" (for some
> definition of that) we could pull the actual data from the index not heap.

I think I follow -- that would do the right thing, without requiring selfuncs.c to know about HOT, or some similar mechanism in an alternative table AM.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Wed, Jul 10, 2019 at 3:26 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I was imagining it would still check the heap, if necessary, to verify
>> that it'd found a tuple passing the given snapshot.

> Not sure I follow. When or how would it not be necessary? Are you
> merely referring to the simple case where the LP_DEAD bit is already
> set for the item on the B-Tree leaf page?

Index-only scans already have the LP_DEAD fast path (don't return value) and the all_visible fast path (always return value), and otherwise they do a heap visit.  If we can use a custom visibility test in the heap visit and not muck up the opportunity to set LP_DEAD when relevant, then it seems like using the IOS code path will do exactly what we want.  Otherwise some finagling might be necessary.  But it still might be cleaner than directly looking at HOT-update status.

I'll take a look at that tomorrow if nobody beats me to it.

			regards, tom lane
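At the SQL level, the extremal probe described here behaves much like a LIMIT 1 index-only scan. A rough analogue against the test case's index — a sketch, with the plan shape indicative rather than exact output:

    SET enable_seqscan = off;
    EXPLAIN (COSTS OFF)
    SELECT c0 FROM t0 WHERE c0 IS NOT NULL ORDER BY c0 LIMIT 1;
    --  Limit
    --    ->  Index Only Scan using i0 on t0
    --          Index Cond: (c0 IS NOT NULL)

The IS NOT NULL condition mirrors the SK_SEARCHNOTNULL scankey that get_actual_variable_range() sets up internally.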
I wrote:
> Index-only scans already have the LP_DEAD fast path (don't return value)
> and the all_visible fast path (always return value), and otherwise they
> do a heap visit.  If we can use a custom visibility test in the heap
> visit and not muck up the opportunity to set LP_DEAD when relevant, then
> it seems like using the IOS code path will do exactly what we want.
> Otherwise some finagling might be necessary.  But it still might be
> cleaner than directly looking at HOT-update status.

> I'll take a look at that tomorrow if nobody beats me to it.

As far as I can tell, no special finagling is needed: if we just use the regular index-only-scan logic then this all works the way we want, and it's actually better than before because we get to skip heap visits altogether when dealing with unchanging data.  Attached is a patch against HEAD that seems to do all the right things.

I'm a little dissatisfied with the fact that I had to duplicate the all-visible checking logic out of nodeIndexonlyscan.c.  Maybe we should think about refactoring to avoid multiple copies of that?  But that's probably a task for a separate patch, if it's practical at all.

			regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d7e3f09..f6859b1 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -107,6 +107,7 @@
 #include "access/sysattr.h"
 #include "access/table.h"
 #include "access/tableam.h"
+#include "access/visibilitymap.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
 #include "catalog/pg_collation.h"
@@ -127,6 +128,7 @@
 #include "parser/parse_clause.h"
 #include "parser/parsetree.h"
 #include "statistics/statistics.h"
+#include "storage/bufmgr.h"
 #include "utils/builtins.h"
 #include "utils/date.h"
 #include "utils/datum.h"
@@ -198,6 +200,15 @@
 static bool get_actual_variable_range(PlannerInfo *root,
 									  VariableStatData *vardata,
 									  Oid sortop,
 									  Datum *min, Datum *max);
+static bool get_actual_variable_endpoint(Relation heapRel,
+										 Relation indexRel,
+										 ScanDirection indexscandir,
+										 ScanKey scankeys,
+										 int16 typLen,
+										 bool typByVal,
+										 TupleTableSlot *tableslot,
+										 MemoryContext outercontext,
+										 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
@@ -5186,25 +5197,18 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 	{
 		EState	   *estate;
 		ExprContext *econtext;
-		MemoryContext tmpcontext;
 		MemoryContext oldcontext;
 		Relation	heapRel;
 		Relation	indexRel;
-		IndexInfo  *indexInfo;
 		TupleTableSlot *slot;
 		int16		typLen;
 		bool		typByVal;
 		ScanKeyData scankeys[1];
-		IndexScanDesc index_scan;
-		Datum		values[INDEX_MAX_KEYS];
-		bool		isnull[INDEX_MAX_KEYS];
-		SnapshotData SnapshotNonVacuumable;
 
 		estate = CreateExecutorState();
 		econtext = GetPerTupleExprContext(estate);
 		/* Make sure any cruft is generated in the econtext's memory */
-		tmpcontext = econtext->ecxt_per_tuple_memory;
-		oldcontext = MemoryContextSwitchTo(tmpcontext);
+		oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
 
 		/*
 		 * Open the table and index so we can read from them.  We should
@@ -5213,14 +5217,9 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 		heapRel = table_open(rte->relid, NoLock);
 		indexRel = index_open(index->indexoid, NoLock);
 
-		/* extract index key information from the index's pg_index info */
-		indexInfo = BuildIndexInfo(indexRel);
-
-		/* some other stuff */
+		/* some other stuff needed for execution */
 		slot = table_slot_create(heapRel, NULL);
-		econtext->ecxt_scantuple = slot;
 		get_typlenbyval(vardata->atttype, &typLen, &typByVal);
-		InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
 
 		/* set up an IS NOT NULL scan key so that we ignore nulls */
 		ScanKeyEntryInitialize(&scankeys[0],
@@ -5232,94 +5231,38 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 							   InvalidOid,	/* no reg proc for this */
 							   (Datum) 0);	/* constant */
 
-		have_data = true;
-
 		/* If min is requested ... */
 		if (min)
 		{
-			/*
-			 * In principle, we should scan the index with our current
-			 * active snapshot, which is the best approximation we've got
-			 * to what the query will see when executed.  But that won't
-			 * be exact if a new snap is taken before running the query,
-			 * and it can be very expensive if a lot of recently-dead or
-			 * uncommitted rows exist at the beginning or end of the index
-			 * (because we'll laboriously fetch each one and reject it).
-			 * Instead, we use SnapshotNonVacuumable.  That will accept
-			 * recently-dead and uncommitted rows as well as normal
-			 * visible rows.  On the other hand, it will reject known-dead
-			 * rows, and thus not give a bogus answer when the extreme
-			 * value has been deleted (unless the deletion was quite
-			 * recent); that case motivates not using SnapshotAny here.
-			 *
-			 * A crucial point here is that SnapshotNonVacuumable, with
-			 * RecentGlobalXmin as horizon, yields the inverse of the
-			 * condition that the indexscan will use to decide that index
-			 * entries are killable (see heap_hot_search_buffer()).
-			 * Therefore, if the snapshot rejects a tuple and we have to
-			 * continue scanning past it, we know that the indexscan will
-			 * mark that index entry killed.  That means that the next
-			 * get_actual_variable_range() call will not have to visit
-			 * that heap entry.  In this way we avoid repetitive work when
-			 * this function is used a lot during planning.
-			 */
-			index_scan = index_beginscan(heapRel, indexRel,
-										 &SnapshotNonVacuumable,
-										 1, 0);
-			index_rescan(index_scan, scankeys, 1, NULL, 0);
-
-			/* Fetch first tuple in sortop's direction */
-			if (index_getnext_slot(index_scan, indexscandir, slot))
-			{
-				/* Extract the index column values from the slot */
-				FormIndexDatum(indexInfo, slot, estate,
-							   values, isnull);
-
-				/* Shouldn't have got a null, but be careful */
-				if (isnull[0])
-					elog(ERROR, "found unexpected null value in index \"%s\"",
-						 RelationGetRelationName(indexRel));
-
-				/* Copy the index column value out to caller's context */
-				MemoryContextSwitchTo(oldcontext);
-				*min = datumCopy(values[0], typByVal, typLen);
-				MemoryContextSwitchTo(tmpcontext);
-			}
-			else
-				have_data = false;
-
-			index_endscan(index_scan);
+			have_data = get_actual_variable_endpoint(heapRel,
+													 indexRel,
+													 indexscandir,
+													 scankeys,
+													 typLen,
+													 typByVal,
+													 slot,
+													 oldcontext,
+													 min);
+		}
+		else
+		{
+			/* If min not requested, assume index is nonempty */
+			have_data = true;
 		}
 
 		/* If max is requested, and we didn't find the index is empty */
 		if (max && have_data)
 		{
-			index_scan = index_beginscan(heapRel, indexRel,
-										 &SnapshotNonVacuumable,
-										 1, 0);
-			index_rescan(index_scan, scankeys, 1, NULL, 0);
-
-			/* Fetch first tuple in reverse direction */
-			if (index_getnext_slot(index_scan, -indexscandir, slot))
-			{
-				/* Extract the index column values from the slot */
-				FormIndexDatum(indexInfo, slot, estate,
-							   values, isnull);
-
-				/* Shouldn't have got a null, but be careful */
-				if (isnull[0])
-					elog(ERROR, "found unexpected null value in index \"%s\"",
-						 RelationGetRelationName(indexRel));
-
-				/* Copy the index column value out to caller's context */
-				MemoryContextSwitchTo(oldcontext);
-				*max = datumCopy(values[0], typByVal, typLen);
-				MemoryContextSwitchTo(tmpcontext);
-			}
-			else
-				have_data = false;
-
-			index_endscan(index_scan);
+			/* scan in the opposite direction; all else is the same */
+			have_data = get_actual_variable_endpoint(heapRel,
													 indexRel,
													 -indexscandir,
													 scankeys,
													 typLen,
													 typByVal,
													 slot,
													 oldcontext,
													 max);
 		}
 
 		/* Clean everything up */
@@ -5340,6 +5283,139 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 }
 
 /*
+ * Get one endpoint datum (min or max depending on indexscandir) from the
+ * specified index.  Return true if successful, false if index is empty.
+ * On success, endpoint value is stored to *endpointDatum (and copied into
+ * outercontext).
+ *
+ * scankeys is a 1-element scankey array set up to reject nulls.
+ * typLen/typByVal describe the datatype of the index's first column.
+ * tableslot is a slot suitable to hold table tuples, in case we need
+ * to probe the heap.
+ * (We could compute these values locally, but that would mean computing them
+ * twice when get_actual_variable_range needs both the min and the max.)
+ */
+static bool
+get_actual_variable_endpoint(Relation heapRel,
+							 Relation indexRel,
+							 ScanDirection indexscandir,
+							 ScanKey scankeys,
+							 int16 typLen,
+							 bool typByVal,
+							 TupleTableSlot *tableslot,
+							 MemoryContext outercontext,
+							 Datum *endpointDatum)
+{
+	bool		have_data = false;
+	SnapshotData SnapshotNonVacuumable;
+	IndexScanDesc index_scan;
+	Buffer		vmbuffer = InvalidBuffer;
+	ItemPointer tid;
+	Datum		values[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	MemoryContext oldcontext;
+
+	/*
+	 * We use the index-only-scan machinery for this.  With mostly-static
+	 * tables that's a win because it avoids a heap visit.  It's also a win
+	 * for dynamic data, but the reason is less obvious; read on for details.
+	 *
+	 * In principle, we should scan the index with our current active
+	 * snapshot, which is the best approximation we've got to what the query
+	 * will see when executed.  But that won't be exact if a new snap is taken
+	 * before running the query, and it can be very expensive if a lot of
+	 * recently-dead or uncommitted rows exist at the beginning or end of the
+	 * index (because we'll laboriously fetch each one and reject it).
+	 * Instead, we use SnapshotNonVacuumable.  That will accept recently-dead
+	 * and uncommitted rows as well as normal visible rows.  On the other
+	 * hand, it will reject known-dead rows, and thus not give a bogus answer
+	 * when the extreme value has been deleted (unless the deletion was quite
+	 * recent); that case motivates not using SnapshotAny here.
+	 *
+	 * A crucial point here is that SnapshotNonVacuumable, with
+	 * RecentGlobalXmin as horizon, yields the inverse of the condition that
+	 * the indexscan will use to decide that index entries are killable (see
+	 * heap_hot_search_buffer()).  Therefore, if the snapshot rejects a tuple
+	 * and we have to continue scanning past it, we know that the indexscan
+	 * will mark that index entry killed.  That means that the next
+	 * get_actual_variable_endpoint() call will not have to visit that heap
+	 * entry.  In this way we avoid repetitive work when this function is used
+	 * a lot during planning.
+	 *
+	 * But using SnapshotNonVacuumable creates a hazard of its own.  In a
+	 * recently-created index, some index entries may point at "broken" HOT
+	 * chains in which not all the tuple versions contain data matching the
+	 * index entry.  The live tuple version(s) certainly do match the index,
+	 * but SnapshotNonVacuumable can accept tuple versions that don't.  Hence,
+	 * if we took data from the selected heap tuple, we might get a bogus
+	 * answer that's not close to the index extremal value, or could even be
+	 * NULL.  We avoid this hazard because we take the data from the index
+	 * entry not the heap.
+	 */
+	InitNonVacuumableSnapshot(SnapshotNonVacuumable, RecentGlobalXmin);
+
+	index_scan = index_beginscan(heapRel, indexRel,
+								 &SnapshotNonVacuumable,
+								 1, 0);
+	/* Set it up for index-only scan */
+	index_scan->xs_want_itup = true;
+	index_rescan(index_scan, scankeys, 1, NULL, 0);
+
+	/* Fetch first/next tuple in specified direction */
+	while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
+	{
+		if (!VM_ALL_VISIBLE(heapRel,
+							ItemPointerGetBlockNumber(tid),
+							&vmbuffer))
+		{
+			/* Rats, we have to visit the heap to check visibility */
+			if (!index_fetch_heap(index_scan, tableslot))
+				continue;		/* no visible tuple, try next index entry */
+
+			/* We don't actually need the heap tuple for anything */
+			ExecClearTuple(tableslot);
+
+			/*
+			 * We don't care whether there's more than one visible tuple in
+			 * the HOT chain; if any are visible, that's good enough.
+			 */
+		}
+
+		/*
+		 * We expect that btree will return data in IndexTuple not HeapTuple
+		 * format.  It's not lossy either.
+		 */
+		if (!index_scan->xs_itup)
+			elog(ERROR, "no data returned for index-only scan");
+		if (index_scan->xs_recheck)
+			elog(ERROR, "unexpected recheck indication from btree");
+
+		/* OK to deform the index tuple */
+		index_deform_tuple(index_scan->xs_itup,
+						   index_scan->xs_itupdesc,
+						   values, isnull);
+
+		/* Shouldn't have got a null, but be careful */
+		if (isnull[0])
+			elog(ERROR, "found unexpected null value in index \"%s\"",
+				 RelationGetRelationName(indexRel));
+
+		/* Copy the index column value out to caller's context */
+		oldcontext = MemoryContextSwitchTo(outercontext);
+		*endpointDatum = datumCopy(values[0], typByVal, typLen);
+		MemoryContextSwitchTo(oldcontext);
+		have_data = true;
+		break;
+	}
+
+	if (vmbuffer != InvalidBuffer)
+		ReleaseBuffer(vmbuffer);
+	index_endscan(index_scan);
+
+	return have_data;
+}
+
+/*
  * find_join_input_rel
  *	  Look up the input relation for a join.
  *
On Thu, Jul 11, 2019 at 2:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As far as I can tell, no special finagling is needed: if we just use
> the regular index-only-scan logic then this all works the way we want,
> and it's actually better than before because we get to skip heap visits
> altogether when dealing with unchanging data. Attached is a patch
> against HEAD that seems to do all the right things.

Interesting approach. I certainly prefer it to the alternative approach of framing the problem as a visibility concern.

> I'm a little dissatisfied with the fact that I had to duplicate the
> all-visible checking logic out of nodeIndexonlyscan.c. Maybe we should
> think about refactoring to avoid multiple copies of that? But that's
> probably a task for a separate patch, if it's practical at all.

I suspect that you'd end up writing more code than you'd save if you generalized what we already have. Not clear that that's worth it.

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> On Thu, Jul 11, 2019 at 2:20 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As far as I can tell, no special finagling is needed: if we just use
>> the regular index-only-scan logic then this all works the way we want,
>> and it's actually better than before because we get to skip heap visits
>> altogether when dealing with unchanging data. Attached is a patch
>> against HEAD that seems to do all the right things.

> Interesting approach. I certainly prefer it to the alternative
> approach of framing the problem as a visibility concern.

Yes, I certainly like this better than my previous attempt.

I still feel like we're going to want to push (most of?) this logic below the tableam API at some point, because its implementation was and remains tied to heap+btree.  But other table AMs are likely to support ordered scan accesses of some sort, and they're going to want to be able to adjust the planner's extremal-value estimates too.  It's a job for later though.

			regards, tom lane
I wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
>> Interesting approach. I certainly prefer it to the alternative
>> approach of framing the problem as a visibility concern.

> Yes, I certainly like this better than my previous attempt.

Re-reading the patch, I realized that it wasn't using the ExecutorState infrastructure anymore, except for a short-lived memory context.  So we can get an additional small savings by dropping the executor dependency and just making a temp context for ourselves.

Pushed with that improvement.

			regards, tom lane
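With the fix applied, the original test case's final query should plan and run without the error; assuming the t0/i0 setup from the report (where all three rows end up as 'a' after the UPDATE):

    SELECT * FROM t0 WHERE 'baaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' > t0.c0;
    -- expected: the three 'a' rows, and no "found unexpected null value" error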
Thanks for fixing this!

Best,
Manuel

On Fri, Jul 12, 2019 at 10:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Peter Geoghegan <pg@bowt.ie> writes:
> >> Interesting approach. I certainly prefer it to the alternative
> >> approach of framing the problem as a visibility concern.
>
> > Yes, I certainly like this better than my previous attempt.
>
> Re-reading the patch, I realized that it wasn't using the ExecutorState
> infrastructure anymore, except for a short-lived memory context. So we
> can get an additional small savings by dropping the executor dependency
> and just making a temp context for ourselves.
>
> Pushed with that improvement.
>
> regards, tom lane
Hi,

On 2019-07-11 17:20:07 -0400, Tom Lane wrote:
> I'm a little dissatisfied with the fact that I had to duplicate the
> all-visible checking logic out of nodeIndexonlyscan.c. Maybe we should
> think about refactoring to avoid multiple copies of that? But that's
> probably a task for a separate patch, if it's practical at all.

It's maybe worthwhile to note that having the visibilitymap logic in nodeIndexonlyscan.c is one of the larger remaining limiting issues from the tableam work (called out in [1]). We're going to have to move the VM accesses into the AM at some point not that far away - but it's not entirely trivial to do so without creating additional VM buffer lookups / pins.

Greetings,

Andres Freund