Re: WIP: Covering + unique indexes. - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: WIP: Covering + unique indexes. |
Date | |
Msg-id | CAH2-WzmDjdu+=LTs8fuKn5tSWfyzzF0TMPOLa=0qnWf2=nDdhw@mail.gmail.com Whole thread Raw |
In response to | Re: WIP: Covering + unique indexes. (Alexander Korotkov <a.korotkov@postgrespro.ru>) |
Responses |
Re: WIP: Covering + unique indexes.
|
List | pgsql-hackers |
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote: >> So? GIN doesn't have the same legacy at all. The GIN posting lists >> *don't* have regular heap TID pointers at all. They started out >> without them, and still don't have them. > > > Yes, GIN never stored heap TID pointers in t_tid of index tuple. But GIN > assumes that heap TID pointer has at most 11 significant bits during > posting list encoding. I think that we should avoid assuming things, unless the cost of representing them is too high, which I don't think applies here. The more defensive general purpose code can be, the better. I will admit to being paranoid here. But experience suggests that paranoia is a good thing, if it isn't too expensive. Look at the thread on XFS + fsync() for an example of things being wrong for a very long time without anyone realizing, and despite the best efforts of many smart people. As far as anyone can tell, PostgreSQL on Linux + XFS is kinda, sorta broken, and has been forever. XFS was mature before ext4 was, and is a popular choice, and yet this is the first we're hearing about it being kind of broken. After many years. Look at this check that made it into my amcheck patch, that was committed yesterday: https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=contrib/amcheck/verify_nbtree.c;h=a15fe21933b9a5b8baefedaa8f38e517d6c91877;hb=7f563c09f8901f6acd72cb8fba7b1bd3cf3aca8e#l745 As it says, nbtree is surprisingly tolerant of corrupt lp_len fields. You may find it an interesting exercise to use pg_hexedit to corrupt many lp_len fields in an index page. What's really interesting about this is that it doesn't appear to break anything at all! We don't get the length from there in most cases, so reads won't break at all. I see that we use ItemIdGetLength() in a couple of rare cases (though even those could be avoided) during a page split. You'd be lucky to notice a problem if lp_len fields were regularly corrupt. When you notice, it will probably have already caused big problems. On a similar note, I've noticed that many of my experimental B-Tree patches (that I never find time to finish) tend to almost work quite early on, sometimes without my really understanding why. The whole L&Y approach of recovering from problems that were detected (detecting concurrent page splits, and moving right) makes the code *very* forgiving. I hope that I don't sound trite, but everyone should try to be modest about what they *don't* know when writing complex system software with concurrency. It is not a platitude, even though it probably seems that way. A tiny mistake can have big consequences, so it's very important that we have a way to easily detect them after the fact. > I don't think we should use assertions, because they are typically disabled > on > production PostgreSQL builds. But we can have some explicit check in some > common path. In the attached patch I've such check to _bt_compare(). > Probably, > together with amcheck, that would be sufficient. Good idea -- a "can't happen" check in _bt_compare seems better, which I see here: > diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c > index 51dca64e13..fcf9832147 100644 > --- a/src/backend/access/nbtree/nbtsearch.c > +++ b/src/backend/access/nbtree/nbtsearch.c > @@ -443,6 +443,17 @@ _bt_compare(Relation rel, > if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque)) > return 1; > > + /* > + * Check tuple has correct number of attributes. > + */ > + if (!_bt_check_natts(rel, page, offnum)) > + { > + ereport(ERROR, > + (errcode(ERRCODE_INTERNAL_ERROR), > + errmsg("tuple has wrong number of attributes in index \"%s\"", > + RelationGetRelationName(rel)))); > + } > + > itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); It seems like it might be a good idea to make this accept an IndexTuple, though, to possibly save some work. Also, perhaps this should be an unlikely() condition, if only because it makes the intent clearer (might actually matter in a tight loop like this too, though). Do you store an attribute number in the "minus infinity" item (the leftmost one of internal pages)? I guess that that should be zero, because it's totally truncated. > OK, thank for the explanation. I agree that check of offset is redundant > here. Cool. >> The fact is that that information can go out of date almost >> immediately, whereas high keys usually last forever. The only reason >> that there is a heap TID in the high key is because we'd have to add >> special code to remove it; not because it has any real value. I find >> it very hard to imagine it being used in a forensic situation. If you >> actually wanted to do this, the key itself is probably enough -- you >> probably wouldn't need the TID. > > > I don't know, When I wrote my own implementation of B-tree and debug > it, I found saving hikeys "as is" to be very valuable for debugging. I would like to see your implementation at some point. That sounds interesting. > However, B-trees in PostgreSQL are quite mature, and probably > don't need so much debug information. Today, the highkey at the leaf level is an exact copy of the right sibling's first item immediately after the split. The absence of a usable heap TID offset (due to using it for number of attributes in high keys) is unlikely to make it harder to locate that right sibling's first item (to get a full heap TID), which could have moved a lot further right after the split, or even have been removed entirely. It could now be ambiguous where it wouldn't have been before in the event of duplicates, but it's unlikely. And when it does happen, it's unlikely to matter. We can still include the heap block number, I suppose. I think of the highkey as only having one simple job -- separating the keyspace between siblings. We actually have a very neat choke point to check that it does that one job -- when a high key is generated for a page split at the leaf level. If we were doing generic suffix truncation, we'd add a test that made sure that the high key was strictly greater than the last item on the left, and strictly less than the first item on the right. As I said yesterday, I don't like how we allow a highkey to be equal to both sides of the split, which goes against L&Y, and I think that we would at least be strict about < and > for suffix truncation. The highkey's actual value can be invented, provided it does this one simple job, which needs to be assessed only once at our "neat choke point". Everything else falls into place afterwards, since that's where teh downlink actually comes from. You can check it during a leaf page split while debugging (that's the neat choke point). That's why the high key doesn't seem very interesting from a debuggability perspective. >> Nobody asked you to write a suffix truncation patch. That has >> complexity above and beyond what the covering index patch needs. I >> just expect it to be compatible with an eventual suffix truncation >> patch, which you've now shown is quite possible. It is clearly a >> complimentary technique. > > > OK, but change of on-disk tuple format also changes what people > see in pageinspect. Right now, they see "1" as offset for tuples in intenal > page and hikeys. After patch, they would see some large values > (assuming we set some of hi bits) in offset. I'm not sure it's OK. > We probably should change display of index tuples in pageinspect. This reminds me of a discussion I had with Robert Haas about pageinspect + t_infomask bits. Robert thought that we should show the composite bits as single constants, where we do that (with things like HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just show "the bits that are on the page", while also documenting that this situation exists in pageinspect directly. I think something similar here. I think it's okay to just show offset, provided it is documented. We have a number of odd things within nbtree that I actually saw to it were documented, such as the "minus infinity" item on internal pages, which looks odd and out of places. I remember Tatsuo Ishii asked about it before this happened. It seems helpful to show what's really there, and offer guidance on how to interpret it. I actually thought carefully about many things like this for pg_hexedit, which tries to be very consistent and logical, uses color to suggest meaning, and so on. Anyway, that's what I think about it, though I wouldn't really care if I lost that particular argument and we did something special with internal page offset in pageinspect. It seems like a matter of opinion, or aesthetics. > I'm sorry, I do not understand. New version of amcheck determines > the expected number of attributes and compares that to the numer of > attributes stored in the offset number. But I can get *expected* number of > attributes even wihtout storing them also in the offset number... Maybe I was confused. > I'd like to note that I really appreciate your attention to this patch > as well as other patches. Thanks. I would like to thank Anastasia and you for your patience and perseverance, despite what I see as mistakes in how this project was manged. I really want for it to be possible for there to be more patches in the nbtree code, because they're really needed. That was a big part of my motivation for writing amcheck, in fact. It's tedious to link this patch to a bigger picture about what we need to do with nbtree in the next 5 years, but I think that that's what it will take to get this patch in. That's my opinion. -- Peter Geoghegan
pgsql-hackers by date: