Re: new heapcheck contrib module - Mailing list pgsql-hackers
| From | Peter Geoghegan |
|---|---|
| Subject | Re: new heapcheck contrib module |
| Date | |
| Msg-id | CAH2-Wzk7ZcJH5=cmewOy4Z9m5b7MoFAPF=OCeORknKvPrYGMRA@mail.gmail.com |
| In response to | Re: new heapcheck contrib module (Robert Haas <robertmhaas@gmail.com>) |
| List | pgsql-hackers |
On Thu, May 14, 2020 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I have seen that, I believe. I think it's more common to fail with
> errors about not being able to palloc >1GB, not being able to look up
> an xid or mxid, etc. but I am pretty sure I've seen multiple cases
> involving seg faults, too. Unfortunately for my credibility, I can't
> remember the details right now.

I believe you, both in general, and also because what you're saying
here is plausible, even if it doesn't fit my own experience.

Corruption is by its very nature exceptional. At least, if that isn't
true then something must be seriously wrong, so the idea that it will
be different in some way each time seems like a good working
assumption. Your exceptional cases are not necessarily the same as
mine, especially where hardware problems are concerned.

On the other hand, it's also possible for corruption that originates
from very different sources to exhibit the same basic inconsistencies
and symptoms. I've noticed that SLRU corruption is often a leading
indicator of general storage problems. Inconsistencies between certain
SLRU state and the heap happen to be far easier to notice in practice,
particularly when VACUUM runs. But they're not fundamentally different
from inconsistencies among pages within a single main fork of some
heap relation.

> > I personally don't recall seeing that. If it happened, the segfaults
> > themselves probably wouldn't be the main concern.
>
> I don't really agree. Hypothetically speaking, suppose you corrupt
> your only copy of a critical table in such a way that every time you
> select from it, the system seg faults. A user in this situation might
> ask questions like:

I agree that that could be a problem. But that's not what I've seen
happen in production systems myself.

Maybe there is some low hanging fruit here. Perhaps we can make the
real PageGetItemId() a little closer to PageGetItemIdCareful() without
noticeable overhead, as I suggested already. Are there any real
generalizations that we can make about why backends segfault with
corrupt data? Maybe there are. That seems important.

> Slightly off-topic here, but I think our error reporting in this area
> is pretty lame. I've learned over the years that when a customer
> reports that they get a complaint about a too-large memory allocation
> every time they access a table, they've probably got a corrupted
> varlena header.

I certainly learned the same lesson in the same way.

> However, that's extremely non-obvious to a typical
> user. We should try to report errors indicative of corruption in a way
> that gives the user some clue that corruption has happened. Peter made
> a stab at improving things there by adding
> errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of
> users will never see the error code, only the message, and a lot of
> corruption still produces errors that weren't changed by that commit.

The theory is that "can't happen" errors have an errcode that should
be considered similar or equivalent to ERRCODE_DATA_CORRUPTED. I doubt
that it works out that way in practice, though.

--
Peter Geoghegan
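For readers unfamiliar with the PageGetItemIdCareful() idea mentioned above, the following is a minimal sketch of the kind of line-pointer sanity checks that approach implies, before a backend dereferences an ItemId from a possibly corrupt page. The function name `page_get_item_id_checked` and the exact set of checks are hypothetical; this is not the code in contrib/amcheck or the server's real PageGetItemId().

```c
/*
 * Hypothetical sketch: validate a line pointer before trusting it,
 * reporting corruption explicitly instead of risking a segfault or a
 * wild pointer dereference later on.
 */
#include "postgres.h"

#include "storage/bufpage.h"
#include "storage/itemid.h"
#include "storage/off.h"

static ItemId
page_get_item_id_checked(Page page, OffsetNumber offnum)
{
	ItemId		itemid;

	/* Reject offsets beyond the page's own line pointer array. */
	if (offnum == InvalidOffsetNumber ||
		offnum > PageGetMaxOffsetNumber(page))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("line pointer offset %u out of range", offnum)));

	itemid = PageGetItemId(page, offnum);

	/* A line pointer with storage must point inside the page, past the header. */
	if (ItemIdHasStorage(itemid) &&
		(ItemIdGetOffset(itemid) < SizeOfPageHeaderData ||
		 ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) > BLCKSZ))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("corrupt line pointer at offset %u", offnum)));

	return itemid;
}
```

The open question in the message above is whether checks like these can be added to the hot path without measurable overhead.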
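The "too-large memory allocation" symptom Robert describes comes from the fact that a varlena header stores the datum's own length: if the header is garbage, the length can exceed palloc's MaxAllocSize cap (a bit under 1GB), and the user only sees the generic allocation-size error. Below is a hedged sketch, not the server's actual detoasting code, of how such a length could instead be reported as corruption; the helper name `copy_varlena_checked` and the exact sanity test are hypothetical, and the sketch assumes an uncompressed, inline 4-byte-header datum.

```c
/*
 * Hypothetical sketch: a corrupted varlena header yields an implausible
 * length; report it as data corruption rather than letting palloc fail
 * with "invalid memory alloc request size".
 */
#include "postgres.h"

#include "utils/memutils.h"		/* MaxAllocSize */

static struct varlena *
copy_varlena_checked(struct varlena *attr)
{
	Size		len = VARSIZE(attr);	/* length read from the header itself */
	struct varlena *result;

	/* Lengths below the header size or above MaxAllocSize cannot be valid. */
	if (len < VARHDRSZ || len > MaxAllocSize)
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("implausible varlena length %zu suggests data corruption",
						(size_t) len)));

	result = (struct varlena *) palloc(len);
	memcpy(result, attr, len);
	return result;
}
```

This is the kind of message-level hint the thread argues users need, since most of them will never look at the SQLSTATE behind an error.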