I wrote:
> Relation truncation throws away the page image in memory without ever
> writing it to disk. Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.
>
> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.
>
> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.

Just to throw out a possibly-crazy idea: maybe we could fix this by
PANIC'ing if truncation fails, so that we replay the row deletions from
WAL. Obviously this would be intolerable if the case were frequent,
but we've had only two such complaints in the last nine years, so maybe
it's tolerable. It seems more attractive than taking a large performance
hit on truncation speed in normal cases, anyway.

A gotcha to be concerned about is what happens if we replay from WAL,
come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
failure again, which is surely not unlikely. PANIC'ing again will not
do. I think we could probably handle that by having the replay code
path zero out all the pages it was unable to delete; as long as that
succeeds, we can call it good and move on.

Or maybe just do that in the mainline case too? That is, if ftruncate
fails, handle it by zeroing the undeletable pages and pressing on?
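
To make the zero-instead-of-truncate fallback concrete, here is a minimal sketch in standalone POSIX C. It is not backend code: truncate_or_zero, truncate_fn, and failing_truncate are hypothetical names invented for illustration, and BLCKSZ is simply assumed to be the default 8192. The truncate call is passed in as a function pointer only so that a failure can be simulated.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* assumed block size for this sketch */

/* Hypothetical hook type: same signature as ftruncate(). */
typedef int (*truncate_fn) (int fd, off_t length);

/* Test stub (hypothetical): a truncate that always fails with EIO. */
static int
failing_truncate(int fd, off_t length)
{
    (void) fd;
    (void) length;
    errno = EIO;
    return -1;
}

/*
 * Try to cut the file back to new_nblocks blocks.  If the truncate call
 * fails, overwrite blocks new_nblocks .. old_nblocks-1 with zeroes instead.
 *
 * Returns 0 if the file was truncated, 1 if the tail was zeroed instead,
 * and -1 if zeroing failed as well (the caller must then decide what to do).
 */
static int
truncate_or_zero(int fd, unsigned old_nblocks, unsigned new_nblocks,
                 truncate_fn do_truncate)
{
    static const char zeropage[BLCKSZ];     /* static storage: all zeroes */
    unsigned    blk;

    assert(new_nblocks <= old_nblocks);

    if (do_truncate(fd, (off_t) new_nblocks * BLCKSZ) == 0)
        return 0;

    for (blk = new_nblocks; blk < old_nblocks; blk++)
    {
        if (pwrite(fd, zeropage, BLCKSZ, (off_t) blk * BLCKSZ) != BLCKSZ)
            return -1;
    }
    return 1;
}
```

Calling truncate_or_zero(fd, 4, 1, failing_truncate) on a four-block file leaves block 0 alone and zeroes blocks 1 through 3; passing ftruncate itself as the last argument takes the normal path. Whether a real fix should also fsync or WAL-log the zeroed pages is not addressed here.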

> But in any case it's wrapped up in order-of-operations
> issues. I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

It'd be a good idea to redo that investigation before concluding this
issue is fixed, too. I was not thinking at the time that it'd be years
before anybody did anything, or I'd have made more notes.

regards, tom lane