Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
Date |
Msg-id | 20180829201648.cqxw6ft547csnhrc@alap3.anarazel.de
In response to | Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: buildfarm: could not read block 3 in file "base/16384/2662": read only 0 of 8192 bytes
List | pgsql-hackers |
Hi,

On 2018-08-29 14:00:12 -0400, Tom Lane wrote:
> A couple thoughts after reading and reflecting for awhile:

Thanks. This definitely is too complicated for a single brain :(

> 1. I don't much like the pending_rebuilds list, mainly because of this
> consideration: what happens if we hit an OOM error trying to add an entry
> to that list? As you've drafted the patch, we don't even mark the
> relevant relcache entry rd_invalid before that fails, so that's surely
> bad. Now, I'm not sure how bulletproof relcache inval is in general
> with respect to OOM failures, but let's not add more hazards.

Obviously you're right and we first need to mark the entry as invalid -
and I think that should also mostly [1] address the OOM issue. If we
error out, there's no need to eagerly rebuild the entry during
invalidation - subsequent RelationIdGetRelation() calls will trigger the
necessary rebuild.

But it sounds like your concern might be larger? I don't quite see how
we could get by without such a list. I mean, I think it'd be better if
it were an array, for efficiency, but that seems like a fairly minor
point?

[1] I'm kinda doubtful that's entirely bulletproof in case an OOM, or
similar, error is caught with a savepoint, and then execution continues
outside. That in turn might access the relcache entry without a
RelationIdGetRelation() call. But that doesn't seem new.
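Just to spell out the ordering I mean for 1), a minimal sketch - note
that RelationMarkPendingRebuild() is a made-up name and this isn't the
actual patch code, just the shape of it, with pending_rebuilds standing
in for the patch's list of relations awaiting rebuild:

```c
#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/memutils.h"
#include "utils/rel.h"

/* placeholder for the patch's list of relations awaiting rebuild */
static List *pending_rebuilds = NIL;

static void
RelationMarkPendingRebuild(Relation relation)
{
	MemoryContext oldcxt;

	/*
	 * Mark the entry invalid first - this cannot fail, so even if the
	 * allocation below hits OOM, the next RelationIdGetRelation() call
	 * will still rebuild the entry.
	 */
	relation->rd_isvalid = false;

	/* Only now grow the pending list; an error here is harmless. */
	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
	pending_rebuilds = lappend_oid(pending_rebuilds,
								   RelationGetRelid(relation));
	MemoryContextSwitchTo(oldcxt);
}
```

(Whether the pending list stores OIDs or pointers doesn't matter for the
point here - it's the ordering of the two steps that does.)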
> 2. I think we may need to address the same order-of-operations hazards
> as RelationCacheInvalidate() worries about. Alternatively, maybe we
> could simplify that function by making it use the same
> delayed-revalidation logic as we're going to develop for this.

Hm, just to make sure I understand correctly: You're thinking about
having to rebuild various nailed entries in a specific order [2]?

I'm not quite sure I really see a problem here - in contrast to the
current RelationCacheInvalidate() logic, what I'm proposing marks the
entries as invalid in the first phase, so any later access is guaranteed
to rebuild the entry as necessary.

[2] Hm, I'm right now a bit confused as to why "Other nailed relations
go to the front of rebuildList" is a good idea. That means they might be
built while ClassOidIndexId isn't yet rebuilt? Practically I don't think
it matters, because the lookup for RelationRelationId will rebuild it,
but I don't quite understand what that logic is precisely getting at.

> 3. I don't at all like the ProcessPendingRelcacheRebuilds call you added
> to ProcessInvalidationMessages. That's effectively assuming that the
> "func" *must* be LocalExecuteInvalidationMessage and not anything else;
> likewise, the lack of such a call inside ProcessInvalidationMessagesMulti
> presumes that that one is never called to actually execute invalidations.
> (While those are true statements, it's a horrible violation of modularity
> for these two functions to know it.) Probably better to put this into the
> callers, which will know what the actual semantics are.

Yea, I mostly hacked this together quickly to have a proposal on the
table; I'm not happy with this as it stands.

> 4. The call added to the middle of ReceiveSharedInvalidMessages doesn't
> seem all that safe either; the problem is its relationship to the
> "catchup" processing. We are caught up at the moment we exit the loop,
> but are we sure we still would be after doing assorted work for relcache
> rebuild? Swapping the order of the two steps might help, but then we
> have to consider what happens if we error out from SICleanupQueue.

I'm not sure I understand the problem here. If there's new invalidation
processing inside the rebuilds, that'd mean there was another
ReceiveSharedInvalidMessages, which then would have triggered the "new"
ReceiveSharedInvalidMessages to then process the new pending rebuilds.
That's why I made ProcessPendingRelcacheRebuilds() unlink the queue it
processes first. I think locking should prevent entries that are
currently being accessed during a rebuild from being re-invalidated.
Did I misunderstand?

As the comment in ProcessPendingRelcacheRebuilds() notes, I'm concerned
about what happens if we error out after unlinking the current pending
items. But I think, if we handle the memory leak with a PG_CATCH, that
should be ok, because after the error new accesses should trigger the
rebuilds.

> (In general, the hard part of all this stuff is being sure that sane
> things happen if you error out part way through ...)

Yea :/

Greetings,

Andres Freund