Re: making relfilenodes 56 bits - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: making relfilenodes 56 bits |
Date | |
Msg-id | 20220702192951.irfhpydyisgkphtj@alap3.anarazel.de Whole thread Raw |
In response to | Re: making relfilenodes 56 bits (Dilip Kumar <dilipbalaut@gmail.com>) |
Responses |
Re: making relfilenodes 56 bits
Re: making relfilenodes 56 bits |
List | pgsql-hackers |
Hi, On 2022-07-02 14:23:08 +0530, Dilip Kumar wrote: > > > + if (ShmemVariableCache->relnumbercount == 0) > > > + { > > > + XLogPutNextRelFileNumber(ShmemVariableCache->nextRelFileNumber + > > > + VAR_RFN_PREFETCH); > > > > I know this is just copied, but I find "XLogPut" as a prefix pretty unhelpful. > > Maybe we can change to LogNextRelFileNumber()? Much better. Hm. Now that I think about it, isn't the XlogFlush() in XLogPutNextRelFileNumber() problematic performance wise? Yes, we'll spread the cost across a number of GetNewRelFileNumber() calls, but still, an additional f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's zero fsyncs when creating a sequence or table inside a transaction (there are some for indexes, but there's patches to fix that). Not that I really see an obvious alternative. I guess we could try to invent a flush-log-before-write type logic for relfilenodes somehow? So that the first block actually written to a file needs to ensure the WAL record that created the relation is flushed. But getting that to work reliably seems nontrivial. One thing that would be good is to add an assertion to a few places ensuring that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere in the recovery path. Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH is 8192, but you chose 64 for VAR_RFN_PREFETCH? I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to relfilenode. > > What's the story behind moving relfilenode to the front? Alignment > > consideration? Seems odd to move the relfilenode before the oid. If there's an > > alignment issue, can't you just swap it with reltablespace or such to resolve > > it? > > Because of a test case added in this commit > (79b716cfb7a1be2a61ebb4418099db1258f35e30). Even I did not like to > move relfilenode before oid, but under this commit it is expected this > to aligned as well as before any NameData check this comments > > === > +-- > +-- Keep such columns before the first NameData column of the > +-- catalog, since packagers can override NAMEDATALEN to an odd number. > +-- > === This is embarassing. Trying to keep alignment match between C and catalog alignment on AIX, without actually making the system understand the alignment rules, is a remarkably shortsighted approach. I started a separate thread about it, since it's not really relevant to this thread: https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de Maybe we could at least make the field order to be something like oid, relam, relfilenode, relname that should be ok alignment wise, keep oid first, and seems to make sense from an "importance" POV? Can't really interpret later fields without knowing relam etc. > > > Now as part of the previous patch set we have made relfilenode > > > 56 bit wider and removed the risk of wraparound so now we don't > > > need to wait till the next checkpoint for removing the unused > > > relation file and we can clean them up on commit. > > > > Hm. Wasn't there also some issue around crash-restarts benefiting from having > > those files around until later? I think what I'm remembering is what is > > referenced in this comment: > > I think due to wraparound if relfilenode gets reused by another > relation in the same checkpoint then there was an issue in crash > recovery with wal level minimal. But the root of the issue is a > wraparound, right? I'm not convinced the tombstones were required solely in the oid wraparound case before, despite the comment saying so, with wal_level=minimal. I gotta do some non-work stuff for a bit, so I need to stop pondering this now :) I think it might be a good idea to have a few weeks in which we do *not* remove the tombstones, but have assertion checks against such files existing when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4 a bit later. > > I'm doubtful it's a good idea to start dropping at the first segment. I'm > > fairly certain that there's smgrexists() checks in some places, and they'll > > now stop working, even if there are later segments that remained, e.g. because > > of an error in the middle of removing later segments. > > Okay, so you mean to say that we can first drop the remaining segment > and at last we drop the segment 0 right? I'd use the approach Robert suggested and delete from the end, going down. Greetings, Andres Freund
pgsql-hackers by date: