Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: making relfilenodes 56 bits
Date
Msg-id CAFiTN-t9j=z423_UcvrdwD0Hry5GBuSXJEU72sftRec9dM-1ZA@mail.gmail.com
In response to Re: making relfilenodes 56 bits  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Jul 7, 2022 at 10:56 PM Robert Haas <robertmhaas@gmail.com> wrote:

I have accepted all the suggestions; find my inline replies where we
need more thought.

> buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
> changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
>                         CLEAR_BUFFERTAG(buf->tag);
>                         ^~~~~~~~~~~~~~~~~~~~~~~~~
> ../../../../src/include/storage/buf_internals.h:122:14: note: expanded
> from macro 'CLEAR_BUFFERTAG'
>         (a).forkNum = InvalidForkNumber, \
>                     ^ ~~~~~~~~~~~~~~~~~
> 1 error generated.

Hmm, since we are now using an unsigned int bit-field, IMHO we can
make InvalidForkNumber 255 instead of -1?


> > I think it is in line with oidCount, what do you think?
>
> Oh it definitely is, and maybe it's OK the way you have it. But the
> OID stuff has wraparound to worry about, and this doesn't; and this
> has the SetNextRelFileNumber and that doesn't; so it is not
> necessarily the case that the design which is best for that case is
> also best for this case.

Yeah, right, but now with the latest changes for piggybacking on the
XLogFlush() I think it is cleaner to have the count.

> I believe that the persistence model for SetNextRelFileNumber needs
> more thought. Right now I believe it's relying on the fact that, after
> we try to restore the dump, we'll try to perform a clean shutdown of
> the server before doing anything important, and that will persist the
> final value, whatever it ends up being. However, there's no comment
> explaining that theory of operation, and it seems pretty fragile
> anyway. What if things don't go as planned? Suppose the power goes out
> halfway through restoring the dump, and the user for some reason then
> gives up on running pg_upgrade and just tries to do random things with
> that server? Then I think there will be trouble, because nothing has
> updated the nextrelfilenumber value and yet there are potentially new
> files on disk. Maybe that's a stretch since I think other things might
> also break if you do that, but I'm also not sure that's the only
> scenario to worry about, especially if you factor in the possibility
> of future code changes, like changes to the timing of when we shut
> down and restart the server during pg_upgrade, or other uses of
> binary-upgrade mode, or whatever. I don't know. Perhaps it's not
> actually broken but I'm inclined to think it should be logging its
> changes.

But we are already logging this when setting a relfilenumber that is
beyond the already-logged range; am I missing something?  Check this
change:
+    relnumbercount = relnumber - ShmemVariableCache->nextRelFileNumber;
+    if (ShmemVariableCache->relnumbercount <= relnumbercount)
+    {
+        LogNextRelFileNumber(relnumber + VAR_RELNUMBER_PREFETCH, NULL);
+        ShmemVariableCache->relnumbercount = VAR_RELNUMBER_PREFETCH;
+    }
+    else
+        ShmemVariableCache->relnumbercount -= relnumbercount;

> A related thought is that I don't think this patch has as many
> cross-checks as it could have. For instance, suppose that when we
> replay a WAL record that creates relation storage, we cross-check that
> the value is less than the counter. I think you have a check in there
> someplace that will error out if there is an actual collision --
> although I can't find it at the moment, and possibly we want to add
> some comments there even if it's in existing code -- but this kind of
> thing would detect bugs that could lead to collisions even if no
> collision actually occurs, e.g. because a duplicate relfilenumber is
> used but in a different database or tablespace. It might be worth
> spending some time thinking about other possible cross-checks too.
> We're trying to create a system where the relfilenumber counter is
> always ahead of all the relfilenumbers used on disk, but the coupling
> between the relfilenumber-advancement machinery and the
> make-files-on-disk machinery is pretty loose, and so there is a risk
> that bugs could escape detection. Whatever we can do to increase the
> probability of noticing when things have gone wrong, and/or to notice
> it quicker, will be good.

I had those changes in v7-0003; now I have merged them into 0002.  It
has assert checks while replaying the WAL for smgr create and smgr
truncate, and in the normal path, when allocating a new relfilenumber,
we assert that no file with that number already exists.

> One thing that isn't great about this whole scheme is that it can lead
> to lock pile-ups. Once somebody is waiting for an
> XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
> that tries to get a new relfilenumber is going to block waiting for
> RelFileNumberGenLock. I wonder whether this effect is observable in
> practice: suppose we just create relations in a tight loop from inside
> a stored procedure, and do that simultaneously in multiple backends?
> What does the wait event distribution look like? Can we observe a lot
> of RelFileNumberGenLock events or not really? I guess if we reduce
> VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
> how small a value is needed?

I have done some performance tests: with very small values I can see a
lot of wait events for RelFileNumberGen, but with bigger numbers like
256 or 512 it is not really bad.  See the results at the end of the
mail[1].

> One thing we could think about doing here is try to stagger the xlog
> and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
> relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
> where we are now, and remember the LSN. When we've used up our entire
> previous allocation, XLogFlush() that record before allowing the
> additional values to be used. The bookkeeping would be a bit more
> complicated than currently, but I don't think it would be too bad. I'm
> not sure how much it would actually help, though, or whether we need
> it. If new relfilenumbers are being used up really quickly, then maybe
> the record won't get flushed into the background before we run out of
> available numbers anyway, and if they aren't, then maybe it doesn't
> matter. On the other hand, even one transaction commit between when
> the record is logged and when we run out of the previous allocation is
> enough to force a flush, at least with synchronous_commit=on, so maybe
> the chances of being able to piggyback on an existing flush are not so
> bad after all. I'm not sure.

I have made these changes in GetNewRelFileNumber(); this required
tracking the last logged record pointer as well, but I think it looks
clean.  With this I can see some reduction in RelFileNumberGen wait
events[1].

> In theory I suppose there's another way we could solve this problem:
> keep using the same relfilenumber, and if the scenario described here
> occurs, just reuse the old file. The reason why we can't do that today
> is because we could be running with wal_level=minimal and replace a
> relation with one whose contents aren't logged. If WAL replay then
> replays the drop, we're in trouble. But if the only time we reuse a
> relfilenumber for new relation storage is when relations are moved
> around, then I think that scenario can't happen. However, I think
> assigning a new relfilenumber is probably better, because it gets us
> closer to a world in which relfilenumbers are never reused at all. It
> doesn't get us all the way there because of createdb() and movedb(),
> but it gets us closer and I prefer that.

I agree with you.

> I notice that the patch makes no changes to relmapper.c, and I think
> that's a problem. Notice in particular:
>
> #define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */
>
> I believe that making RelFileNumber into a 64-bit value will cause the
> 8 in the calculation above to change to 16, defeating the intention
> that the size of the file ought to be the smallest imaginable size of
> a disk sector. It does seem like it would have been smart to include a
> StaticAssertStmt in this file someplace that checks that the data
> structure has the expected size, and now might be a good time, perhaps
> in a separate patch, to add one. If we do nothing fancy here, the
> maximum number of mappings will have to be reduced from 62 to 31,
> which is a problem because global/pg_filenode.map currently has 48
> entries. We could try to arrange to squeeze padding out of the
> RelMapping struct, which would let us use just 12 bytes per mapping,
> which would increase the limit to 41, but that's still less than we're
> using already, never mind leaving room for future growth.
>
> I don't know what to do about this exactly. I believe it's been
> previously suggested that the actual minimum sector size on reasonably
> modern hardware is never as small as 512 bytes, so maybe the file size
> can just be increased to 1kB or something. If that idea is judged
> unsafe, I can think of two other possible approaches offhand. One is
> that we could move away from the idea of storing the OIDs in the file
> along with the RelFileNodes, and instead store the offset for a given
> RelFileNode at a fixed offset in the file. That would require either
> hard-wiring offset tables into the code someplace, or generating them
> as part of the build process, with separate tables for shared and
> database-local relation map files. The other is that we could have
> multiple 512-byte sectors and try to arrange for each relation to be
> in the same sector with the indexes of that relation, since the
> comments in relmapper.c say this:
>
>  * aborts.  An important factor here is that the indexes and toast table of
>  * a mapped catalog must also be mapped, so that the rewrites/relocations of
>  * all these files commit in a single map file update rather than being tied
>  * to transaction commit.
>
> This suggests that atomicity is required across a table and its
> indexes, but not that it's needed across arbitrary sets of entries
> in the file.
>
> Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

I am not sure what the best solution is here, but I agree that most
modern hardware has a sector size bigger than 512 bytes, so we can
just change the file size to 1024.

The current value of RELMAPPER_FILEMAGIC is 0x592717; I am not sure
how this version ID is decided.  Is it some random magic number, or is
it based on some logic?

>
> +       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
> +       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
> +       uint32          forkNum:8;              /* fork number */
>
> I still think we'd be better off with something like uint32
> relForkDetails[2]. The bitfields would be nice if they meant that we
> didn't have to do bit-shifting and masking operations ourselves, but
> with the field split this way, we do anyway. So what's the point in
> mixing the approaches?

Actually, with this we were able to access the forkNum directly, but I
also think changing it to relForkDetails[2] is cleaner, so I have done
that.  As part of the related changes in 0001 I have removed the
direct access to the forkNum.

[1] Wait event details

Procedure:
CREATE OR REPLACE FUNCTION create_table(count int) RETURNS void AS $$
DECLARE
  relname varchar;
  pid int;
  i   int;
BEGIN
  SELECT pg_backend_pid() INTO pid;
  relname := 'test_' || pid;
  FOR i IN 1..count LOOP
    EXECUTE format('CREATE TABLE %s(a int)', relname);

    EXECUTE format('DROP TABLE %s', relname);
  END LOOP;
END;
$$ LANGUAGE plpgsql;

Target test: Executed "select create_table(100);" query from pgbench
with 32 concurrent backends.

VAR_RELNUMBER_PREFETCH = 8

    905  LWLock          | LockManager
    346  LWLock          | RelFileNumberGen
    192
    190  Activity        | WalWriterMain

VAR_RELNUMBER_PREFETCH=128
   1187  LWLock          | LockManager
    247  LWLock          | RelFileNumberGen
    139  Activity        | CheckpointerMain

VAR_RELNUMBER_PREFETCH=256

   1029  LWLock          | LockManager
    158  LWLock          | BufferContent
    134  Activity        | CheckpointerMain
    134  Activity        | AutoVacuumMain
    133  Activity        | BgWriterMain
    132  Activity        | WalWriterMain
    130  Activity        | LogicalLauncherMain
    123  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=512

  1174  LWLock          | LockManager
    136  Activity        | CheckpointerMain
    136  Activity        | BgWriterMain
    136  Activity        | AutoVacuumMain
    134  Activity        | WalWriterMain
    134  Activity        | LogicalLauncherMain
     99  LWLock          | BufferContent
     35  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH=2048
   1070  LWLock          | LockManager
    160  LWLock          | BufferContent
    156  Activity        | CheckpointerMain
    156
    155  Activity        | BgWriterMain
    154  Activity        | AutoVacuumMain
    153  Activity        | WalWriterMain
    149  Activity        | LogicalLauncherMain
     31  LWLock          | RelFileNumberGen
     28  Timeout         | VacuumDelay


VAR_RELNUMBER_PREFETCH=4096
Note: no wait event for RelFileNumberGen at value 4096

New patch with piggybacking XLogFlush()

VAR_RELNUMBER_PREFETCH = 8

  1105  LWLock          | LockManager
    143  LWLock          | BufferContent
    140  Activity        | CheckpointerMain
    140  Activity        | BgWriterMain
    139  Activity        | WalWriterMain
    138  Activity        | AutoVacuumMain
    137  Activity        | LogicalLauncherMain
    115  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 256
   1130  LWLock          | LockManager
    141  Activity        | CheckpointerMain
    139  Activity        | BgWriterMain
    137  Activity        | AutoVacuumMain
    136  Activity        | LogicalLauncherMain
    135  Activity        | WalWriterMain
     69  LWLock          | BufferContent
     31  LWLock          | RelFileNumberGen

VAR_RELNUMBER_PREFETCH = 1024
Note: no wait event for RelFileNumberGen at value 1024


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com