Re: making relfilenodes 56 bits - Mailing list pgsql-hackers

From Robert Haas
Subject Re: making relfilenodes 56 bits
Date
Msg-id CA+TgmoZq5=LWDK7kHaUbmWXxcaTuw_QwafgG9dr-BaPym_U8WQ@mail.gmail.com
In response to Re: making relfilenodes 56 bits  (Dilip Kumar <dilipbalaut@gmail.com>)
Trying to compile with 0001 and 0002 applied and -Wall -Werror in use, I get:

buf_init.c:119:4: error: implicit truncation from 'int' to bit-field
changes value from -1 to 255 [-Werror,-Wbitfield-constant-conversion]
                        CLEAR_BUFFERTAG(buf->tag);
                        ^~~~~~~~~~~~~~~~~~~~~~~~~
../../../../src/include/storage/buf_internals.h:122:14: note: expanded
from macro 'CLEAR_BUFFERTAG'
        (a).forkNum = InvalidForkNumber, \
                    ^ ~~~~~~~~~~~~~~~~~
1 error generated.

More review comments:

In pg_buffercache_pages_internal(), I suggest that we add an error
check. If fctx->record[i].relfilenumber is greater than the largest
value that can be represented as an OID, then let's do something like:

ERROR: relfilenode is too large to be represented as an OID
HINT: Upgrade the extension using ALTER EXTENSION pg_buffercache UPDATE

That way, instead of confusing people by giving them an incorrect
answer, we'll push them toward a step that they may have overlooked.
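The suggested guard can be sketched as follows. This is a hypothetical illustration, not the extension's actual code: `relfilenumber_fits_in_oid` is an invented name, and the real check would `ereport(ERROR, ...)` with the message and HINT proposed above rather than return a flag.

```c
#include <assert.h>
#include <stdint.h>

/* pg_buffercache's old SQL signature declares the relfilenode column as
 * "oid" (32 bits), so a 56-bit relfilenumber above the largest OID cannot
 * be reported faithfully.  Real code would ereport(ERROR, ...) here with
 * the HINT to run ALTER EXTENSION pg_buffercache UPDATE. */
#define OID_MAX_VALUE UINT32_MAX

static int
relfilenumber_fits_in_oid(uint64_t relfilenumber)
{
    return relfilenumber <= (uint64_t) OID_MAX_VALUE;
}
```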

In src/backend/access/transam/README, I think the sentence "So
cleaning up isn't really necessary." isn't too helpful. I suggest
replacing it with "Thus, on-disk collisions aren't possible."

> I think it is in line with oidCount, what do you think?

Oh it definitely is, and maybe it's OK the way you have it. But the
OID stuff has wraparound to worry about, and this doesn't; and this
has the SetNextRelFileNumber and that doesn't; so it is not
necessarily the case that the design which is best for that case is
also best for this case.

I believe that the persistence model for SetNextRelFileNumber needs
more thought. Right now I believe it's relying on the fact that, after
we try to restore the dump, we'll try to perform a clean shutdown of
the server before doing anything important, and that will persist the
final value, whatever it ends up being. However, there's no comment
explaining that theory of operation, and it seems pretty fragile
anyway. What if things don't go as planned? Suppose the power goes out
halfway through restoring the dump, and the user for some reason then
gives up on running pg_upgrade and just tries to do random things with
that server? Then I think there will be trouble, because nothing has
updated the nextrelfilenumber value and yet there are potentially new
files on disk. Maybe that's a stretch since I think other things might
also break if you do that, but I'm also not sure that's the only
scenario to worry about, especially if you factor in the possibility
of future code changes, like changes to the timing of when we shut
down and restart the server during pg_upgrade, or other uses of
binary-upgrade mode, or whatever. I don't know. Perhaps it's not
actually broken but I'm inclined to think it should be logging its
changes.

A related thought is that I don't think this patch has as many
cross-checks as it could have. For instance, suppose that when we
replay a WAL record that creates relation storage, we cross-check that
the value is less than the counter. I think you have a check in there
someplace that will error out if there is an actual collision --
although I can't find it at the moment, and possibly we want to add
some comments there even if it's in existing code -- but this kind of
thing would detect bugs that could lead to collisions even if no
collision actually occurs, e.g. because a duplicate relfilenumber is
used but in a different database or tablespace. It might be worth
spending some time thinking about other possible cross-checks too.
We're trying to create a system where the relfilenumber counter is
always ahead of all the relfilenumbers used on disk, but the coupling
between the relfilenumber-advancement machinery and the
make-files-on-disk machinery is pretty loose, and so there is a risk
that bugs could escape detection. Whatever we can do to increase the
probability of noticing when things have gone wrong, and/or to notice
it quicker, will be good.
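The replay-time cross-check amounts to verifying one invariant; here is a minimal sketch with hypothetical names (the real check would live in the redo routine and elog(PANIC) on failure rather than return a flag):

```c
#include <assert.h>
#include <stdint.h>

/* Invariant: the cluster-wide relfilenumber counter must always be ahead
 * of every relfilenumber appearing in a storage-creating WAL record,
 * regardless of the record's database or tablespace.  A violation means
 * the counter-advancement machinery has a bug, even if no two files have
 * actually collided yet. */
static int
replayed_relfilenumber_ok(uint64_t record_relfilenumber,
                          uint64_t next_relfilenumber)
{
    return record_relfilenumber < next_relfilenumber;
}
```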

+       if (!IsBinaryUpgrade)
+               elog(ERROR, "the RelFileNumber can be set only during binary upgrade");

I think you should remove the word "the". Primary error messages are
written telegram-style and "the" is usually omitted, especially at the
beginning of the message.

+        * This should not impact the performance, since we are not WAL logging
+        * it for every allocation, but only after allocating 512 RelFileNumber.

I think this claim is overly bold, and it would be better if the
current value of the constant weren't encoded in the comment. I'm not
sure we really need this part of the comment at all, but if we do,
maybe it should be reworded to something like: This is potentially a
somewhat expensive operation, but fortunately we only need to do it
for every VAR_RELNUMBER_PREFETCH new relfilenodes. Or maybe it's
better to put this explanation in GetNewRelFileNumber instead, e.g.
"If we run out of logged RelFileNumbers, then we must log more, and
also wait for the xlog record to be flushed to disk. This is somewhat
expensive, but hopefully VAR_RELNUMBER_PREFETCH is large enough that
this doesn't slow things down too much."

One thing that isn't great about this whole scheme is that it can lead
to lock pile-ups. Once somebody is waiting for an
XLOG_NEXT_RELFILENUMBER record to reach the disk, any other backend
that tries to get a new relfilenumber is going to block waiting for
RelFileNumberGenLock. I wonder whether this effect is observable in
practice: suppose we just create relations in a tight loop from inside
a stored procedure, and do that simultaneously in multiple backends?
What does the wait event distribution look like? Can we observe a lot
of RelFileNumberGenLock events or not really? I guess if we reduce
VAR_RELNUMBER_PREFETCH enough we can probably create a problem, but
how small a value is needed?

One thing we could think about doing here is try to stagger the xlog
and the flush. When we've used VAR_RELNUMBER_PREFETCH/2
relfilenumbers, log a record reserving VAR_RELNUMBER_PREFETCH from
where we are now, and remember the LSN. When we've used up our entire
previous allocation, XLogFlush() that record before allowing the
additional values to be used. The bookkeeping would be a bit more
complicated than currently, but I don't think it would be too bad. I'm
not sure how much it would actually help, though, or whether we need
it. If new relfilenumbers are being used up really quickly, then maybe
the record won't get flushed in the background before we run out of
available numbers anyway, and if they aren't, then maybe it doesn't
matter. On the other hand, even one transaction commit between when
the record is logged and when we run out of the previous allocation is
enough to force a flush, at least with synchronous_commit=on, so maybe
the chances of being able to piggyback on an existing flush are not so
bad after all. I'm not sure.
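The bookkeeping for that staggered scheme can be modeled in a few lines. This is a toy single-threaded simulation with invented names, using counters in place of the real XLogInsert/XLogFlush calls, just to show that logging at the half-way point while deferring the flush keeps one log and one flush per PREFETCH values:

```c
#include <assert.h>
#include <stdint.h>

#define PREFETCH 8              /* stand-in for VAR_RELNUMBER_PREFETCH */

static uint64_t counter = 0;            /* next relfilenumber to hand out */
static uint64_t logged_up_to = PREFETCH;  /* reserved in WAL, maybe unflushed */
static uint64_t flushed_up_to = PREFETCH; /* reservation known to be on disk */
static int      log_calls = 0;
static int      flush_calls = 0;

static uint64_t
get_new_relfilenumber(void)
{
    uint64_t result;

    /* Out of flushed reservations: force the pending record to disk. */
    if (counter >= flushed_up_to)
    {
        flushed_up_to = logged_up_to;
        flush_calls++;
    }
    result = counter++;

    /* Half of the allocation used: log the next batch, but don't flush. */
    if (counter == logged_up_to - PREFETCH / 2)
    {
        logged_up_to += PREFETCH;
        log_calls++;
    }
    return result;
}
```

With synchronous_commit=on, any commit between the log and the exhaustion point would make the deferred XLogFlush() a no-op, which is the piggybacking effect described above.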

+        * Generate a new relfilenumber.  We can not reuse the old relfilenumber
+        * because the unused relfilenumber files are not unlinked until the next
+        * checkpoint.  So if move the relation to the old tablespace again, we
+        * will get the conflicting relfilenumber file.

This is much clearer now but the grammar has some issues, e.g. "the
unused relfilenumber" should be just "unused relfilenumber" and "So if
move" is not right either. I suggest: "We cannot reuse the old
relfilenumber because of the possibility that that relation will be
moved back to the original tablespace before the next checkpoint. At
that point, the first segment of the main fork won't have been
unlinked yet, and an attempt to create new relation storage with that
same relfilenumber will fail."

In theory I suppose there's another way we could solve this problem:
keep using the same relfilenumber, and if the scenario described here
occurs, just reuse the old file. The reason why we can't do that today
is because we could be running with wal_level=minimal and replace a
relation with one whose contents aren't logged. If WAL replay then
replays the drop, we're in trouble. But if the only time we reuse a
relfilenumber for new relation storage is when relations are moved
around, then I think that scenario can't happen. However, I think
assigning a new relfilenumber is probably better, because it gets us
closer to a world in which relfilenumbers are never reused at all. It
doesn't get us all the way there because of createdb() and movedb(),
but it gets us closer and I prefer that.

+ * XXX although this all was true when the relfilenumbers were 32 bits wide but
+ * now the relfilenumbers are 56 bits wide so we don't have risk of
+ * relfilenumber being reused so in future we can immediately unlink the first
+ * segment as well.  Although we can reuse the relfilenumber during createdb()
+ * using file copy method or during movedb() but the above scenario is only
+ * applicable when we create a new relation.

Here is an edited version:

XXX. Although all of this was true when relfilenumbers were 32 bits wide, they
are now 56 bits wide and do not wrap around, so in the future we can change
the code to immediately unlink the first segment of the relation along with
all the others. We still do reuse relfilenumbers when createdb() is performed
using the
file-copy method or during movedb(), but the scenario described above can only
happen when creating a new relation.

I think that pg_filenode_relation,
binary_upgrade_set_next_heap_relfilenode, and other functions that are
now going to be accepting a RelFileNode using the SQL int8 datatype
should bounds-check the argument. It could be <0 or >=2^56, and I
believe it'd be best to throw an error for that straight off. The
three functions in pg_upgrade_support.c could share a static
subroutine for this, to avoid duplicating code.
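The shared subroutine would be a one-liner; here is a sketch with hypothetical names, where the real version would ereport(ERROR, ...) on an out-of-range value instead of returning a flag:

```c
#include <assert.h>
#include <stdint.h>

/* A RelFileNumber arrives from SQL as int8, which can hold any 64-bit
 * value, but only 56 bits are valid, i.e. the range [0, 2^56).  The
 * name and the largest-value macro below are assumptions for the sketch. */
#define MAX_RELFILENUMBER ((INT64_C(1) << 56) - 1)

static int
relfilenumber_in_range(int64_t value)
{
    return value >= 0 && value <= MAX_RELFILENUMBER;
}
```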

This bounds-checking issue also applies to the -f argument to pg_checksums.

I notice that the patch makes no changes to relmapper.c, and I think
that's a problem. Notice in particular:

#define MAX_MAPPINGS            62  /* 62 * 8 + 16 = 512 */

I believe that making RelFileNumber into a 64-bit value will cause the
8 in the calculation above to change to 16, defeating the intention
that the size of the file ought to be the smallest imaginable size of
a disk sector. It does seem like it would have been smart to include a
StaticAssertStmt in this file someplace that checks that the data
structure has the expected size, and now might be a good time, perhaps
in a separate patch, to add one. If we do nothing fancy here, the
maximum number of mappings will have to be reduced from 62 to 31,
which is a problem because global/pg_filenode.map currently has 48
entries. We could try to arrange to squeeze padding out of the
RelMapping struct, which would let us use just 12 bytes per mapping,
which would increase the limit to 41, but that's still less than we're
using already, never mind leaving room for future growth.
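The size arithmetic can be checked at compile time. The structs below only mirror the idea of relmapper.c's RelMapping, not its exact source, but they show both the 8-to-16-byte jump from alignment padding and the kind of StaticAssertStmt-style guard suggested above:

```c
#include <assert.h>
#include <stdint.h>

/* Mapping entry as it is today: two 32-bit fields, 8 bytes total. */
typedef struct RelMapping32
{
    uint32_t mapoid;
    uint32_t mapfilenumber;
} RelMapping32;

/* Mapping entry with a 64-bit relfilenumber: alignment inserts 4 bytes
 * of padding before the int64, so the struct grows to 16 bytes, not 12. */
typedef struct RelMapping64
{
    uint32_t mapoid;
    int64_t  mapfilenumber;
} RelMapping64;

/* Compile-time guard of the kind proposed for relmapper.c: 62 mappings
 * plus a 16-byte header must exactly fill one 512-byte sector. */
_Static_assert(62 * sizeof(RelMapping32) + 16 == 512,
               "relmap file must fit one 512-byte sector");
```

Squeezing the padding out (e.g. with pg_attribute_packed or by splitting the int64 into two uint32s) gets each entry down to 12 bytes, which is where the limit of 41 mappings above comes from.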

I don't know what to do about this exactly. I believe it's been
previously suggested that the actual minimum sector size on reasonably
modern hardware is never as small as 512 bytes, so maybe the file size
can just be increased to 1kB or something. If that idea is judged
unsafe, I can think of two other possible approaches offhand. One is
that we could move away from the idea of storing the OIDs in the file
along with the RelFileNumbers, and instead store the RelFileNumber for
a given OID at a fixed offset in the file. That would require either
hard-wiring offset tables into the code someplace, or generating them
as part of the build process, with separate tables for shared and
database-local relation map files. The other is that we could have
multiple 512-byte sectors and try to arrange for each relation to be
in the same sector with the indexes of that relation, since the
comments in relmapper.c say this:

 * aborts.  An important factor here is that the indexes and toast table of
 * a mapped catalog must also be mapped, so that the rewrites/relocations of
 * all these files commit in a single map file update rather than being tied
 * to transaction commit.

This suggests that atomicity is required across a table and its
indexes, but not that it's needed across arbitrary sets of entries in the
file.

Whatever we do, we shouldn't forget to bump RELMAPPER_FILEMAGIC.

--- a/src/include/catalog/pg_class.h
+++ b/src/include/catalog/pg_class.h
@@ -34,6 +34,13 @@ CATALOG(pg_class,1259,RelationRelationId)
BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* oid */
        Oid                     oid;

+       /* access method; 0 if not a table / index */
+       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
+
+       /* identifier of physical storage file */
+       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
+       int64           relfilenode BKI_DEFAULT(0);
+
        /* class name */

        NameData        relname;

@@ -49,13 +56,6 @@ CATALOG(pg_class,1259,RelationRelationId)
BKI_BOOTSTRAP BKI_ROWTYPE_OID(83,Relat
        /* class owner */
        Oid                     relowner BKI_DEFAULT(POSTGRES) BKI_LOOKUP(pg_authid);

-       /* access method; 0 if not a table / index */
-       Oid                     relam BKI_DEFAULT(heap) BKI_LOOKUP_OPT(pg_am);
-
-       /* identifier of physical storage file */
-       /* relfilenode == 0 means it is a "mapped" relation, see relmapper.c */
-       Oid                     relfilenode BKI_DEFAULT(0);
-
        /* identifier of table space for relation (0 means default for database) */
        Oid                     reltablespace BKI_DEFAULT(0) BKI_LOOKUP_OPT(pg_tablespace);

As Andres said elsewhere, this stinks. Not sure what the resolution of
the discussion over on the "AIX support" thread is going to be yet,
but hopefully not this.

+       uint32          relNumber_low;  /* relfilenumber 32 lower bits */
+       uint32          relNumber_hi:24;        /* relfilenumber 24 high bits */
+       uint32          forkNum:8;              /* fork number */

I still think we'd be better off with something like uint32
relForkDetails[2]. The bitfields would be nice if they meant that we
didn't have to do bit-shifting and masking operations ourselves, but
with the field split this way, we do anyway. So what's the point in
mixing the approaches?
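Since explicit shifts and masks are needed either way, the uint32 array approach looks like this. The struct and accessor names are hypothetical; the point is only that two plain words plus accessors round-trip a 56-bit relfilenumber and an 8-bit fork number without bitfields:

```c
#include <assert.h>
#include <stdint.h>

/* Word 0 holds the high 24 bits of the relfilenumber plus the 8-bit fork
 * number; word 1 holds the low 32 bits of the relfilenumber. */
typedef struct BufferTagDetails
{
    uint32_t relForkDetails[2];
} BufferTagDetails;

static void
set_details(BufferTagDetails *tag, uint64_t relnumber, uint8_t forknum)
{
    tag->relForkDetails[0] = (uint32_t) ((relnumber >> 32) << 8) | forknum;
    tag->relForkDetails[1] = (uint32_t) relnumber;
}

static uint64_t
get_relnumber(const BufferTagDetails *tag)
{
    return ((uint64_t) (tag->relForkDetails[0] >> 8) << 32)
        | tag->relForkDetails[1];
}

static uint8_t
get_forknum(const BufferTagDetails *tag)
{
    return (uint8_t) (tag->relForkDetails[0] & 0xFF);
}
```

A side benefit is that assigning -1 (InvalidForkNumber) through an accessor is an ordinary integer conversion, avoiding the bitfield constant-conversion warning shown at the top of this message.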

  * relNumber identifies the specific relation.  relNumber corresponds to
  * pg_class.relfilenode (NOT pg_class.oid, because we need to be able
  * to assign new physical files to relations in some situations).
- * Notice that relNumber is only unique within a database in a particular
- * tablespace.
+ * Notice that relNumber is unique within a cluster.

I think this paragraph would benefit from more revision. I think that
we should just nuke the parenthesized part altogether, since we'll now
never use pg_class.oid as relNumber, and to suggest otherwise is just
confusing. As for the last sentence, "Notice that relNumber is unique
within a cluster." isn't wrong, but I think we could be more precise
and informative. Perhaps: "relNumber values are assigned by
GetNewRelFileNumber(), which will only ever assign the same value once
during the lifetime of a cluster. However, since CREATE DATABASE
duplicates the relfilenumbers of the template database, the values are
in practice only unique within a database, not globally."

That's all I've got for now.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com


