Thread: [PATCH] Clarify the comments about varlena header encoding

[PATCH] Clarify the comments about varlena header encoding

From

Aleksander Alekseev

Date:

17 August 2022, 18:06:38

Hi hackers,

I noticed that the comments regarding bit layouts for varlena headers
in postgres.h are somewhat misleading. For instance, when reading:

```
00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
```

... one can assume this is a 00xxxxxx byte followed by another 4 bytes
(which is wrong). Also one can read this as "aligned, uncompressed
data" (which again is wrong).

```
10000000 1-byte length word, unaligned, TOAST pointer
```

This is misleading too. The comments above this line say that `struct
varatt_external` is a TOAST pointer. sizeof(varatt_external) = 16,
plus 1 byte equals 17, right? However the documentation [1] claims the
result should be 18:

"""
Allowing for the varlena header bytes, the total size of an on-disk
TOAST pointer datum is therefore 18 bytes regardless of the actual
size of the represented value.
"""

I did my best to get rid of any ambiguity. The patch is attached.

[1]: https://www.postgresql.org/docs/current/storage-toast.html

-- 
Best regards,
Aleksander Alekseev

Attachment

v1-0001-Clarify-the-comments-about-varlena-header-encodin.patch

Re: [PATCH] Clarify the comments about varlena header encoding

From

John Naylor

Date:

18 August 2022, 04:14:34

On Thu, Aug 18, 2022 at 1:06 AM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> I noticed that the comments regarding bit layouts for varlena headers
> in postgres.h are somewhat misleading. For instance, when reading:

I agree it's confusing, but I don't think this patch is the right direction.

> ```
> 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
> ```
>
> ... one can assume this is a 00xxxxxx byte followed by another 4 bytes
> (which is wrong). Also one can read this as "aligned, uncompressed
> data" (which again is wrong).

- * 00xxxxxx 4-byte length word, aligned, uncompressed data (up to 1G)
+ * 00xxxxxx xxxxxxxx xxxxxxxx xxxxxxxx, uncompressed data (up to 1G)

Maybe "00xxxxxx 4-byte length word (aligned)," is more clear about
what is aligned. Also, adding all those xxx's obscures the point that
we only need to examine one byte to figure out what to do next.

> ```
> 10000000 1-byte length word, unaligned, TOAST pointer
> ```
>
> This is misleading too. The comments above this line say that `struct
> varatt_external` is a TOAST pointer. sizeof(varatt_external) = 16,
> plus 1 byte equals 17, right? However the documentation [1] claims the
> result should be 18:

The patch has:

+ * In the third case the va_tag field (see varattrib_1b_e) is used to discern
+ * the specific type and length of the pointer datum. On disk the "xxx" bits
+ * currently always store sizeof(varatt_external) + 2.

...so not sure where 17 came from.

- * 10000000 1-byte length word, unaligned, TOAST pointer
+ * 10000000 xxxxxxxx, TOAST pointer (struct varatt_external)

This implies that the header is two bytes, which is not accurate. That
next byte is a type tag:

/* TOAST pointers are a subset of varattrib_1b with an identifying tag byte */
typedef struct
{
uint8 va_header; /* Always 0x80 or 0x01 */
uint8 va_tag; /* Type of datum */
char va_data[FLEXIBLE_ARRAY_MEMBER]; /* Type-specific data */
} varattrib_1b_e;

...and does not always represent the on-disk length:

/*
 * Type tag for the various sorts of "TOAST pointer" datums.  The peculiar
 * value for VARTAG_ONDISK comes from a requirement for on-disk compatibility
 * with a previous notion that the tag field was the pointer datum's length.
 */
typedef enum vartag_external
{
VARTAG_INDIRECT = 1,
VARTAG_EXPANDED_RO = 2,
VARTAG_EXPANDED_RW = 3,
VARTAG_ONDISK = 18
} vartag_external;

And I don't think the new comments referring to "third case", "first
two cases", etc make it easier to follow.
-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: [PATCH] Clarify the comments about varlena header encoding

From

Aleksander Alekseev

Date:

06 September 2022, 10:18:52

Hi John,

Thanks for the feedback.

> Maybe "00xxxxxx 4-byte length word (aligned)," is more clear about
> what is aligned. Also, adding all those xxx's obscures the point that
> we only need to examine one byte to figure out what to do next.

IMO "00xxxxxx 4-byte length word" is still confusing. One can misread
this as a 00-xx-xx-xx hex value, where the first byte (not two bits)
is 00h.

> The patch has:
>
> + * In the third case the va_tag field (see varattrib_1b_e) is used to discern
> + * the specific type and length of the pointer datum. On disk the "xxx" bits
> + * currently always store sizeof(varatt_external) + 2.
>
> ...so not sure where 17 came from.

Right, AFTER applying the patch it's clear that it's actually 18
bytes. Currently the comment says "1 byte followed by a TOAST pointer
(16 bytes)" which is wrong.

> - * 10000000 1-byte length word, unaligned, TOAST pointer
> + * 10000000 xxxxxxxx, TOAST pointer (struct varatt_external)
>
> This implies that the header is two bytes, which is not accurate. That
> next byte is a type tag:
> [...]
> ...and does not always represent the on-disk length:

Well, the comments don't say what is the header and what is the type
tag. They merely describe the bit layouts. The patch doesn't seem to
make things worse in this respect. Do you think we should address this
too? I suspect that describing the difference between the header and
the type tag here will create even more confusion.

> And I don't think the new comments referring to "third case", "first
> two cases", etc make it easier to follow.

Maybe you are right. I'm open to suggestions.

-- 
Best regards,
Aleksander Alekseev

Re: [PATCH] Clarify the comments about varlena header encoding

From

John Naylor

Date:

08 September 2022, 02:54:43

On Tue, Sep 6, 2022 at 5:19 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi John,
>
> Thanks for the feedback.
>
> > Maybe "00xxxxxx 4-byte length word (aligned)," is more clear about
> > what is aligned. Also, adding all those xxx's obscures the point that
> > we only need to examine one byte to figure out what to do next.
>
> IMO "00xxxxxx 4-byte length word" is still confusing. One can misread
> this as a 00-xx-xx-xx hex value, where the first byte (not two bits)
> is 00h.

The top of the comment literally says

 * Bit layouts for varlena headers on big-endian machines:

...but maybe we can say at the top that we inspect the first byte to
determine what kind of header it is. Or put the now-standard 0b in
front.

> > The patch has:
> >
> > + * In the third case the va_tag field (see varattrib_1b_e) is used to discern
> > + * the specific type and length of the pointer datum. On disk the "xxx" bits
> > + * currently always store sizeof(varatt_external) + 2.
> >
> > ...so not sure where 17 came from.
>
> Right, AFTER applying the patch it's clear that it's actually 18
> bytes.

Okay, I see now that this quote from your first email:

"This is misleading too. The comments above this line say that `struct
varatt_external` is a TOAST pointer. sizeof(varatt_external) = 16,
plus 1 byte equals 17, right? However the documentation [1] claims the
result should be 18:"

...is not your thought, but one of a fictional misled reader. I
actually found this phrase more misleading than the header comments.
:-)

I think the problem is ambiguity about what a "toast pointer" is. This comment:

 * struct varatt_external is a traditional "TOAST pointer", that is, the

has caused people to think a toasted value in the main relation takes
up 16 bytes on disk sizeof(varatt_external) = 16, when it's actually
18. Is the 16 the "toast pointer" or the 18?

> > - * 10000000 1-byte length word, unaligned, TOAST pointer
> > + * 10000000 xxxxxxxx, TOAST pointer (struct varatt_external)
> >
> > This implies that the header is two bytes, which is not accurate. That
> > next byte is a type tag:
> > [...]
> > ...and does not always represent the on-disk length:
>
> Well, the comments don't say what is the header and what is the type
> tag.

Because the comments explain the following macros that read bits in
the *first* byte of a 1- or 4-byte header to determine what kind it
is.

> They merely describe the bit layouts. The patch doesn't seem to
> make things worse in this respect. Do you think we should address this
> too? I suspect that describing the difference between the header and
> the type tag here will create even more confusion.

I said nothing about describing the difference between the header and
type tag. The patch added xxx's for the type tag in a comment about
the header. This is more misleading than what is there now.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: [PATCH] Clarify the comments about varlena header encoding

From

Aleksander Alekseev

Date:

11 September 2022, 10:06:07

Hi John,

Many thanks for the feedback!

> Or put the now-standard 0b in front.

Good idea.

> I think the problem is ambiguity about what a "toast pointer" is. This comment:
>
> * struct varatt_external is a traditional "TOAST pointer", that is, the

Right. The comment for varatt_external says that it IS a TOAST
pointer. However the comments for varlena headers bit layout
implicitly include it into a TOAST pointer, which contradicts the
previous comments. I suggest we fix this ambiguity by explicitly
enumerating the type tag in the comments for varlena headers.

> The patch added xxx's for the type tag in a comment about
> the header. This is more misleading than what is there now.

OK, here is another attempt. Changes compared to v1:

* "xxx xxx xxx" were removed, according to the feedback
* 0b prefix was added in order to make sure the reader will not
misread this as a hex values
* The clarification about the type tag was added
* The references to "first case", "second case", etc were removed

Hopefully it's better now.

-- 
Best regards,
Aleksander Alekseev

Attachment

v2-0001-Clarify-the-comments-about-varlena-header-encodin.patch

Re: [PATCH] Clarify the comments about varlena header encoding

From

John Naylor

Date:

12 September 2022, 04:47:07

On Sun, Sep 11, 2022 at 5:06 PM Aleksander Alekseev
<aleksander@timescale.com> wrote:
>
> Hi John,
>
> Many thanks for the feedback!
>
> > Or put the now-standard 0b in front.
>
> Good idea.

Now that I look at the results, though, it's distracting and not good
for readability. I'm not actually sure we need to do anything here,
but I am somewhat in favor of putting [un]aligned in parentheses, as
already discussed. Even there, in the first email you said:

> Also one can read this as "aligned, uncompressed
> data" (which again is wrong).

I'm not sure it rises to the level of "wrong", because a blob of bytes
immediately after an aligned uint32 is in fact aligned. The important
thing is: a zero byte is always either a padding byte or part of a
4-byte header, so it's the alignment of the header we really care
about.

> > I think the problem is ambiguity about what a "toast pointer" is. This comment:
> >
> > * struct varatt_external is a traditional "TOAST pointer", that is, the
>
> Right. The comment for varatt_external says that it IS a TOAST
> pointer.

Well, the word "traditional" is not very informative, but it is there.
And afterwards there is also varatt_indirect, varatt_expanded, and
varattrib_1b_e, which all mention "TOAST pointer".

> However the comments for varlena headers bit layout
> implicitly include it into a TOAST pointer, which contradicts the
> previous comments. I suggest we fix this ambiguity by explicitly
> enumerating the type tag in the comments for varlena headers.

- * 10000000 1-byte length word, unaligned, TOAST pointer
+ * 0b10000000 1-byte length word (unaligned), type tag, TOAST pointer

This is distracting from the point of this whole comment, which, I
will say again is: How to look at the first byte to determine what
kind of varlena we're looking at. There is no reason to mention the
type tag here, at all.

- * In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
- * the specific type and length of the pointer datum.
+ * For the TOAST pointers the type tag (see varattrib_1b_e.va_tag field) is
+ * used to discern the specific type and length of the pointer datum.

I don't think this clarifies anything, it's just a rephrasing.

More broadly, I think the description of varlenas in this header is at
a kind of "local maximum" -- minor adjustments are more likely to make
it worse. To significantly improve clarity might require a larger
rewriting, but I'm not personally interested in taking part in that.

-- 
John Naylor
EDB: http://www.enterprisedb.com

Re: [PATCH] Clarify the comments about varlena header encoding

From

Michael Paquier

Date:

12 October 2022, 05:26:38

On Mon, Sep 12, 2022 at 11:47:07AM +0700, John Naylor wrote:
> I don't think this clarifies anything, it's just a rephrasing.
>
> More broadly, I think the description of varlenas in this header is at
> a kind of "local maximum" -- minor adjustments are more likely to make
> it worse. To significantly improve clarity might require a larger
> rewriting, but I'm not personally interested in taking part in that.

This has remained unanswered for four weeks now, so marked as returned
with feedback in the 2022-09 CF.
--
Michael

Attachment

signature.asc