Thread: constants for tar header offsets

constants for tar header offsets

From
Robert Haas
Date:
Hi,

We have a few different places in the code where we generate or modify
tar headers or just read data out of them. The code in question uses
one of my less-favorite programming things: magic numbers. The offsets
of the various fields within the tar header are just hard-coded in
each relevant place in our code. I think we should clean that up, as
in the attached patch.

I hasten to emphasize that, while I think this is an improvement, I
don't think the result is particularly awesome. Even with the patch,
src/port/tar.c and src/include/pgtar.h do a poor job insulating
callers from the details of the tar format. However, it's also not
very clear to me how to fix that. For instance, I thought about
writing a function that parses a tar header into a struct and then
using it in all of these places, but that seems like it would lose too
much efficiency relative to the current ad-hoc coding. So for now I
don't have a better idea than this.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: constants for tar header offsets

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> We have a few different places in the code where we generate or modify
> tar headers or just read data out of them. The code in question uses
> one of my less-favorite programming things: magic numbers. The offsets
> of the various fields within the tar header are just hard-coded in
> each relevant place in our code. I think we should clean that up, as
> in the attached patch.

Generally +1, with a couple of additional thoughts:

1. Is it worth inventing macros for the values of the file type,
rather than just writing the comment you did?

2. The header size is defined as 512 bytes, but this doesn't sum to 512:

+    TAR_OFFSET_PREFIX = 345        /* 155 byte string */

Either that field's length is really 167 bytes, or there's some other
field you didn't document.  (It looks like you may have copied "155"
from an incorrect existing comment?)

> I hasten to emphasize that, while I think this is an improvement, I
> don't think the result is particularly awesome. Even with the patch,
> src/port/tar.c and src/include/pgtar.h do a poor job insulating
> callers from the details of the tar format. However, it's also not
> very clear to me how to fix that.

Yeah, this is adding greppability (a good thing!) but little more.
However, I'm not convinced it's worth doing more.  It's not like
this data structure will change anytime soon.

            regards, tom lane



Re: constants for tar header offsets

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 11:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > We have a few different places in the code where we generate or modify
> > tar headers or just read data out of them. The code in question uses
> > one of my less-favorite programming things: magic numbers. The offsets
> > of the various fields within the tar header are just hard-coded in
> > each relevant place in our code. I think we should clean that up, as
> > in the attached patch.
>
> Generally +1, with a couple of additional thoughts:
>
> 1. Is it worth inventing macros for the values of the file type,
> rather than just writing the comment you did?

Might be.

> 2. The header size is defined as 512 bytes, but this doesn't sum to 512:
>
> +       TAR_OFFSET_PREFIX = 345         /* 155 byte string */
>
> Either that field's length is really 167 bytes, or there's some other
> field you didn't document.  (It looks like you may have copied "155"
> from an incorrect existing comment?)

According to my research, it is neither of those, e.g. see

https://www.subspacefield.org/~vax/tar_format.html
https://www.ibm.com/docs/en/zos/2.4.0?topic=formats-tar-format-tar-archives
https://wiki.osdev.org/USTAR

I think that what happened is that whoever designed the original tar
format decided on 512 byte blocks. And the header did not take up the
whole block. The USTAR format is an extension of the original format
which uses more of the block, but still not all of it.

> Yeah, this is adding greppability (a good thing!) but little more.
> However, I'm not convinced it's worth doing more.  It's not like
> this data structure will change anytime soon.

Right. greppability is a major concern for me here, and also bug
surface. Right now, to use the functions in pgtar.h, you need to know
all the header offsets as well as the format and length of each header
field. This centralizes constants for the header offsets, and at least
provides some centralized documentation of the rest. It's not great,
though, because it seems like there's some risk of someone writing new
code and getting confused about whether the length of a certain field
is 8 or 12, for example. A thicker abstraction layer might be able to
avoid or minimize such risks better than what we have, but I don't
really know how to design it, whereas this seems like an obvious
improvement.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: constants for tar header offsets

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Apr 18, 2023 at 11:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 2. The header size is defined as 512 bytes, but this doesn't sum to 512:
>> +       TAR_OFFSET_PREFIX = 345         /* 155 byte string */

> I think that what happened is that whoever designed the original tar
> format decided on 512 byte blocks. And the header did not take up the
> whole block. The USTAR format is an extension of the original format
> which uses more of the block, but still not all of it.

Hmm, you're right: I checked the POSIX.1-2018 spec as well, and
it agrees that the prefix field is 155 bytes long.  Perhaps just
add another comment line indicating that 12 bytes remain unassigned?

            regards, tom lane



Re: constants for tar header offsets

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 12:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, you're right: I checked the POSIX.1-2018 spec as well, and
> it agrees that the prefix field is 155 bytes long.  Perhaps just
> add another comment line indicating that 12 bytes remain unassigned?

OK. Here's v2, with that change and a few others.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: constants for tar header offsets

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> OK. Here's v2, with that change and a few others.

LGTM.

            regards, tom lane



Re: constants for tar header offsets

From
Dagfinn Ilmari Mannsåker
Date:
Robert Haas <robertmhaas@gmail.com> writes:

> On Tue, Apr 18, 2023 at 12:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm, you're right: I checked the POSIX.1-2018 spec as well, and
>> it agrees that the prefix field is 155 bytes long.  Perhaps just
>> add another comment line indicating that 12 bytes remain unassigned?
>
> OK. Here's v2, with that change and a few others.

It still has magic numbers for the sizes of the fields, should those
also be named constants?

- ilmari



Re: constants for tar header offsets

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 12:56 PM Dagfinn Ilmari Mannsåker
<ilmari@ilmari.org> wrote:
> It still has magic numbers for the sizes of the fields, should those
> also be named constants?

I thought about that. It's arguable, but personally, I don't think
it's worth it. If the concern is greppability, having constants for
the offsets is good enough for that. If the concern is making it
error-free, I think we'd be well-advised to consider bigger redesigns
of the API. For example, we could have a function
read_number_from_tar_header(char *pointer_to_the_start_of_the_block,
enum which_field) and then that function could encapsulate the
knowledge of which tar numbers are 8 bytes and which are 12 bytes.
Writing read_number_from_tar_header(h, TAR_FIELD_CHECKSUM) seems
potentially less error-prone than
read_tar_number(&h[TAR_OFFSET_CHECKSUM], 8). On the other hand,
changing the latter to read_tar_number(&h[TAR_OFFSET_CHECKSUM],
TAR_LENGTH_CHECKSUM) seems longer but not necessarily cleaner. So I
felt it didn't make sense.

Just to be clear, I don't have a full vision for what a replacement
API ought to look like, and I'm not sure that figuring that out is
something that has to be done right this minute. I proposed this patch
not because it's perfect, but because it's simple. We can think of
doing more in the future if someone wants to devote the effort, and
that person might even be me, but right now it's not.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: constants for tar header offsets

From
"Tristan Partin"
Date:
On Wed Apr 19, 2023 at 8:09 AM CDT, Robert Haas wrote:
> On Tue, Apr 18, 2023 at 12:56 PM Dagfinn Ilmari Mannsåker
> <ilmari@ilmari.org> wrote:
> > It still has magic numbers for the sizes of the fields, should those
> > also be named constants?
>
> I thought about that. It's arguable, but personally, I don't think
> it's worth it. If the concern is greppability, having constants for
> the offsets is good enough for that. If the concern is making it
> error-free, I think we'd be well-advised to consider bigger redesigns
> of the API. For example, we could have a function
> read_number_from_tar_header(char *pointer_to_the_start_of_the_block,
> enum which_field) and then that function could encapsulate the
> knowledge of which tar numbers are 8 bytes and which are 12 bytes.
> Writing read_number_from_tar_header(h, TAR_FIELD_CHECKSUM) seems
> potentially less error-prone than
> read_tar_number(&h[TAR_OFFSET_CHECKSUM], 8). On the other hand,
> changing the latter to read_tar_number(&h[TAR_OFFSET_CHECKSUM],
> TAR_LENGTH_CHECKSUM) seems longer but not necessarily cleaner. So I
> felt it didn't make sense.
>
> Just to be clear, I don't have a full vision for what a replacement
> API ought to look like, and I'm not sure that figuring that out is
> something that has to be done right this minute. I proposed this patch
> not because it's perfect, but because it's simple. We can think of
> doing more in the future if someone wants to devote the effort, and
> that person might even be me, but right now it's not.

A new API design would be great, but for right now v2 is good enough and
should be committed. It is much easier to read the code with this patch
applied.

Marking as "Ready for Committer" since we all seem to agree that this is
better than what exists.

--
Tristan Partin
Neon (https://neon.tech)



Re: constants for tar header offsets

From
Robert Haas
Date:
On Tue, Aug 1, 2023 at 11:07 AM Tristan Partin <tristan@neon.tech> wrote:
> A new API design would be great, but for right now v2 is good enough and
> should be committed. It is much easier to read the code with this patch
> applied.
>
> Marking as "Ready for Committer" since we all seem to agree that this is
> better than what exists.

Thanks, committed now.

--
Robert Haas
EDB: http://www.enterprisedb.com