Thread: byte-size of column values

byte-size of column values

From
Dominique Devienne
Date:
Hi. I'm surprised by the result for bit(3) and char, when calling
pg_column_size().

Why 6, instead of 1? The doc does mention 5-8 bytes overhead, but I
expected those for varying bit, not fixed-sized bit typed values. How
come?

Similarly, why 2 for char? Is it linked to Unicode?
1 byte for the varying UTF-8 encoded length, then the (potentially)
multi-byte encoding?
(the very bit pattern of UTF-8 allows to infer the encoded length, so
storing the length explicitly is theoretically not even necessary)

Similarly, enums are always 4 bytes I read, despite rarely exceeding
cardinalities beyond a single byte can store.

How does one store as compactedly as possible several small enums, on
millions of rows?

And is the minimum column size always 2?

I'm sure many we call out "premature optimization", but isn't using 32
bits instead of 2, 3 (or 8, to round to a byte) wasteful, in disk
space, thus then ultimately energy? (OK, that last one is pushing it
:) ).

I'm sure there are reasons for the above. And I guess I'm curious
about them. Thanks, --DD

ddevienne=> create table foo (b3 bit(3), i2 int2, i4 int4, i8 int8, c char);
CREATE TABLE
ddevienne=> insert into foo values (b'101', 1002, 1004, 1008, 'C');
INSERT 0 1
ddevienne=> select pg_column_size(b3), pg_column_size(i2),
pg_column_size(i4), pg_column_size(i8), pg_column_size(c) from foo;
 pg_column_size | pg_column_size | pg_column_size | pg_column_size |
pg_column_size
----------------+----------------+----------------+----------------+----------------
              6 |              2 |              4 |              8 |
           2
(1 row)



Re: byte-size of column values

From
"David G. Johnston"
Date:
On Tue, Oct 18, 2022 at 8:53 AM Dominique Devienne <ddevienne@gmail.com> wrote:
Hi. I'm surprised by the result for bit(3) and char, when calling
pg_column_size().

Why 6, instead of 1? The doc does mention 5-8 bytes overhead, but I
expected those for varying bit, not fixed-sized bit typed values. How
come?


The base type is what matters, if the length of the actual type is a parameter (the (n) part) the underlying type must be variable, and thus has a component that says how long the actually stored value is.
 
How does one store as compactedly as possible several small enums, on
millions of rows?

int2
David J.

p.s., pretend char doesn't even exist.

Re: byte-size of column values

From
Tom Lane
Date:
Dominique Devienne <ddevienne@gmail.com> writes:
> Hi. I'm surprised by the result for bit(3) and char, when calling
> pg_column_size().

> Why 6, instead of 1? The doc does mention 5-8 bytes overhead, but I
> expected those for varying bit, not fixed-sized bit typed values. How
> come?

Your expectation is incorrect.  Postgres always treats these types
as variable-length, whether or not the column has a length constraint.
Thus, there's always a header to store the actual length.  That can
be either 1 or 4 bytes (I think the doc you are looking at might be
a little out of date on that point).

Because of the popularity of variable-width character encodings,
a column declared as N characters wide isn't necessarily a fixed
number of bytes wide, making it a lot less useful than you might
think to have optimizations for fixed-width storage.  Between that
and the fact that most Postgres developers regard CHAR(N) as an
obsolete hangover from the days of punched cards, no such
optimizations have been attempted.

            regards, tom lane



Re: byte-size of column values

From
Dominique Devienne
Date:
On Tue, Oct 18, 2022 at 6:04 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> On Tue, Oct 18, 2022 at 8:53 AM Dominique Devienne <ddevienne@gmail.com> wrote:
>> I'm surprised by the result for bit(3) and char, when calling pg_column_size().

> The base type is what matters, if the length of the actual type is a parameter
> (the (n) part) the underlying type must be variable.

Thanks. Interesting. Didn't know (n)-suffixed "fixed-length" types
where always based on variable-size ones.

>> How does one store as compactedly as possible several small enums
> int2

OK, I see. Thanks again.

> p.s., pretend char doesn't even exist.

I realize that now. Wasn't obvious to me, despite the warning in the doc.



Re: byte-size of column values

From
Dominique Devienne
Date:
On Tue, Oct 18, 2022 at 6:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dominique Devienne <ddevienne@gmail.com> writes:
> > I'm surprised by the result for bit(3) and char
> > The doc does mention 5-8 bytes overhead, but I expected
> > those for varying bit, not fixed-sized bit typed values.
>
> Your expectation is incorrect.  Postgres always treats these types
> as variable-length, whether or not the column has a length constraint.

OK. Still, wasn't such a stretch to assume that, no?
Now I know better, thanks to you and David.
I'm not sure the doc on types talks about that either.
Didn't see it for sure (but could still be there and I missed it).

> Thus, there's always a header to store the actual length.  That can
> be either 1 or 4 bytes (I think the doc you are looking at might be
> a little out of date on that point).

Even the doc on v15 (or devel) still says 5-to-8.
https://www.postgresql.org/docs/15/datatype-bit.html

And on my v12, that's born out from my experimentation.
Being used to SQLite using varints,
I'd have expected fewer overhead bytes for the size, like your 1-to-4.

> Because of the popularity of variable-width character encodings,
> a column declared as N characters wide isn't necessarily a fixed
> number of bytes wide, making it a lot less useful than you might
> think to have optimizations for fixed-width storage.  Between that
> and the fact that most Postgres developers regard CHAR(N) as an
> obsolete hangover from the days of punched cards, no such
> optimizations have been attempted.

Thanks for the background. I definitely appreciate PostgreSQL's large 1GB
limit on text and bytea columns, coming from Oracle's tiny 4K one, which
created us all kind of headaches.

For kicks, I looked at bpchar, blank-padded-char, and its extra byte, which
I assume is again some kind of length, there at least the overhead is small
compared to bit(n). 1 bytes versus 5 bytes is no small difference.

ddevienne=> create table bar (bpc bpchar(16));
CREATE TABLE
ddevienne=> insert into bar values ('foo'), ('bar baz');
INSERT 0 2
ddevienne=> select length(bpc), pg_column_size(bpc), '<'||bpc||'>' from bar;
 length | pg_column_size | ?column?
--------+----------------+-----------
      3 |             17 | <foo>
      7 |             17 | <bar baz>
(2 rows)



Re: byte-size of column values

From
Tom Lane
Date:
Dominique Devienne <ddevienne@gmail.com> writes:
> On Tue, Oct 18, 2022 at 6:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thus, there's always a header to store the actual length.  That can
>> be either 1 or 4 bytes (I think the doc you are looking at might be
>> a little out of date on that point).

> Even the doc on v15 (or devel) still says 5-to-8.
> https://www.postgresql.org/docs/15/datatype-bit.html

Oh, if you're thinking about bit rather than string types,
that's correct: in addition to the overall-field-width
header (1 or 4 bytes) there's a length-in-bits field
(always 4 bytes), then the bits themselves.  The overall
width is clearly insufficient to know how many bits are
valid in the last byte, so some additional storage is
necessary.  This could have been done in a more compact
way no doubt, but we aren't likely to change it now.

            regards, tom lane