[ Not subscribed, please keep me in the CC list ]
Is there a standard idiom for encoding small variable length data in an array? I wrote the varint extension[1] that
encodesdata using a variable width encoding scheme[2] for signed and unsigned integers[3]. Right now the extension is
mostlyof use in skinny tables that have at least 4-5 columns, all of which are of INT or INT8. If you have only 5
columnsof INT8, you can save ~50% of your table space.
But, to get larger savings, it's required to bypass the tuple overhead and aggregating data in to an array (i.e.
aggregateall time series data for a 5min window of time in to a single varuint[]).
The problem with that being, each varint takes 8 bytes in an array because of padding and alignment. Is there a way to
preventthat, or, more realistically, are there standard ways of encoding this data in to a BYTEA and then manually
scanningand unpacking the data? Random access in to the array isn't a concern. I was thinking about adding a BYTEA to
varint[]cast, but am fishing for a better idea.
Any hints or thoughts? Thanks in advance. -sc
[1] https://github.com/sean-/postgresql-varint
[2] SELECT varint64, pg_column_size(varint64) FROM varint64_table ORDER BY varint64 ASC;
varint64 | pg_column_size
----------------------+-----------------
-4611686018427387905 | 11
-4611686018427387904 | 10
-36028797018963969 | 10
-36028797018963968 | 9
-281474976710657 | 9
-281474976710656 | 8
-2199023255553 | 8
-2199023255552 | 7
-17179869185 | 7
-17179869184 | 6
-134217729 | 6
-134217728 | 5
-1048577 | 5
-1048576 | 4
-8193 | 4
-8192 | 3
-65 | 3
-64 | 2
-1 | 2
0 | 2
1 | 2
63 | 2
64 | 3
8191 | 3
8192 | 4
1048575 | 4
1048576 | 5
134217727 | 5
134217728 | 6
17179869183 | 6
17179869184 | 7
2199023255551 | 7
2199023255552 | 8
281474976710655 | 8
281474976710656 | 9
36028797018963967 | 9
36028797018963968 | 10
4611686018427387903 | 10
4611686018427387904 | 11
(39 rows)
SELECT varuint64, pg_column_size(varint64) FROM varuint64_table ORDER BY varint64 ASC;
varuint64 | pg_column_size
---------------------+-----------------
0 | 2
127 | 2
128 | 3
16383 | 3
16384 | 4
2097151 | 4
2097152 | 5
268435455 | 5
268435456 | 6
34359738367 | 6
34359738368 | 7
4398046511103 | 7
4398046511104 | 8
562949953421311 | 8
562949953421312 | 9
72057594037927935 | 9
72057594037927936 | 10
9223372036854775807 | 10
[3] I know the unsigned int only goes up to 2^^63 atm, it will go to 2^^64 once I get around to setting up a test
methodology.Using INT8 internally was too convenient at the time.
--
Sean Chittenden
sean@chittenden.org