Thread: Re: Variable length varlena headers redux

Re: Variable length varlena headers redux

From: Bruce Momjian
Uh, I thought the approach was to create type-specific in/out functions,
and add casting so that every time they were referenced, they would expand
to a varlena structure in memory.

---------------------------------------------------------------------------

Gregory Stark wrote:
> 
> I've been looking at this again and had a few conversations about it. This may
> be easier than I had originally thought but there's one major issue that's
> bugging me. Do you see any way to avoid having every user function everywhere
> use a new macro api instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP?
> 
> The two approaches I see are either 
> 
> a) To have two sets of macros, one of which, VARATT_DATA and VARATT_SIZEP,
> is for constructing new tuples and behaves exactly as it does now. So you
> always construct a four-byte-header datum. Then in heap_form*tuple we check
> whether we can use a shorter header and convert. VARDATA/VARSIZE would be
> for looking at existing datums and would interpret the header bits.
> 
> This seems very fragile, since one stray call site using VARATT_DATA to find
> the data in an existing datum would cause random bugs that occur only rarely,
> in certain circumstances. It would even work as long as the size is filled in
> with VARATT_SIZEP first, which it usually is, but fail if someone changes the
> order of the statements.
> 
> or 
> 
> b) Throw away VARATT_DATA and VARATT_SIZEP and make all user functions
> everywhere change over to a new macro api. That seems like a pretty big
> burden. It's safer, but it means every contrib module would have to be
> updated and so on.
> 
> I'm hoping I'm missing something and there's a way to do this without breaking
> the api for every user function.
> 
> 

-- Start of included mail
> From: Tom Lane <tgl@sss.pgh.pa.us>

> To: Gregory Stark <stark@enterprisedb.com>
> cc: Gregory Stark <gsstark@mit.edu>, Bruce Momjian <bruce@momjian.us>, 
>             Peter Eisentraut <peter_e@gmx.net>, pgsql-hackers@postgresql.org, 
>             Martijn van Oosterhout <kleptog@svana.org>
> Subject: Re: [HACKERS] Fixed length data types issue 
> Date: Mon, 11 Sep 2006 13:15:43 -0400
> Lines: 64
> Xref: stark.xeocode.com work.enterprisedb:683

> Gregory Stark <stark@enterprisedb.com> writes:
> > In any case it seems a bit backwards to me. Wouldn't it be better to
> > preserve bits in the case of short length words where they're precious
> > rather than long ones? If we make 0xxxxxxx the 1-byte case it means ...
> 
> Well, I don't find that real persuasive: you're saying that it's
> important to have a 1-byte not 2-byte header for datums between 64 and
> 127 bytes long.  Which is by definition less than a 2% savings for those
> values.  I think it's more important to pick bitpatterns that reduce
> the number of cases heap_deform_tuple has to think about while decoding
> the length of a field --- every "if" in that inner loop is expensive.
> 
> I realized this morning that if we are going to preserve the rule that
> 4-byte-header and compressed-header cases can be distinguished from the
> data alone, there is no reason to be very worried about whether the
> 2-byte cases can represent the maximal length of an in-line datum.
> If you want to do 16K inline (and your page is big enough for that)
> you can just fall back to the 4-byte-header case.  So there's no real
> disadvantage if the 2-byte headers can only go up to 4K or so.  This
> gives us some more flexibility in the bitpattern choices.
> 
> Another thought that occurred to me is that if we preserve the
> convention that a length word's value includes itself, then for a
> 1-byte header the bit pattern 10000000 is meaningless --- the count
> has to be at least 1.  So one trick we could play is to take over
> this value as the signal for "toast pointer follows", with the
> assumption that the tuple-decoder code knows a-priori how big a
> toast pointer is.  I am not real enamored of this, because it certainly
> adds one case to the inner heap_deform_tuple loop and it'll give us
> problems if we ever want more than one kind of toast pointer.  But
> it's a possibility.
> 
> Anyway, a couple of encodings that I'm thinking about now involve
> limiting uncompressed data to 1G (same as now), so that we can play
> with the first 2 bits instead of just 1:
> 
> 00xxxxxx    4-byte length word, aligned, uncompressed data (up to 1G)
> 01xxxxxx    4-byte length word, aligned, compressed data (up to 1G)
> 100xxxxx    1-byte length word, unaligned, TOAST pointer
> 1010xxxx    2-byte length word, unaligned, uncompressed data (up to 4K)
> 1011xxxx    2-byte length word, unaligned, compressed data (up to 4K)
> 11xxxxxx    1-byte length word, unaligned, uncompressed data (up to 63b)
> 
> or
> 
> 00xxxxxx    4-byte length word, aligned, uncompressed data (up to 1G)
> 010xxxxx    2-byte length word, unaligned, uncompressed data (up to 8K)
> 011xxxxx    2-byte length word, unaligned, compressed data (up to 8K)
> 10000000    1-byte length word, unaligned, TOAST pointer
> 1xxxxxxx    1-byte length word, unaligned, uncompressed data (up to 127b)
>         (xxxxxxx not all zero)
> 
> This second choice allows longer datums in both the 1-byte and 2-byte
> header formats, but it hardwires the length of a TOAST pointer and
> requires four cases to be distinguished in the inner loop; the first
> choice only requires three cases, because TOAST pointer and 1-byte
> header can be handled by the same rule "length is low 6 bits of byte".
> The second choice also loses the ability to store in-line compressed
> data above 8K, but that's probably an insignificant loss.
> 
> There's more than one way to do it ...
> 
>             regards, tom lane
> 
-- End of included mail.

> 
> 
> -- 
>   Gregory Stark
>   EnterpriseDB          http://www.enterprisedb.com
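As an editorial aside, the three-test structure of Tom's first encoding table can be sketched in C. The names below are illustrative only, not actual PostgreSQL symbols, and the classifier simply follows the bitpatterns in the table quoted above:

```c
#include <stdint.h>

/* Hypothetical classifier for the first proposed encoding; names are
 * illustrative, not actual PostgreSQL macros or types. */
typedef enum
{
    VH_4BYTE_UNCOMPRESSED,      /* 00xxxxxx */
    VH_4BYTE_COMPRESSED,        /* 01xxxxxx */
    VH_TOAST_POINTER,           /* 100xxxxx */
    VH_2BYTE_UNCOMPRESSED,      /* 1010xxxx */
    VH_2BYTE_COMPRESSED,        /* 1011xxxx */
    VH_1BYTE                    /* 11xxxxxx */
} VarlenaHeaderKind;

static VarlenaHeaderKind
classify_first_byte(uint8_t b)
{
    if ((b & 0x80) == 0)        /* 0xxxxxxx: 4-byte length word */
        return (b & 0x40) ? VH_4BYTE_COMPRESSED : VH_4BYTE_UNCOMPRESSED;
    if ((b & 0xC0) == 0xC0)     /* 11xxxxxx: 1-byte length word */
        return VH_1BYTE;
    if ((b & 0x20) == 0)        /* 100xxxxx: TOAST pointer */
        return VH_TOAST_POINTER;
    return (b & 0x10) ? VH_2BYTE_COMPRESSED : VH_2BYTE_UNCOMPRESSED;
}
```

Note how the top two bits alone settle the 4-byte and 1-byte cases; that is what keeps the number of tests in an inner heap_deform_tuple-style loop small, which is Tom's stated goal.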

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Variable length varlena headers redux

From: Bruce Momjian
Bruce Momjian wrote:
> 
> Uh, I thought the approach was to create type-specific in/out functions,
> and add casting so that every time they were referenced, they would expand
> to a varlena structure in memory.

Oh, one more thing.  You are going to need to teach the code that walks
through a tuple's attributes about the short header types.  I think you
should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
code there too.  (As an example, see the macro att_addlength().)
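A sketch of what that branch might look like, alongside the existing typlen conventions (-1 for varlena, -2 for cstring). The helper name and the short-header encoding below are made up for illustration, and alignment handling is omitted:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute-walking helper in the spirit of att_addlength():
 * advance an offset past one attribute, branching on pg_type.typlen.
 * typlen = -3 is the proposed short-varlena code; the 1-byte header here
 * (low 7 bits = total length including the header byte) is an assumed
 * encoding, and alignment is ignored to keep the sketch small. */
static size_t
att_advance(size_t off, int16_t typlen, const uint8_t *attp)
{
    if (typlen > 0)
        return off + (size_t) typlen;           /* fixed-length type */
    if (typlen == -2)
        return off + strlen((const char *) attp) + 1;   /* cstring */
    if (typlen == -1)                           /* ordinary varlena */
    {
        uint32_t len;

        memcpy(&len, attp, sizeof(len));        /* length word includes itself */
        return off + len;
    }
    return off + (attp[0] & 0x7F);              /* typlen == -3: short header */
}
```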

I know it is kind of odd to have a data type that is only used on disk,
and not in memory, but I see this as a baby varlena type, used only to
store and get varlena values using less disk space.

---------------------------------------------------------------------------
> [...]

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Variable length varlena headers redux

From: Greg Stark
Bruce Momjian <bruce@momjian.us> writes:

> Bruce Momjian wrote:
> > 
> > Uh, I thought the approach was to create type-specific in/out functions,
> > and add casting so that every time they were referenced, they would expand
> > to a varlena structure in memory.

Are you talking about actual casts? Because that would lead to all kinds of
strange places with indexes and function lookups and so on. Or are you just
talking about code in the macro api to datum?

> Oh, one more thing.  You are going to need to teach the code that walks
> through a tuple's attributes about the short header types.  I think you
> should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
> code there too.  (As an example, see the macro att_addlength().)

I thought of doing this. It would let us, for example, treat text/varchar,
bpchar, and numeric specially but leave other data types unchanged.

That does help somewhat, but unfortunately text is the problem case. There's
tons of code that generates text without using textin; all of pgcrypto, for
example.

> I know it is kind of odd to have a data type that is only used on disk,
> and not in memory, but I see this as a baby varlena type, used only to
> store and get varlena values using less disk space.

I was leaning toward generating the short varlena headers primarily in
heap_form*tuple and just having the datatype specific code generate 4-byte
headers much as you describe.

However, that doesn't get us away from making VARDATA/VARSIZE aware of the new
headers, since heap_deform*tuple and the other entry points which extract
individual attributes return pointers to the datum inside the tuple. They can't
expand the header to a 4-byte header on the fly.

I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
be calling it on their arguments. However there are other cases than just
arguments. Other functions might call, say, text_concat() and then call
VARDATA() on the result. 

Even if we only ever generate short headers on heap_form*tuple and always
expand them on DETOAST we could have code that passes around tuples that it
"knows" are entirely in memory and therefore not toasted. I'm thinking of
plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
but it still seems sort of fragile to me.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com



Re: Variable length varlena headers redux

From: Bruce Momjian
Greg Stark wrote:
> 
> Bruce Momjian <bruce@momjian.us> writes:
> 
> > Bruce Momjian wrote:
> > > 
> > > Uh, I thought the approach was to create type-specific in/out functions,
> > > and add casting so that every time they were referenced, they would expand
> > > to a varlena structure in memory.
> 
> Are you talking about actual casts? Because that would lead to all kinds of
> strange places with indexes and function lookups and so on. Or are you just
> talking about code in the macro api to datum?

I am thinking of auto-casts, sort of like how varchar, char, and text
are all internally treated as interchangeable.

> > Oh, one more thing.  You are going to need to teach the code that walks
> > through a tuple's attributes about the short header types.  I think you
> > should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro
> > code there too.  (As an example, see the macro att_addlength().)
> 
> I thought of doing this. It would let us, for example, treat text/varchar,
> bpchar, and numeric specially but leave other data types unchanged.
> 
> That does help somewhat, but unfortunately text is the problem case. There's
> tons of code that generates text without using textin; all of pgcrypto, for
> example.

Well, TEXT can't use short headers.

> > I know it is kind of odd to have a data type that is only used on disk,
> > and not in memory, but I see this as a baby varlena type, used only to
> > store and get varlena values using less disk space.
> 
> I was leaning toward generating the short varlena headers primarily in
> heap_form*tuple and just having the datatype specific code generate 4-byte
> headers much as you describe.

Yep.

> However, that doesn't get us away from making VARDATA/VARSIZE aware of the new
> headers, since heap_deform*tuple and the other entry points which extract
> individual attributes return pointers to the datum inside the tuple. They can't
> expand the header to a 4-byte header on the fly.

Yep, you are going to have to special-case those call points to test
typlen and use your short macros there if == -3.

> I thought of doing it in DETOAST_DATUM on the theory that everyone's going to
> be calling it on their arguments. However there are other cases than just
> arguments. Other functions might call, say, text_concat() and then call
> VARDATA() on the result. 

Right, I think all the in-memory stuff has to be varlena.

> Even if we only ever generate short headers on heap_form*tuple and always
> expand them on DETOAST we could have code that passes around tuples that it
> "knows" are entirely in memory and therefore not toasted. I'm thinking of
> plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour
> but it still seems sort of fragile to me.

Yea, we might need some cleanup, but the cleanup is just to do things
properly.  I am unclear on the case you are describing.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Variable length varlena headers redux

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> Bruce Momjian <bruce@momjian.us> writes:
>> I know it is kind of odd to have a data type that is only used on disk,
>> and not in memory, but I see this as a baby varlena type, used only to
>> store and get varlena values using less disk space.

> I was leaning toward generating the short varlena headers primarily in
> heap_form*tuple and just having the datatype specific code generate 4-byte
> headers much as you describe.

I thought we had a solution for all this, namely to make the short-form
headers be essentially a TOAST-compressed representation.  The format
with 4-byte headers is still legal but just not compressed.  Anyone who
fails to detoast an input argument is already broken, so there's no code
compatibility hit taken.
        regards, tom lane


Re: Variable length varlena headers redux

From: Gregory Stark
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> [...]
>
> I thought we had a solution for all this, namely to make the short-form
> headers be essentially a TOAST-compressed representation.  The format
> with 4-byte headers is still legal but just not compressed.  Anyone who
> fails to detoast an input argument is already broken, so there's no code
> compatibility hit taken.

It's not just input arguments though. A function could call
DirectFunctionCall* and rightfully expect the return value not to need
detoasting.

I suppose this leads me to *only* generate short headers at heap_form*tuple
time. Then DirectFunctionCall isn't relevant and most of the user code is
perfectly safe.

There could still be cases where a heap tuple is passed around in pl_exec.c or
somewhere, but if it's subsequently deformed, whoever looks at it hopefully
wouldn't be too surprised that going through pg_detoast_datum is mandatory.
That will happen anyway as long as they use the DatumGetFoo macros.

It does mean that anyone going through a heap_form*tuple/heap_deform*tuple
cycle may generate more copies and memory allocations than they expected.


-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Variable length varlena headers redux

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> [...]
> 
> I thought we had a solution for all this, namely to make the short-form
> headers be essentially a TOAST-compressed representation.  The format
> with 4-byte headers is still legal but just not compressed.  Anyone who
> fails to detoast an input argument is already broken, so there's no code
> compatibility hit taken.

Uh. So I don't see how to make this work on a little-endian machine. If the
leading bits are 0 we don't know whether they're toast flags or bits of the
least significant byte of a longer length.

If we store all lengths in network byte order that problem goes away but then
user code that does "VARATT_SIZEP(datum) = len" is incorrect.

If we declare in-memory format to be host byte order and on-disk format to be
network byte order then every single varlena datum needs to be copied when
heap_deform*tuple runs.

If we only do this for a new kind of varlena then only text/varchar/
char/numeric datums would need to be copied but that's still a lot.

-- 
greg



Re: Variable length varlena headers redux

From: Bruce Momjian
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
> 
> > [...]
> > 
> > I thought we had a solution for all this, namely to make the short-form
> > headers be essentially a TOAST-compressed representation.  The format
> > with 4-byte headers is still legal but just not compressed.  Anyone who
> > fails to detoast an input argument is already broken, so there's no code
> > compatibility hit taken.
> 
> Uh. So I don't see how to make this work on a little-endian machine. If the
> leading bits are 0 we don't know whether they're toast flags or bits of the
> least significant byte of a longer length.
> 
> If we store all lengths in network byte order that problem goes away but then
> user code that does "VARATT_SIZEP(datum) = len" is incorrect.
> 
> If we declare in-memory format to be host byte order and on-disk format to be
> network byte order then every single varlena datum needs to be copied when
> heap_deform*tuple runs.
> 
> If we only do this for a new kind of varlena then only text/varchar/
> char/numeric datums would need to be copied but that's still a lot.

I wonder if we need to reorder the TOAST structure to have the bits we
need at the start of the structure so we can be sure they are first.

For example, what if we split varattrib.va_header, which is an int32 now,
into four 'char' fields, and just reassemble it in the toast code?  That
would be pretty localized.
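A minimal sketch of that split, with illustrative names (the real varattrib layout differs): four explicit bytes in declaration order mean the flag bits always land in the first byte no matter the host's endianness, and the toast code reassembles the value explicitly.

```c
#include <stdint.h>

/* Hypothetical header layout: four explicit bytes instead of an int32, so
 * byte order is fixed by the struct, not by the host CPU.  Names are
 * illustrative, not the actual varattrib definition. */
typedef struct
{
    uint8_t b[4];               /* flag bits live in b[0] on every platform */
    char    data[1];            /* payload follows */
} FourByteHeader;

/* Reassemble the header as a big-endian value, as the toast code would. */
static uint32_t
header_value(const FourByteHeader *v)
{
    return ((uint32_t) v->b[0] << 24) | ((uint32_t) v->b[1] << 16) |
           ((uint32_t) v->b[2] << 8) | (uint32_t) v->b[3];
}
```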

I had forgotten about hooking into the TOAST system, but since we are
going to be "expanding" the headers of these types when they get into
memory, it does make sense.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Variable length varlena headers redux

From: Gregory Stark
"Bruce Momjian" <bruce@momjian.us> writes:

>> Uh. So I don't see how to make this work on a little-endian machine. If the
>> leading bits are 0 we don't know whether they're toast flags or bits of the
>> least significant byte of a longer length.
>> ...
> I had forgotten about hooking into the TOAST system, but since we are
> going to be "expanding" the headers of these types when they get into
> memory, it does make sense.

Ok, I guess this can work if we guarantee that in-memory datums always have
4-byte headers. That means that heap_deform*tuple always copies the datum if
it's this type of datum.

That means we never have pointers to shortvarlena datums inside tuples. I'm
not sure if there are parts of the system that assume that the datums they get
out of heap_deform*tuple are pointers into the tuple or not. I haven't come
across any in my travels thus far.

That seems like an awful lot of copying and pallocs that aren't there
currently though. And it'll make us reluctant to change over frequently used
data types like text -- which are precisely the ones that would gain us the
most.

It seems to me that it might be better to change to storing varlena lengths in
network byte order instead. That way we can dedicate the leading bits to toast
flags and read more bytes as necessary.

I think the way to do this would be to throw out the VARATT_SIZEP macro and
replace it with VARATT_SET_SIZE(datum,size). VARSIZE would just call ntohl (or
ntohs if the leading bits on the first byte indicated...)
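A sketch of what such a pair could look like. The flag assignments here (top two bits of the first byte both set marking a 1-byte form, top two bits of the length word reserved) are assumptions for illustration, not a real proposal:

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* htonl/ntohl */

/* Illustrative replacements for VARATT_SIZEP-style access: the length word
 * is kept in network byte order, so whatever flag bits we define always
 * occupy the first byte read, regardless of host endianness. */
static void
varatt_set_size(uint8_t *datum, uint32_t payload_len)
{
    uint32_t n = htonl(payload_len & 0x3FFFFFFF);   /* top 2 bits reserved */

    memcpy(datum, &n, sizeof(n));
}

static uint32_t
varatt_size(const uint8_t *datum)
{
    uint32_t n;

    if ((datum[0] & 0xC0) == 0xC0)      /* hypothetical 1-byte short form */
        return datum[0] & 0x3F;
    memcpy(&n, datum, sizeof(n));
    return ntohl(n) & 0x3FFFFFFF;       /* mask off the flag bits */
}
```

The point of the setter/getter pair is exactly the one made above: code that assigned through VARATT_SIZEP directly can't compile any more, so every unfixed call site is caught at build time.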

That does mean touching every piece of data type code. And invalidating every
piece of user code. :( At least it's fairly mechanical. And it has the
advantage of not being at all fragile -- unfixed code won't even compile.

While we're at it I would suggest taking out the VARHDRSZ offset and just
storing the size of the data payload. The constant VARHDRSZ offset no longer
makes sense, since it won't actually be the size of the varlena header anyway.
And predicting the actual size of the varlena header will be annoying and
bug-prone, since it depends on the resulting value you calculate.

(Incidentally, this would actually make EnterpriseDB somewhat sad since we
want pg_migrator to work for 8.3. But it wouldn't be out of the realm of
possibility to go through the database and switch varlena headers to network
byte order. There's no need to compress them, just leave the 4-byte format in
place with the bytes swapped around.)

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Variable length varlena headers redux

From: Tom Lane
Gregory Stark <stark@enterprisedb.com> writes:
> That seems like an awful lot of copying and pallocs that aren't there
> currently though. And it'll make us reluctant to change over frequently used
> data types like text -- which are precisely the ones that would gain us the
> most.

> It seems to me that it might be better to change to storing varlena lengths in
> network byte order instead. That way we can dedicate the leading bits to toast
> flags and read more bytes as necessary.

This'll add its own overhead ... but probably less than pallocs and
data-copying would.  And I agree we can find (pretty much) all the
places that need changing by the expedient of deliberately renaming
the macros and struct fields.

One problem I foresee is that I think you are about to propose that
VARDATA depend on the length already having been inserted, which it
does not now; and simple renamings won't detect ordering errors for
that.  Also I believe there are places that over-allocate memory,
fill in the data, and only then set the length; something you will
not easily be able to change.  It might work if we assume that
*creation* of a varlena value always produces the 4-byte-header form
and only reading of a value that might be on disk needs to cope with
the short-header forms.  However this seems to require two different
forms of VARDATA depending on whether one is preparing or reading a
value.  Ugh.
        regards, tom lane


Re: Variable length varlena headers redux

From: Bruce Momjian
Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > That seems like an awful lot of copying and pallocs that aren't there
> > currently though. And it'll make us reluctant to change over frequently used
> > data types like text -- which are precisely the ones that would gain us the
> > most.
> 
> > It seems to me that it might be better to change to storing varlena lengths in
> > network byte order instead. That way we can dedicate the leading bits to toast
> > flags and read more bytes as necessary.
> 
> This'll add its own overhead ... but probably less than pallocs and
> data-copying would.  And I agree we can find (pretty much) all the
> places that need changing by the expedient of deliberately renaming
> the macros and struct fields.

I think we should go with the pallocs and see how it performs.  That is
certainly going to be easier to do, and we can test it pretty easily.

One palloc optimization idea would be to split up the representation so that
the length is stored separately from the data in memory: we could use an
int32 for the length and point into the shared buffer for the data.
However, I don't think our macros can handle that, so it might be a
non-starter.

However, I think we should find out if the palloc is a problem before
avoiding it.

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Variable length varlena headers redux

From: Bruce Momjian
Bruce Momjian wrote:
> Tom Lane wrote:
> > [...]
> > 
> > This'll add its own overhead ... but probably less than pallocs and
> > data-copying would.  And I agree we can find (pretty much) all the
> > places that need changing by the expedient of deliberately renaming
> > the macros and struct fields.
> 
> I think we should go with the pallocs and see how it performs.  That is
> certainly going to be easier to do, and we can test it pretty easily.
> 
> One palloc optimization idea would be to split up the representation so that
> the length is stored separately from the data in memory: we could use an
> int32 for the length and point into the shared buffer for the data.
> However, I don't think our macros can handle that, so it might be a
> non-starter.
> 
> However, I think we should find out if the palloc is a problem before
> avoiding it.

Another idea for reducing pallocs: we know every short column is at most
128 + 4 = 132 bytes, so we could allocate a 132-byte buffer for every short
column in the scan and just re-use the buffer for every row.
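That reuse could look roughly like this; the slot type, the function, and the short-header encoding (low 7 bits = total length including the header byte) are all assumptions for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the buffer-reuse idea: one fixed slot per short column in a
 * scan, re-filled for each row instead of palloc'ing a fresh copy each
 * time.  Sizes, names, and the short-header encoding are illustrative. */
enum { SHORT_COL_MAX = 132 };          /* 128 payload + 4-byte expanded header */

typedef struct
{
    uint8_t buf[SHORT_COL_MAX];        /* reused across rows */
} ShortColumnSlot;

/* Expand a short datum into the slot as an ordinary 4-byte-header varlena. */
static uint8_t *
expand_into_slot(ShortColumnSlot *slot, const uint8_t *shortdatum)
{
    uint32_t payload = (shortdatum[0] & 0x7F) - 1;  /* length includes itself */
    uint32_t total = payload + 4;                   /* 4-byte-header form */

    memcpy(slot->buf, &total, sizeof(total));       /* host-order length word */
    memcpy(slot->buf + 4, shortdatum + 1, payload);
    return slot->buf;
}
```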

-- 
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +