Re: mcvstats serialization code is still shy of a load - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: mcvstats serialization code is still shy of a load
Date
Msg-id 20190629141312.bo6armpsdf65gy6m@development
Whole thread Raw
In response to Re: mcvstats serialization code is still shy of a load  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: mcvstats serialization code is still shy of a load
List pgsql-hackers
On Thu, Jun 27, 2019 at 01:26:32PM +0200, Tomas Vondra wrote:
>On Thu, Jun 27, 2019 at 12:04:30AM -0400, Tom Lane wrote:
>>Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>>>OK. Attached is a patch ditching the alignment in serialized data. I've
>>>ditched the macros to access parts of serialized data, and everything
>>>gets copied.
>>
>>I lack energy to actually read this patch right now, and I don't currently
>>have an opinion about whether it's worth another catversion bump to fix
>>this stuff in v12.  But I did test the patch, and I can confirm it gets
>>through the core regression tests on hppa (both gaur's host environment
>>with gcc 3.4.6, and the OpenBSD installation with gcc 4.2.1).
>>
>
>Thanks for running it through regression tests, that alone is a very
>useful piece of information for me.
>
>As for the catversion bump - I'd probably vote to do it. Not just because
>of this serialization stuff, but to fix the pg_mcv_list_items function.
>It's not something I'm very enthusiastic about (kinda embarassed about it,
>really), but it seems better than shipping something that we'll need to
>rework in PG13.
>

Attached is a slightly improved version of the serialization patch. The
main difference is that when serializing varlena values, the previous
patch version stored

    length (uint32) + full varlena (incl. the header)

which is kinda redundant, because the varlena stores the length too. So
now it only stores the length + data, without the varlena header. I
don't think there's a better way to store varlena values without
enforcing alignment (which is what happens in current master).

There's one additional change I failed to mention before - I had to add
another field to DimensionInfo, tracking how much space will be needed
for deserialized data. This is needed because the deserialization
allocates the whole MCV as a single chunk of memory, to reduce palloc
overhead. It could parse the data twice (first to determine the space,
then to actually parse it), this allows doing just a single pass. Which
seems useful for large MCV lists, but maybe it's not worth it?

Barring objections I'll commit this together with the pg_mcv_list_items
fix, posted in a separate thread. Of course, this requires catversion
bump - an alternative would be to keep enforcing the alignment, but
tweak the macros to work on all platforms without SIGBUS.

Considering how troublesome this serialiation part of the patch turner
out to be, I'm not really sure by anything at this point. So I'd welcome
thoughts about the proposed changes.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: TM format can mix encodings in to_char()
Next
From: Tomas Vondra
Date:
Subject: Re: Avoid full GIN index scan when possible