Re: [HACKERS] Custom compression methods - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: [HACKERS] Custom compression methods
Date
Msg-id CAFiTN-v9Cs1MORnp-3bGZ5QBwr5v3VarSvfaDizHi1acXES5xQ@mail.gmail.com
Whole thread Raw
In response to Re: [HACKERS] Custom compression methods  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Custom compression methods
List pgsql-hackers
On Wed, Feb 3, 2021 at 2:07 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Even more review comments, still looking mostly at 0001:
>
> If there's a reason why parallel_schedule is arranging to run the
> compression test in parallel with nothing else, the comment in that
> file should explain the reason. If there isn't, it should be added to
> a parallel group that doesn't have the maximum number of tests yet,
> probably the last such group in the file.
>
> serial_schedule should add the test in a position that roughly
> corresponds to where it appears in parallel_schedule.

Done

> I believe it's relatively standard practice to put variable
> declarations at the top of the file. compress_lz4.c and
> compress_pglz.c instead put those declarations nearer to the point of
> use.

Do you mean pglz_compress_methods and lz4_compress_methods ? I
followed that style from
heapam_handler.c.  If you think that doesn't look good then I can move it up.

> compressamapi.c has an awful lot of #include directives for the code
> it actually contains. I believe that we should cut that down to what
> is required by 0001, and other patches can add more later as required.
> In fact, it's tempting to just get rid of this .c file altogether and
> make the two functions it contains static inline functions in the
> header, but I'm not 100% sure that's a good idea.

I think it looks better to move them to compressamapi.h so done that.

> The copyright dates in a number of the file headers are out of date.

Fixed

> binary_upgrade_next_pg_am_oid and the related changes to
> CreateAccessMethod don't belong in 0001, because it doesn't support
> non-built-in compression methods. These changes and the related
> pg_dump change should be moved to the patch that adds support for
> that.

Fixed

> The comments added to dumpTableSchema() say that "compression is
> assigned by ALTER" but don't give a reason. I think they should. I
> don't know how much they need to explain about what the code does, but
> they definitely need to explain why it does it. Also, isn't this bad?
> If we create the column with the wrong compression setting initially
> and then ALTER it, we have to rewrite the table. If it's empty, that's
> cheap, but it'd still be better not to do it at all.

Yeah, actually that part should go in 0003 patch where we implement
the custom compression method.
in that patch we need to alter and set because we want to keep the
preserved method as well
So I will add it there

> I'm not sure it's a good idea for dumpTableSchema() to leave out
> specifying the compression method if it happens to be pglz. I think we
> definitely shouldn't do it in binary-upgrade mode. What if we changed
> the default in a future release? For that matter, even 0002 could make
> the current approach unsafe.... I think, anyway.

Fixed


> The changes to pg_dump.h look like they haven't had a visit from
> pgindent. You should probably try to do that for the whole patch,
> though it's a bit annoying since you'll have to manually remove
> unrelated changes to the same files that are being modified by the
> patch. Also, why the extra blank line here?

Fixed, ran pgindent for other files as well.

> GetAttributeCompression() is hard to understand. I suggest changing
> the comment to "resolve column compression specification to an OID"
> and somehow rejigger the code so that you aren't using one not-NULL
> test and one NULL test on the same variable. Like maybe change the
> first part to if (!IsStorageCompressible(typstorage)) { if
> (compression == NULL) return InvalidOid; ereport(ERROR, ...); }

Done

> It puzzles me that CompareCompressionMethodAndDecompress() calls
> slot_getallattrs() just before clearing the slot. It seems like this
> ought to happen before we loop over the attributes, so that we don't
> need to call slot_getattr() every time. See the comment for that
> function. But even if we didn't do that for some reason, why would we
> do it here? If it's already been done, it shouldn't do anything, and
> if it hasn't been done, it might overwrite some of the values we just
> poked into tts_values. It also seems suspicious that we can get away
> with clearing the slot and then again marking it valid. I'm not sure
> it really works like that. Like, can't clearing the slot invalidate
> pointers stored in tts_values[]? For instance, if they are pointers
> into an in-memory heap tuple, tts_heap_clear() is going to free the
> tuple; if they are pointers into a buffer, tts_buffer_heap_clear() is
> going to unpin it. I think the supported procedure for this sort of
> thing is to have a second slot, set tts_values, tts_isnull etc. and
> then materialize the slot. After materializing the new slot, it's
> independent of the old slot, which can then be cleared. See for
> example tts_virtual_materialize(). The whole approach you've taken
> here might need to be rethought a bit. I think you are right to want
> to avoid copying everything over into a new slot if nothing needs to
> be done, and I think we should definitely keep that optimization, but
> I think if you need to copy stuff, you have to do the above procedure
> and then continue using the other slot instead of the original one.
> Some places I think we have functions that return either the original
> slot or a different one depending on how it goes; that might be a
> useful idea here. But, you also can't just spam-create slots; it's
> important that whatever ones we end up with get reused for every
> tuple.

I have changed this algorithm, so now if we have to decompress
anything we will use the new slot and we will stick that new slot to
the ModifyTableState, DR_transientrel for matviews and DR_intorel for
CTAS.  Does this looks okay or we need to do something else?  If this
logic looks fine then maybe we can think of some more optimization and
cleanup in this function.


> Doesn't the change to describeOneTableDetails() require declaring
> changing the declaration of char *headers[11] to char *headers[12]?
> How does this not fail Assert(cols <= lengthof(headers))?

Fixed

> Why does describeOneTableDetais() arrange to truncate the printed
> value? We don't seem to do that for the other column properties, and
> it's not like this one is particularly long.

Not required, fixed.

> Perhaps the changes to pg_am.dat shouldn't remove the blank line?

Fixed

> I think the comment to pg_attribute.h could be rephrased to stay
> something like: "OID of compression AM. Must be InvalidOid if and only
> if typstorage is 'a' or 'b'," replacing 'a' and 'b' with whatever the
> right letters are. This would be shorter and I think also clearer than
> what you have

Fixed

> The first comment change in postgres.h is wrong. You changed
> va_extsize to "size in va_extinfo" but the associated structure
> definition is unchanged, so the comment shouldn't be changed either.

Yup, not required.

> In toast_internals.h, you end using 30 as a constant several times but
> have no #define for it. You do have a #define for RAWSIZEMASK, but
> that's really a derived value from 30. Also, it's not a great name
> because it's kind of generic. So how about something like:
>
> #define TOAST_RAWSIZE_BITS 30
> #define TOAST_RAWSIZE_MASK ((1 << (TOAST_RAW_SIZE_BITS + 1)) - 1)
>
> But then again on second thought, this 30 seems to be the same 30 that
> shows up in the changes to postgres.h, and there again 0x3FFFFFFF
> shows up too. So maybe we should actually be defining these constants
> there, using names like VARLENA_RAWSIZE_BITS and VARLENA_RAWSIZE_MASK
> and then having toast_internals.h use those constants as well.

Done, IMHO it should be
#define VARLENA_RAWSIZE_BITS 30
#define VARLENA_RAWSIZE_MASK ((1 << VARLENA_RAWSIZE_BITS) -1 )


> Taken with the email I sent yesterday, I think this is a more or less
> complete review of 0001. Although there are a bunch of things to fix
> here still, I don't think this is that far from being committable. I
> don't at this point see too much in terms of big design problems.
> Probably the CompareCompressionMethodAndDecompress() is the closest to
> a design-level problem, and certainly something needs to be done about
> it, but even that is a fairly localized problem in the context of the
> entire patch.

0001 is attached, now pending parts are

- Confirm the new design of CompareCompressionMethodAndDecompress
- Performance test, especially lz4 with small varlena
- Rebase other patches atop this patch
- comment in ddl.sgml


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: Dumping/restoring fails on inherited generated column
Next
From: Dilip Kumar
Date:
Subject: Re: [HACKERS] Custom compression methods