Hi Tom,
On Thu, Mar 6, 2025 at 11:33 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Mar 6, 2025 at 12:43 AM Nikhil Kumar Veldanda
> > <veldanda.nikhilkumar17@gmail.com> wrote:
> >> Notably, this is the first compression algorithm for Postgres that can make use of a dictionary to provide higher
> >> levels of compression, but dictionaries have to be generated and maintained,
>
> > I think that solving the problems around using a dictionary is going
> > to be really hard. Can we see some evidence that the results will be
> > worth it?
>
> BTW, this is hardly the first such attempt. See [1] for a prior
> attempt at something fairly similar, which ended up going nowhere.
> It'd be wise to understand why that failed before pressing forward.
>
> Note that the thread title for [1] is pretty misleading, as the
> original discussion about JSONB-specific compression soon migrated
> to discussion of compressing TOAST data using dictionaries. At
> least from a ten-thousand-foot viewpoint, that seems like exactly
> what you're proposing here. I see that you dismissed [1] as
> irrelevant upthread, but I think you'd better look closer.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

Thank you for highlighting the previous discussion; I reviewed [1]
closely. While both methods involve dictionary-based compression, the
approach I'm proposing differs significantly.

The previous method explicitly extracted string values from JSONB and
assigned unique OIDs to each entry, resulting in distinct dictionary
entries for every unique value. In contrast, this approach directly
leverages Zstandard's dictionary training API. We provide raw data
samples to Zstd, which generates a dictionary of a specified size.
This dictionary is then stored in a catalog table and used to compress
subsequent inserts for the specific attribute it was trained on.
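
To make that flow concrete, here is a minimal standalone sketch of the
Zstandard calls involved. It only illustrates the library API, not the
patch itself; the sample data is made up and error handling is
abbreviated:

/*
 * Minimal sketch of the Zstandard dictionary APIs (illustration only,
 * not the patch): train a dictionary from samples, then compress and
 * decompress one value with it.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>
#include <zdict.h>

#define NB_SAMPLES 2048

int
main(void)
{
    /* Build small, similar training samples laid out back to back. */
    char       *samples = malloc(NB_SAMPLES * 64);
    size_t     *sample_sizes = malloc(NB_SAMPLES * sizeof(size_t));
    size_t      offset = 0;

    for (int i = 0; i < NB_SAMPLES; i++)
    {
        int         n = sprintf(samples + offset,
                                "{\"id\":%d,\"name\":\"user_%d\"}", i, i);

        sample_sizes[i] = (size_t) n;
        offset += (size_t) n;
    }

    /* Train a dictionary of at most 1 KB from the raw sample buffer. */
    char        dict[1024];
    size_t      dict_size = ZDICT_trainFromBuffer(dict, sizeof(dict),
                                                  samples, sample_sizes,
                                                  NB_SAMPLES);

    if (ZDICT_isError(dict_size))
    {
        fprintf(stderr, "training failed: %s\n",
                ZDICT_getErrorName(dict_size));
        return 1;
    }

    /* Compress a new value with the trained dictionary ... */
    const char  src[] = "{\"id\":31337,\"name\":\"user_31337\"}";
    char        compressed[256];
    char        roundtrip[256];
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    ZSTD_DCtx  *dctx = ZSTD_createDCtx();

    size_t      csize = ZSTD_compress_usingDict(cctx,
                                                compressed, sizeof(compressed),
                                                src, sizeof(src),
                                                dict, dict_size, 3);

    /* ... and decompress it with exactly the same dictionary. */
    size_t      dsize = ZSTD_decompress_usingDict(dctx,
                                                  roundtrip, sizeof(roundtrip),
                                                  compressed, csize,
                                                  dict, dict_size);

    printf("dict %zu bytes; value %zu -> %zu -> %zu bytes\n",
           dict_size, sizeof(src), csize, dsize);

    ZSTD_freeCCtx(cctx);
    ZSTD_freeDCtx(dctx);
    free(samples);
    free(sample_sizes);
    return 0;
}

In the proposal, the dict buffer produced by training is what lands in
the catalog table, and the *_usingDict() calls are what compression of
subsequent inserts and decompression of stored values come down to.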

Key differences include:
1. No new data types are required.
2. An attribute can optionally have multiple dictionaries; the latest
one is used for compression, and the exact dictionary that compressed
a value is retrieved and applied for its decompression (see the sketch
after this list).
3. Compression uses Zstandard's trained dictionaries when they are available.
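
To illustrate point 2 only (the struct below is purely hypothetical,
not the patch's actual on-disk layout): a compressed value just has to
carry enough information to find the exact dictionary again, e.g. an
id resolvable against the catalog table:

/*
 * Purely hypothetical sketch of the idea in point 2 above; the names
 * and layout are assumptions, not what the patch actually stores.
 */
#include <stdint.h>

typedef struct DictCompressedValue
{
    uint32_t    dict_id;        /* id of the trained dictionary used;
                                 * looked up in the catalog table at
                                 * decompression time */
    uint32_t    raw_size;       /* original, uncompressed length */
    /* zstd-compressed payload follows */
} DictCompressedValue;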

Additionally, I have provided an option for users to define custom
sampling and training logic, as directly passing raw buffers to the
training API may not always yield optimal results, especially for
certain custom variable-length data types. This flexibility motivates
the necessary adjustments to `pg_type`.
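
As a sketch of what that option could look like (the callback name and
signature here are hypothetical, just to illustrate the idea of
per-type sampling; they are not necessarily what the `pg_type` changes
add):

/*
 * Hypothetical illustration only: a type could supply its own routine
 * for turning a detoasted value into a training sample, instead of
 * handing the raw bytes to ZDICT_trainFromBuffer() directly.
 */
#include <stddef.h>
#include <string.h>

typedef size_t (*zstd_sample_extract_fn) (const char *value, size_t value_len,
                                          char *sample_buf, size_t sample_cap);

/* Default behaviour: use the raw bytes of the value as the sample. */
static size_t
raw_sample_extract(const char *value, size_t value_len,
                   char *sample_buf, size_t sample_cap)
{
    size_t      n = (value_len < sample_cap) ? value_len : sample_cap;

    memcpy(sample_buf, value, n);
    return n;
}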

I would greatly appreciate your feedback or any additional suggestions
you might have.

[1] https://www.postgresql.org/message-id/flat/CAJ7c6TOtAB0z1UrksvGTStNE-herK-43bj22%3D5xVBg7S4vr5rQ%40mail.gmail.com

Best regards,
Nikhil Veldanda