Re: [PATCH] Compression dictionaries for JSONB - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: [PATCH] Compression dictionaries for JSONB
Date:
Msg-id: 20230206193328.zl5bzx54y4uc4nu3@awork3.anarazel.de
In response to: Re: [PATCH] Compression dictionaries for JSONB (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: [PATCH] Compression dictionaries for JSONB
           Re: [PATCH] Compression dictionaries for JSONB
List: pgsql-hackers
Hi,

On 2023-02-06 16:16:41 +0100, Matthias van de Meent wrote:
> On Mon, 6 Feb 2023 at 15:03, Aleksander Alekseev
> <aleksander@timescale.com> wrote:
> >
> > Hi,
> >
> > I see your point regarding the fact that creating dictionaries on a
> > training set is too beneficial to neglect. Can't argue with this.
> >
> > What puzzles me though is: what prevents us from doing this on a
> > page level, as suggested previously?
>
> The complexity of page-level compression is significant, as pages are
> currently a base primitive of our persistency and consistency scheme.

+many

It's also not a panacea performance-wise: datum-level decompression can
often be deferred much longer than page-level decompression. For things
like json[b], you'd hopefully normally have some "pre-filtering" based
on proper columns before you need to dig into the json datum.

It's also not necessarily that good, compression-ratio wise.
Particularly for wider datums, you're not going to be able to remove
much duplication, because there's only a handful of tuples per page.
Consider the case of json keys - the dictionary will often do better
than page-level compression, because it'll have the common keys in the
dictionary, which means the "full" keys never have to appear on a page
at all, whereas page-level compression will store the keys on the page
at least once.

Of course you can use a dictionary for page-level compression too, but
the gains when it works well will often be limited, because in most
OLTP-usable page-compression schemes I'm aware of, you can't compress a
page all that far down: you need a small number of possible "compressed
page sizes".

> > The more similar the data you compress, the more space and disk I/O
> > you save. Additionally, you don't have to compress/decompress the
> > data every time you access it. Everything that's in shared buffers
> > is uncompressed. Not to mention the fact that you don't care what's
> > in pg_attribute, the fact that the schema may change, etc. There is
> > a table and a dictionary for this table that you refresh from time
> > to time. Very simple.
>
> You cannot "just" refresh a dictionary used once to compress an
> object, because you need it to decompress the object too.

Right. That's what I was trying to refer to when mentioning that we
might need to add a bit of additional information to the varlena header
for datums compressed with a dictionary.

Greetings,

Andres Freund
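
To make the json-key point concrete, here is a minimal standalone
sketch (mine, not from the patch) using libzstd's ZDICT training API.
The file name, sample documents and sizes are made up; the point is
just that the shared key strings end up in the trained dictionary, so
they never have to be encoded inside any single compressed datum.

/*
 * Standalone sketch, not part of the patch: shows why a shared,
 * pre-trained dictionary helps for jsonb-like datums whose keys repeat
 * across rows but not within any single datum.  Assumes libzstd is
 * installed; build with: cc dict_demo.c -lzstd
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <zstd.h>
#include <zdict.h>

#define NSAMPLES 500

int
main(void)
{
	char	   *samples = malloc(NSAMPLES * 256);
	size_t		sizes[NSAMPLES];
	size_t		total = 0;
	char		dict[4096];
	size_t		dictsz;

	/* synthetic "rows": identical keys, varying values */
	for (int i = 0; i < NSAMPLES; i++)
	{
		char		buf[256];
		int			n = snprintf(buf, sizeof(buf),
								 "{\"customer_id\": %d, \"order_status\": \"shipped\", "
								 "\"warehouse_code\": \"W%03d\"}", i, i % 7);

		memcpy(samples + total, buf, n);
		sizes[i] = n;
		total += n;
	}

	/* train a dictionary on the sample corpus */
	dictsz = ZDICT_trainFromBuffer(dict, sizeof(dict), samples, sizes, NSAMPLES);
	if (ZDICT_isError(dictsz))
	{
		fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictsz));
		return 1;
	}

	/* compress a single datum with and without the dictionary */
	{
		const char *datum =
			"{\"customer_id\": 42, \"order_status\": \"pending\", "
			"\"warehouse_code\": \"W004\"}";
		size_t		len = strlen(datum);
		char		out[512];
		ZSTD_CCtx  *cctx = ZSTD_createCCtx();
		size_t		plain = ZSTD_compress(out, sizeof(out), datum, len, 3);
		size_t		withdict = ZSTD_compress_usingDict(cctx, out, sizeof(out),
													   datum, len,
													   dict, dictsz, 3);

		if (ZSTD_isError(plain) || ZSTD_isError(withdict))
			return 1;

		/* the dictionary carries the key strings, the datum doesn't */
		printf("raw %zu, no dict %zu, with dict %zu\n", len, plain, withdict);
		ZSTD_freeCCtx(cctx);
	}

	free(samples);
	return 0;
}

With data shaped like this, the dictionary-compressed datum should come
out well under the dictionary-less one, precisely because none of the
key strings have to be repeated in the datum itself - which a page-level
scheme cannot achieve, since the keys must appear on the page at least
once.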
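As for the varlena header bit: a hypothetical sketch of what that
additional information could look like. Compressed varlenas
(varattrib_4b.va_compressed) already reserve two bits of va_tcinfo for
a compression-method ID, with pglz = 0 and lz4 = 1, so a dictionary
scheme could claim an unused ID and prefix the payload with the
dictionary's identity. The names below are invented, not from the
patch:

/*
 * Hypothetical layout sketch, not the patch's actual design.
 */
#define TOAST_DICT_COMPRESSION_ID	3	/* hypothetical, currently unused */

typedef struct varatt_dict
{
	uint32		dict_id;		/* which dictionary was used (hypothetical
								 * catalog reference); needed again at
								 * decompression time */
	/* compressed payload follows */
} varatt_dict;

Storing the dictionary identity with the datum is what makes
decompression self-describing - and it's also why a dictionary that has
already been used to compress datums can't simply be replaced
underneath them.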