Thread: Force re-compression with lz4
Hello,
I have a table storing mostly text data (40M+ rows) whose pg_total_relation_size is ~670GB.
I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4 compression.
I’ve altered the column to use the new lz4 compression, but that only applies to new rows.
What’s the recommended way of triggering the re-evaluation for pre-existing rows?
I tried wrapping a function like the following, but apparently each old record retains its previously applied compression.
text_corpus=(SELECT t.text from ...);
delete from t where id=;
insert into t(id, text) values (id, text_corpus);
For the time being, I've resorted to an external shell script that runs against the DB, but that's too slow since it moves data in and out of the database.
Is there a smarter way to do this?
Thanks,
Flo
On 10/17/21 10:12 AM, Florents Tselai wrote:
> I tried wrapping a function like the following, but apparently each old record retains its previously applied compression.
>
> text_corpus=(SELECT t.text from ...);
> delete from t where id=;
> insert into t(id, text) values (id, text_corpus);

Because it's all in one transaction?

> For the time being, I've resorted to an external shell script that runs against the DB, but that's too slow since it moves data in and out of the database.
> Is there a smarter way to do this?

Even with in-place compression, you've got to read the uncompressed data.

Does your shell script process one record at a time? Maybe do ranges:

COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
DELETE FROM t WHERE id BETWEEN x AND y;
COPY t FROM '/some/file.csv';
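(Side note, untested sketch: if the session running these statements has no filesystem access on the database server, psql's client-side \copy can move each batch through the client instead; x, y and the file path are placeholders as above.)

\copy (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv'
DELETE FROM t WHERE id BETWEEN x AND y;
-- \copy runs the same COPY, but reads/writes the file on the client side
\copy t FROM '/some/file.csv'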
--
Angular momentum makes the world go 'round.
On 10/17/21 11:36 AM, Ron wrote:
> Even with in-place compression, you've got to read the uncompressed data.
>
> Does your shell script process one record at a time? Maybe do ranges:
>
> COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
> DELETE FROM t WHERE id BETWEEN x AND y;

VACUUM t;

> COPY t FROM '/some/file.csv';
--
Angular momentum makes the world go 'round.
On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai <florents.tselai@gmail.com> wrote:
> Is there a smarter way to do this?
It should be enough to VACUUM FULL the table (but it has to be VACUUM FULL, not a regular vacuum). Or CLUSTER.
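(Illustrative sketch of those commands, assuming the table is named t with a primary-key index t_pkey as in the earlier examples; both take an ACCESS EXCLUSIVE lock and rewrite the whole table. As Michael Paquier notes further down the thread, on 14 such a rewrite does not by itself recompress existing values.)

-- full-table rewrite; needs roughly the table's size in free disk space
VACUUM FULL t;
-- or, rewriting in primary-key order:
CLUSTER t USING t_pkey;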
On 10/17/21 10:17, Magnus Hagander wrote:
> On Sun, Oct 17, 2021 at 5:12 PM Florents Tselai
> <florents.tselai@gmail.com> wrote:
>
>> Is there a smarter way to do this?
>
> It should be enough to VACUUM FULL the table (but it has to be VACUUM
> FULL, not a regular vacuum). Or CLUSTER.

With the proviso that this will require double the existing space, ~670GB, until the operation is completed.

> --
> Magnus Hagander
> Me: https://www.hagander.net/
> Work: https://www.redpill-linpro.com/

--
Adrian Klaver
adrian.klaver@aklaver.com
Yes, that COPY-DELETE-COPY sequence is what I ended up doing.
Unfortunately I can’t use ranges, as the PK is a text string.
On 17 Oct 2021, at 7:36 PM, Ron <ronljohnsonjr@gmail.com> wrote:

> Does your shell script process one record at a time? Maybe do ranges:
>
> COPY (SELECT * FROM t WHERE id BETWEEN x AND y) TO '/some/file.csv';
> DELETE FROM t WHERE id BETWEEN x AND y;
> COPY t FROM '/some/file.csv';
I did look into VACUUM (FULL) for its PROCESS_TOAST option, which makes sense, but the thing is I already had a cron-ed VACUUM (FULL) which I ended up disabling a while back, exactly because of the double-space requirement. The DB is already around 1TB in size and occupying another ~600GB would require some hassle. Thus, the external script approach makes more sense.

> On 17 Oct 2021, at 8:28 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>
> On 10/17/21 10:17, Magnus Hagander wrote:
>> It should be enough to VACUUM FULL the table (but it has to be VACUUM
>> FULL, not a regular vacuum). Or CLUSTER.
>
> With the proviso that this will require double the existing space, ~670GB, until the operation is completed.
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
Florents Tselai wrote:

> I have a table storing mostly text data (40M+ rows) whose
> pg_total_relation_size is ~670GB.
> I’ve just upgraded to postgres 14 and I’m now eager to try the new LZ4
> compression.

You could start experimenting with data samples rather than the full contents.

FWIW, in my case I've found that the compression ratio of lz4 was only marginally better than pglz (like 2% on text). As for decompression time, it doesn't seem to differ significantly from pglz, so overall, recompressing existing data did not seem worth the trouble.

However lz4 appears to be much faster to compress than pglz, so its benefit is clear in terms of CPU usage for future insertions.

Best regards,

--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
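(One possible way to run that sample experiment, as an untested sketch; it assumes the table/columns t, id and text from the earlier messages, and the 100000-row sample size is arbitrary.)

-- two throwaway tables holding the same sample, one per compression method
CREATE TABLE sample_pglz (body text COMPRESSION pglz);
CREATE TABLE sample_lz4  (body text COMPRESSION lz4);

-- concatenating '' forces each value to be detoasted and rebuilt, so it
-- should be recompressed with the target column's method instead of
-- possibly being copied over still compressed with the old one
INSERT INTO sample_pglz SELECT t.text || '' FROM t ORDER BY t.id LIMIT 100000;
INSERT INTO sample_lz4  SELECT t.text || '' FROM t ORDER BY t.id LIMIT 100000;

-- compare the on-disk footprint, TOAST included
SELECT pg_size_pretty(pg_total_relation_size('sample_pglz')) AS pglz_size,
       pg_size_pretty(pg_total_relation_size('sample_lz4'))  AS lz4_size;

DROP TABLE sample_pglz, sample_lz4;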
On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
> I did look into VACUUM (FULL) for its PROCESS_TOAST option, which
> makes sense, but the thing is I already had a cron-ed VACUUM (FULL)
> which I ended up disabling a while back, exactly because of the
> double-space requirement.

Please note that VACUUM FULL does not enforce a recompression on existing values. See commit dbab0c0, which disabled this choice as it introduced a noticeable performance penalty in some cases when looking at the compression type of the vacuumed table attributes:

=# CREATE TABLE cmdata(f1 text COMPRESSION pglz);
CREATE TABLE
=# INSERT INTO cmdata VALUES(repeat('1234567890', 1000));
INSERT 0 1
=# SELECT pg_column_compression(f1) FROM cmdata;
 pg_column_compression
-----------------------
 pglz
(1 row)
=# ALTER TABLE cmdata ALTER COLUMN f1 SET COMPRESSION lz4;
ALTER TABLE
=# VACUUM FULL cmdata;
VACUUM
=# SELECT pg_column_compression(f1) FROM cmdata;
 pg_column_compression
-----------------------
 pglz
(1 row)
--
Michael
On Sun, 17 Oct 2021 at 21:04, Florents Tselai <florents.tselai@gmail.com> wrote:
> Yes, that COPY-DELETE-COPY sequence is what I ended up doing.
> Unfortunately I can’t use ranges, as the PK is a text string.

Unless you have a really weird PK and have trouble calculating bounds, text strings are sortable and fine to use as ranges. If you use half-open intervals you do not even need to query them all: i.e., do a batch for key < 'aa', then one for >= 'aa' and < 'ab', repeat, and terminate with one for >= 'zz' (substitute letter pairs for whatever you think will partition your keys into adequate chunks). You can find adequate bounds by scanning the PK index and skipping; just remember to use half-open intervals and to cover the whole key domain.

Francisco Olarte.
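(Illustrative batch over such a half-open interval, reusing the t/id names and the COPY pattern from earlier in the thread; the 'aa'/'ab' bounds and the 500000 step are arbitrary placeholders.)

-- one batch covering the half-open interval ['aa', 'ab');
-- the first batch would use id < 'aa' and the last id >= 'zz',
-- so together the batches cover the whole key domain exactly once
COPY (SELECT * FROM t WHERE id >= 'aa' AND id < 'ab') TO '/some/file.csv';
DELETE FROM t WHERE id >= 'aa' AND id < 'ab';
COPY t FROM '/some/file.csv';

-- one way to pick the next boundary by skipping ahead in the PK index
SELECT id FROM t WHERE id >= 'ab' ORDER BY id OFFSET 500000 LIMIT 1;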
Oh, that’s good to know then. So besides ALTER TABLE ... SET COMPRESSION for future inserts, there’s not much one can do for pre-existing values.

I think it makes sense to update / add more info to the docs on this as well, since other people in the thread expected it to work that way too. Maybe at some point even allow an explicit option to be defined during VACUUM?

> On 18 Oct 2021, at 8:18 AM, Michael Paquier <michael@paquier.xyz> wrote:
>
> Please note that VACUUM FULL does not enforce a recompression on
> existing values. See commit dbab0c0, which disabled this choice as it
> introduced a noticeable performance penalty in some cases when looking
> at the compression type of the vacuumed table attributes.
On Mon, Oct 18, 2021 at 09:57:11AM +0300, Florents Tselai wrote:
> Oh, that’s good to know then. So besides ALTER TABLE ... SET COMPRESSION
> for future inserts, there’s not much one can do for pre-existing values.

The posting style of the mailing list is to not top-post, so if you could avoid breaking the logic of the thread, that would be nice :)

> I think it makes sense to update / add more info to the docs on this
> as well, since other people in the thread expected it to work that
> way too.

There is some documentation, as changing the compression for an existing table is part of ALTER TABLE:
https://www.postgresql.org/docs/current/sql-altertable.html

"This does not cause the table to be rewritten, so existing data may still be compressed with other compression methods. If the table is restored with pg_restore, then all values are rewritten with the configured compression method."

> Maybe at some point even allow an explicit option to be defined during VACUUM?

That's a part where we disagreed, as it should not be VACUUM's job to do that. The option would also have limited impact, as it targets users doing a one-time operation, most likely as part of an upgrade, so I don't think it would be worth adding anyway.
--
Michael
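(For reference, a small sketch of that behaviour using the t/text names from earlier in the thread; pg_column_compression() reports the method each stored value actually carries, and returns NULL for values that are not compressed at all.)

-- switches the method used for newly stored values only
ALTER TABLE t ALTER COLUMN text SET COMPRESSION lz4;

-- existing values keep whatever method they were written with;
-- note that this scans the whole table
SELECT pg_column_compression(t.text) AS method, count(*)
FROM t
GROUP BY 1;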
On Mon, Oct 18, 2021 at 7:18 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Oct 17, 2021 at 10:13:48PM +0300, Florents Tselai wrote:
>> I did look into VACUUM (FULL) for its PROCESS_TOAST option, which
>> makes sense, but the thing is I already had a cron-ed VACUUM (FULL)
>> which I ended up disabling a while back, exactly because of the
>> double-space requirement.
>
> Please note that VACUUM FULL does not enforce a recompression on
> existing values. See commit dbab0c0, which disabled this choice as it
> introduced a noticeable performance penalty in some cases when
> looking at the compression type of the vacuumed table attributes:
Oh dang, I missed that this was reverted. Thanks for pointing that out!
On 10/18/21 01:07, Michael Paquier wrote:
> CPU-speaking, LZ4 is *much* faster than pglz when it comes to
> compression or decompression with its default options. The
> compression ratio is comparable between both, still LZ4 compresses on
> average less than PGLZ.
> --
> Michael

LZ4 works much better with deduplication tools like Data Domain or Data Domain Boost (client-side deduplication). With zip or gzip compression, deduplication ratios are much lower than with LZ4. Most of the modern backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. The LZ4 algorithm uses less CPU than zip, gzip or bzip2 and works much better with the deduplication algorithms employed by the backup tools. This is actually a very big and positive change.

Disclosure: I used to work for Commvault as a senior PS engineer. Commvault was the first tool on the market to combine LZ4 and deduplication.

Regards

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com
On 10/18/21 06:41, Mladen Gogala wrote:
> LZ4 works much better with deduplication tools like Data Domain or Data
> Domain Boost (client-side deduplication). With zip or gzip compression,
> deduplication ratios are much lower than with LZ4. Most of the modern
> backup tools (DD, Veeam, Rubrik, Commvault) support deduplication. The
> LZ4 algorithm uses less CPU than zip, gzip or bzip2 and works much better
> with the deduplication algorithms employed by the backup tools. This is
> actually a very big and positive change.

Not sure how much this applies to the Postgres usage of lz4. As I understand it, this is only used internally for table compression. When using pg_dump compression, gzip is used, unless you pipe plain text output through some other program.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 10/18/21 11:01, Adrian Klaver wrote:
> Not sure how much this applies to the Postgres usage of lz4. As I
> understand it, this is only used internally for table compression.
> When using pg_dump compression, gzip is used, unless you pipe plain
> text output through some other program.

This applies when using a 3rd-party commercial backup tool with deduplication. You'd be surprised how many people do that. One tool to back up them all, one tool to find them and on the LTO cartridge back them up. I apologize for this cheesy paraphrase of Tolkien, but I couldn't resist. Long story short, for the 3rd-party backup tools LZ4 compression will yield better deduplication ratios than other forms of compression, thereby saving you space.

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com