Thread: Compress ReorderBuffer spill files using LZ4

Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
Hi,

When the content of a large transaction (size exceeding
logical_decoding_work_mem) and its sub-transactions has to be
reordered during logical decoding, all the changes are written to
disk in temporary files located in pg_replslot/<slot_name>.
Decoding very large transactions through multiple replication slots
can lead to disk space saturation and high I/O utilization.

When compiled with LZ4 support (--with-lz4), this patch enables data
compression/decompression of these temporary files. Each transaction
change that must be written on disk (ReorderBufferDiskChange) is now
compressed and encapsulated in a new structure.

Three different compression strategies are implemented (sketch below):

1. LZ4 streaming compression is the preferred one and works
   efficiently for small individual changes.
2. LZ4 regular compression when the changes are too large for using
   the streaming API.
3. No compression: if compression fails, the change is then stored
   uncompressed.
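
To make the dispatch concrete, here is a rough, self-contained sketch
of the per-change logic (illustrative only, not the patch's actual
code; the ring buffer size, the names and the error handling are made
up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4.h>

#define SPILL_RING_SIZE (64 * 1024)   /* assumed streaming ring buffer */

typedef enum
{
    SPILL_LZ4_STREAM,    /* 1. streaming compression */
    SPILL_LZ4_REGULAR,   /* 2. one-shot compression  */
    SPILL_NONE           /* 3. stored uncompressed   */
} SpillStrategy;

/* dst must be at least LZ4_compressBound(src_len) bytes */
static SpillStrategy
compress_change(LZ4_stream_t *stream, char *ring, int *ring_off,
                const char *src, int src_len, char *dst, int *dst_len)
{
    int     rc;

    if (src_len < SPILL_RING_SIZE)
    {
        /*
         * Keep the source in the ring buffer so the stream can refer back
         * to previous changes; that context is what makes repeated rows
         * compress well.
         */
        if (*ring_off + src_len > SPILL_RING_SIZE)
            *ring_off = 0;
        memcpy(ring + *ring_off, src, src_len);
        rc = LZ4_compress_fast_continue(stream, ring + *ring_off, dst,
                                        src_len, LZ4_compressBound(src_len), 1);
        *ring_off += src_len;
        if (rc > 0)
        {
            *dst_len = rc;
            return SPILL_LZ4_STREAM;
        }
    }

    /* Change too large for the ring buffer (or streaming failed). */
    rc = LZ4_compress_default(src, dst, src_len, LZ4_compressBound(src_len));
    if (rc > 0)
    {
        *dst_len = rc;
        return SPILL_LZ4_REGULAR;
    }

    /* Compression failed entirely: store the raw change. */
    memcpy(dst, src, src_len);
    *dst_len = src_len;
    return SPILL_NONE;
}

int
main(void)
{
    LZ4_stream_t *stream = LZ4_createStream();
    char       *ring = malloc(SPILL_RING_SIZE);
    char        dst[LZ4_COMPRESSBOUND(128)];
    int         ring_off = 0,
                dst_len;
    const char *change = "INSERT: i=1 t='Hello number 1'";

    compress_change(stream, ring, &ring_off, change, (int) strlen(change),
                    dst, &dst_len);
    printf("compressed %zu -> %d bytes\n", strlen(change), dst_len);
    LZ4_freeStream(stream);
    free(ring);
    return 0;
}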

When not using compression, the following case generates 1590MB of
spill files:

  CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);
  INSERT INTO t
    SELECT i, 'Hello number n°'||i::TEXT
    FROM generate_series(1, 10000000) as i;

With LZ4 compression, it creates 653MB of spill files: 58.9% less
disk space usage.

Open items:

1. The spill_bytes column from pg_stat_get_replication_slot() still reports
the uncompressed (plain) data size, not the compressed size. Should we
expose the compressed data size when compression occurs?

2. Do we want a GUC to switch compression on/off?

Regards,

JT

Attachment

Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
>
> When the content of a large transaction (size exceeding
> logical_decoding_work_mem) and its sub-transactions has to be
> reordered during logical decoding, then, all the changes are written
> on disk in temporary files located in pg_replslot/<slot_name>.
> Decoding very large transactions by multiple replication slots can
> lead to disk space saturation and high I/O utilization.
>

Why can't one use 'streaming' option to send changes to the client
once it reaches the configured limit of 'logical_decoding_work_mem'?

>
> 2. Do we want a GUC to switch compression on/off?
>

It depends on the overhead of decoding. Did you try to measure the
decoding overhead of decompression when reading compressed files?

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Dilip Kumar
Date:
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > When the content of a large transaction (size exceeding
> > logical_decoding_work_mem) and its sub-transactions has to be
> > reordered during logical decoding, then, all the changes are written
> > on disk in temporary files located in pg_replslot/<slot_name>.
> > Decoding very large transactions by multiple replication slots can
> > lead to disk space saturation and high I/O utilization.
> >
>
> Why can't one use 'streaming' option to send changes to the client
> once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> >
> > 2. Do we want a GUC to switch compression on/off?
> >
>
> It depends on the overhead of decoding. Did you try to measure the
> decoding overhead of decompression when reading compressed files?

I think it depends on the trade-off between the I/O savings from
reducing the data size and the performance cost of compressing and
decompressing the data. This balance is highly dependent on the
hardware. For example, if you have a very slow disk and a powerful
processor, compression could be advantageous. Conversely, if the disk
is very fast, the I/O savings might be minimal, and the compression
overhead could outweigh the benefits. Additionally, the effectiveness
of compression also depends on the compression ratio, which varies
with the type of data being compressed.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Thu, Jun 6, 2024 at 4:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > When the content of a large transaction (size exceeding
> > logical_decoding_work_mem) and its sub-transactions has to be
> > reordered during logical decoding, then, all the changes are written
> > on disk in temporary files located in pg_replslot/<slot_name>.
> > Decoding very large transactions by multiple replication slots can
> > lead to disk space saturation and high I/O utilization.
> >
>
> Why can't one use 'streaming' option to send changes to the client
> once it reaches the configured limit of 'logical_decoding_work_mem'?

That's right, setting subscription's option 'streaming' to 'on' moves
the problem away from the publisher to the subscribers. This patch
tries to improve the default situation when 'streaming' is set to
'off'.

> > 2. Do we want a GUC to switch compression on/off?
> >
>
> It depends on the overhead of decoding. Did you try to measure the
> decoding overhead of decompression when reading compressed files?

Quick benchmarking executed on my laptop shows 1% overhead.

Table DDL:
CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);

Data generated with:
INSERT INTO t SELECT i, 'Text number n°'||i::TEXT FROM
generate_series(1, 10000000) as i;

Restoration duration measured using timestamps of log messages:
"DEBUG:  restored XXXX/YYYY changes from disk"

HEAD: 25.54s, 25.94s, 25.516s, 26.267s, 26.11s / avg=25.874s
Patch: 26.872s, 26.311s, 25.753s, 26.003s, 25.843s / avg=26.156s

Regards,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires <julmon@gmail.com> wrote:
>
> On Thu, Jun 6, 2024 at 4:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
> > >
> >
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> That's right, setting subscription's option 'streaming' to 'on' moves
> the problem away from the publisher to the subscribers. This patch
> tries to improve the default situation when 'streaming' is set to
> 'off'.
>

Can we think of changing the default to 'parallel'? BTW, it would be
better to use 'parallel' for the 'streaming' option, if the workload
has large transactions. Is there a reason to use a default value in
this case?

> > > 2. Do we want a GUC to switch compression on/off?
> > >
> >
> > It depends on the overhead of decoding. Did you try to measure the
> > decoding overhead of decompression when reading compressed files?
>
> Quick benchmarking executed on my laptop shows 1% overhead.
>

Thanks. We probably need different types of data (say random data in
bytea column, etc.) for this.

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Alvaro Herrera
Date:
On 2024-Jun-06, Amit Kapila wrote:

> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > When the content of a large transaction (size exceeding
> > logical_decoding_work_mem) and its sub-transactions has to be
> > reordered during logical decoding, then, all the changes are written
> > on disk in temporary files located in pg_replslot/<slot_name>.
> > Decoding very large transactions by multiple replication slots can
> > lead to disk space saturation and high I/O utilization.

I like the general idea of compressing the output of logical decoding.
It's not so clear to me that we only want to do so for spilling to disk;
for instance, if the two nodes communicate over a slow network, it may
even be beneficial to compress when streaming, so to this question:

> Why can't one use 'streaming' option to send changes to the client
> once it reaches the configured limit of 'logical_decoding_work_mem'?

I would say that streaming doesn't necessarily have to mean we don't
want compression, because for some users it might be beneficial.

I think a GUC would be a good idea.  Also, what if for whatever reason
you want a different compression algorithm or different compression
parameters?  Looking at the existing compression UI we offer in
pg_basebackup, perhaps you could add something like this:

compress_logical_decoding = none
compress_logical_decoding = lz4:42
compress_logical_decoding = spill-zstd:99

"none" says to never use compression (perhaps should be the default),
"lz4:42" says to use lz4 with parameters 42 on both spilling and
streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
only for spilling to disk.
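
Concretely, a setting like "spill-zstd:99" would decompose along these
lines (a throwaway sketch only, to illustrate the format, not proposed
code):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
    char    algorithm[16];
    int     level;          /* -1 means "library default" */
    bool    spill_only;     /* true when prefixed with "spill-" */
} CompressSpec;

static bool
parse_compress_spec(const char *value, CompressSpec *spec)
{
    const char *p = value;
    const char *colon;
    size_t      len;

    spec->level = -1;
    spec->spill_only = false;

    if (strncmp(p, "spill-", 6) == 0)
    {
        spec->spill_only = true;
        p += 6;
    }

    colon = strchr(p, ':');
    len = colon ? (size_t) (colon - p) : strlen(p);
    if (len == 0 || len >= sizeof(spec->algorithm))
        return false;
    memcpy(spec->algorithm, p, len);
    spec->algorithm[len] = '\0';

    if (colon)
        spec->level = atoi(colon + 1);

    return true;
}

int
main(void)
{
    CompressSpec spec;

    if (parse_compress_spec("spill-zstd:99", &spec))
        printf("algorithm=%s level=%d spill_only=%d\n",
               spec.algorithm, spec.level, spec.spill_only);
    return 0;
}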

(I don't mean to say that you should implement Zstd compression with
this patch, only that you should choose the implementation so that
adding Zstd support (or whatever) later is just a matter of adding some
branches here and there.  With the current #ifdef you propose, it's hard
to do that.  Maybe separate the parts that depend on the specific
algorithm to algorithm-agnostic functions.)

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Thu, Jun 6, 2024 at 6:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > On Thu, Jun 6, 2024 at 4:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > > >
> > > > When the content of a large transaction (size exceeding
> > > > logical_decoding_work_mem) and its sub-transactions has to be
> > > > reordered during logical decoding, then, all the changes are written
> > > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > > Decoding very large transactions by multiple replication slots can
> > > > lead to disk space saturation and high I/O utilization.
> > > >
> > >
> > > Why can't one use 'streaming' option to send changes to the client
> > > once it reaches the configured limit of 'logical_decoding_work_mem'?
> >
> > That's right, setting subscription's option 'streaming' to 'on' moves
> > the problem away from the publisher to the subscribers. This patch
> > tries to improve the default situation when 'streaming' is set to
> > 'off'.
> >
>
> Can we think of changing the default to 'parallel'? BTW, it would be
> better to use 'parallel' for the 'streaming' option, if the workload
> has large transactions. Is there a reason to use a default value in
> this case?

You're certainly right: if using the streaming API helps to avoid bad
situations and there is no downside, it could be used by default.

> > > > 2. Do we want a GUC to switch compression on/off?
> > > >
> > >
> > > It depends on the overhead of decoding. Did you try to measure the
> > > decoding overhead of decompression when reading compressed files?
> >
> > Quick benchmarking executed on my laptop shows 1% overhead.
> >
>
> Thanks. We probably need different types of data (say random data in
> bytea column, etc.) for this.

Yes, good idea, I will run new tests along those lines.

Thank you!

Regards,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Thu, Jun 6, 2024 at 7:24 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.

Interesting idea, I will try to evaluate how to compress/decompress
the data sent via the streaming protocol and how good the compression
ratio would be.

> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
>
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
>
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.

I agree: if the server was compiled with support for multiple
compression libraries, users should be able to choose which one they
want to use.

> (I don't mean to say that you should implement Zstd compression with
> this patch, only that you should choose the implementation so that
> adding Zstd support (or whatever) later is just a matter of adding some
> branches here and there.  With the current #ifdef you propose, it's hard
> to do that.  Maybe separate the parts that depend on the specific
> algorithm to algorithm-agnostic functions.)

Makes sense, will rework this patch in that way.

Thank you!

Regards,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Dilip Kumar
Date:
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.

+1

> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
>
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
>
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.
>

I think the compression option should be supported at the CREATE
SUBSCRIPTION level instead of being controlled by a GUC. This way, we
can decide on compression for each subscription individually rather
than applying it to all subscribers. It makes more sense for the
subscriber to control this, especially when we are planning to
compress the data sent downstream.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Compress ReorderBuffer spill files using LZ4

From
Alvaro Herrera
Date:
On 2024-Jun-07, Dilip Kumar wrote:

> I think the compression option should be supported at the CREATE
> SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> can decide on compression for each subscription individually rather
> than applying it to all subscribers. It makes more sense for the
> subscriber to control this, especially when we are planning to
> compress the data sent downstream.

True.  (I think we have some options that are in GUCs for the general
behavior and can be overridden by per-subscription options for specific
tailoring; would that make sense here?  I think it does, considering
that what we mostly want is to save disk space in the publisher when
spilling to disk.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I can't go to a restaurant and order food because I keep looking at the
fonts on the menu.  Five minutes later I realize that it's also talking
about food" (Donald Knuth)



Re: Compress ReorderBuffer spill files using LZ4

From
Dilip Kumar
Date:
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jun-07, Dilip Kumar wrote:
>
> > I think the compression option should be supported at the CREATE
> > SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> > can decide on compression for each subscription individually rather
> > than applying it to all subscribers. It makes more sense for the
> > subscriber to control this, especially when we are planning to
> > compress the data sent downstream.
>
> True.  (I think we have some options that are in GUCs for the general
> behavior and can be overridden by per-subscription options for specific
> tailoring; would that make sense here?  I think it does, considering
> that what we mostly want is to save disk space in the publisher when
> spilling to disk.)

Yeah, that makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
>

Fair enough. It would be an interesting feature if we see the wider
usefulness of compression/decompression of logical changes. For
example, if this can improve the performance of applying large
transactions (aka reduce the apply lag for them) even when the
'streaming' option is 'parallel' then it would have a much wider
impact.

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Fri, Jun 7, 2024 at 2:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I think the compression option should be supported at the CREATE
> SUBSCRIPTION level instead of being controlled by a GUC. This way, we
> can decide on compression for each subscription individually rather
> than applying it to all subscribers. It makes more sense for the
> subscriber to control this, especially when we are planning to
> compress the data sent downstream.
>

Yes, that makes sense. However, we then need to provide this option
via SQL APIs as well for other plugins.

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Tomas Vondra
Date:
On 6/6/24 16:24, Alvaro Herrera wrote:
> On 2024-Jun-06, Amit Kapila wrote:
> 
>> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
>>>
>>> When the content of a large transaction (size exceeding
>>> logical_decoding_work_mem) and its sub-transactions has to be
>>> reordered during logical decoding, then, all the changes are written
>>> on disk in temporary files located in pg_replslot/<slot_name>.
>>> Decoding very large transactions by multiple replication slots can
>>> lead to disk space saturation and high I/O utilization.
> 
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
> 
>> Why can't one use 'streaming' option to send changes to the client
>> once it reaches the configured limit of 'logical_decoding_work_mem'?
> 
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
> 
> I think a GUC would be a good idea.  Also, what if for whatever reason
> you want a different compression algorithm or different compression
> parameters?  Looking at the existing compression UI we offer in
> pg_basebackup, perhaps you could add something like this:
> 
> compress_logical_decoding = none
> compress_logical_decoding = lz4:42
> compress_logical_decoding = spill-zstd:99
> 
> "none" says to never use compression (perhaps should be the default),
> "lz4:42" says to use lz4 with parameters 42 on both spilling and
> streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
> only for spilling to disk.
> 
> (I don't mean to say that you should implement Zstd compression with
> this patch, only that you should choose the implementation so that
> adding Zstd support (or whatever) later is just a matter of adding some
> branches here and there.  With the current #ifdef you propose, it's hard
> to do that.  Maybe separate the parts that depend on the specific
> algorithm to algorithm-agnostic functions.)
> 

I haven't been following the "libpq compression" thread, but wouldn't
that also do compression for the streaming case? That was my assumption,
at least, and it seems like the right way - we probably don't want to
patch every place that sends data over the network independently, right?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compress ReorderBuffer spill files using LZ4

From
Tomas Vondra
Date:
On 6/6/24 12:58, Julien Tachoires wrote:
> ...
>
> When compiled with LZ4 support (--with-lz4), this patch enables data
> compression/decompression of these temporary files. Each transaction
> change that must be written on disk (ReorderBufferDiskChange) is now
> compressed and encapsulated in a new structure.
> 

I'm a bit confused, but why tie this to having lz4? Why shouldn't this
be supported even for pglz, or whatever algorithms we add in the future?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Fri, Jun 7, 2024 at 5:59 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 6/6/24 12:58, Julien Tachoires wrote:
> > ...
> >
> > When compiled with LZ4 support (--with-lz4), this patch enables data
> > compression/decompression of these temporary files. Each transaction
> > change that must be written on disk (ReorderBufferDiskChange) is now
> > compressed and encapsulated in a new structure.
> >
>
> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
> be supported even for pglz, or whatever algorithms we add in the future?

That's right, reworking this patch in that sense.

Regards,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
Hi,

On Fri, Jun 7, 2024 at 6:18 AM Julien Tachoires <julmon@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 5:59 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 6/6/24 12:58, Julien Tachoires wrote:
> > > ...
> > >
> > > When compiled with LZ4 support (--with-lz4), this patch enables data
> > > compression/decompression of these temporary files. Each transaction
> > > change that must be written on disk (ReorderBufferDiskChange) is now
> > > compressed and encapsulated in a new structure.
> > >
> >
> > I'm a bit confused, but why tie this to having lz4? Why shouldn't this
> > be supported even for pglz, or whatever algorithms we add in the future?
>
> That's right, reworking this patch in that sense.

Please find a new version of this patch adding support for LZ4, pglz
and ZSTD. It introduces the new GUC logical_decoding_spill_compression
which is used to set the compression method. In order to stay aligned
with the other server side GUCs related to compression methods
(wal_compression, default_toast_compression), the compression level is
not exposed to users.
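
For reference, the accepted values would look roughly like this (an
illustrative sketch that only mirrors the shape of PostgreSQL's
config_enum_entry; the names are not final):

#include <stdbool.h>
#include <stddef.h>

/* Possible values of logical_decoding_spill_compression. */
typedef enum
{
    SPILL_COMPRESSION_NONE,
    SPILL_COMPRESSION_PGLZ,
    SPILL_COMPRESSION_LZ4,
    SPILL_COMPRESSION_ZSTD
} SpillCompressionMethod;

/* Same shape as PostgreSQL's config_enum_entry. */
typedef struct
{
    const char *name;
    int         val;
    bool        hidden;
} spill_enum_entry;

static const spill_enum_entry spill_compression_options[] = {
    {"off", SPILL_COMPRESSION_NONE, false},
    {"pglz", SPILL_COMPRESSION_PGLZ, false},
    {"lz4", SPILL_COMPRESSION_LZ4, false},   /* only if built --with-lz4  */
    {"zstd", SPILL_COMPRESSION_ZSTD, false}, /* only if built --with-zstd */
    {NULL, 0, false}
};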

The last patch of this set is still WIP: it adds the machinery
required for setting the compression methods as a subscription option:
CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
I think there is a major problem with this approach: the logical
decoding context is tied to one replication slot, but multiple
subscriptions can use the same replication slot. How should this work
if 2 subscriptions want to use the same replication slot but different
compression methods?

At this point, compression is only available for the changes spilled
on disk. It is still not clear to me if the compression of data
transiting through the streaming protocol should be addressed by this
patch set or by another one. Thoughts?

Regards,

JT

Attachment

Re: Compress ReorderBuffer spill files using LZ4

From
Tomas Vondra
Date:
On 7/15/24 20:50, Julien Tachoires wrote:
> Hi,
> 
> On Fri, Jun 7, 2024 at 6:18 AM Julien Tachoires <julmon@gmail.com> wrote:
>>
>> On Fri, Jun 7, 2024 at 5:59 AM Tomas Vondra
>> <tomas.vondra@enterprisedb.com> wrote:
>>>
>>> On 6/6/24 12:58, Julien Tachoires wrote:
>>>> ...
>>>>
>>>> When compiled with LZ4 support (--with-lz4), this patch enables data
>>>> compression/decompression of these temporary files. Each transaction
>>>> change that must be written on disk (ReorderBufferDiskChange) is now
>>>> compressed and encapsulated in a new structure.
>>>>
>>>
>>> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
>>> be supported even for pglz, or whatever algorithms we add in the future?
>>
>> That's right, reworking this patch in that sense.
> 
> Please find a new version of this patch adding support for LZ4, pglz
> and ZSTD. It introduces the new GUC logical_decoding_spill_compression
> which is used to set the compression method. In order to stay aligned
> with the other server side GUCs related to compression methods
> (wal_compression, default_toast_compression), the compression level is
> not exposed to users.
> 

Sounds reasonable. I wonder if it might be useful to allow specifying
the compression level in those places, but that's clearly not something
this patch needs to do.

> The last patch of this set is still in WIP, it adds the machinery
> required for setting the compression methods as a subscription option:
> CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
> I think there is a major problem with this approach: the logical
> decoding context is tied to one replication slot, but multiple
> subscriptions can use the same replication slot. How should this work
> if 2 subscriptions want to use the same replication slot but different
> compression methods?
> 

Do we really support multiple subscriptions sharing the same slot? I
don't think we do, but maybe I'm missing something.

> At this point, compression is only available for the changes spilled
> on disk. It is still not clear to me if the compression of data
> transiting through the streaming protocol should be addressed by this
> patch set or by another one. Thought ?
> 

I'd stick to only compressing the data spilled to disk. It might be
useful to compress the streamed data too, but why shouldn't we compress
the regular (non-streamed) transactions too? Yeah, it's more efficient
to compress larger chunks, but we can fit quite large transactions into
logical_decoding_work_mem without spilling.

FWIW I'd expect that to be handled at the libpq level - there's already
a patch for that, but I haven't checked if it would handle this. But
maybe more importantly, I think compressing streamed data might need to
handle some sort of negotiation of the compression algorithm, which
seems fairly complex.

To conclude, I'd leave this out of scope for this patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Mon, Jul 15, 2024 at 12:28 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/15/24 20:50, Julien Tachoires wrote:
> > The last patch of this set is still in WIP, it adds the machinery
> > required for setting the compression methods as a subscription option:
> > CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
> > I think there is a major problem with this approach: the logical
> > decoding context is tied to one replication slot, but multiple
> > subscriptions can use the same replication slot. How should this work
> > if 2 subscriptions want to use the same replication slot but different
> > compression methods?
> >
>
> Do we really support multiple subscriptions sharing the same slot? I
> don't think we do, but maybe I'm missing something.

You are right, it's not supported; the following error is raised in this case:
ERROR:  replication slot "sub1" is active for PID 51735

I was distracted by the fact that nothing prevents the configuration
of multiple subscriptions sharing the same replication slot.

Thanks,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Tue, Jul 16, 2024 at 12:58 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/15/24 20:50, Julien Tachoires wrote:
> > Hi,
> >
> > On Fri, Jun 7, 2024 at 6:18 AM Julien Tachoires <julmon@gmail.com> wrote:
> >>
> >> On Fri, Jun 7, 2024 at 5:59 AM Tomas Vondra
> >> <tomas.vondra@enterprisedb.com> wrote:
> >>>
> >>> On 6/6/24 12:58, Julien Tachoires wrote:
> >>>> ...
> >>>>
> >>>> When compiled with LZ4 support (--with-lz4), this patch enables data
> >>>> compression/decompression of these temporary files. Each transaction
> >>>> change that must be written on disk (ReorderBufferDiskChange) is now
> >>>> compressed and encapsulated in a new structure.
> >>>>
> >>>
> >>> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
> >>> be supported even for pglz, or whatever algorithms we add in the future?
> >>
> >> That's right, reworking this patch in that sense.
> >
> > Please find a new version of this patch adding support for LZ4, pglz
> > and ZSTD. It introduces the new GUC logical_decoding_spill_compression
> > which is used to set the compression method. In order to stay aligned
> > with the other server side GUCs related to compression methods
> > (wal_compression, default_toast_compression), the compression level is
> > not exposed to users.
> >
>
> Sounds reasonable. I wonder if it might be useful to allow specifying
> the compression level in those places, but that's clearly not something
> this patch needs to do.
>
> > The last patch of this set is still in WIP, it adds the machinery
> > required for setting the compression methods as a subscription option:
> > CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
> > I think there is a major problem with this approach: the logical
> > decoding context is tied to one replication slot, but multiple
> > subscriptions can use the same replication slot. How should this work
> > if 2 subscriptions want to use the same replication slot but different
> > compression methods?
> >
>
> Do we really support multiple subscriptions sharing the same slot? I
> don't think we do, but maybe I'm missing something.
>
> > At this point, compression is only available for the changes spilled
> > on disk. It is still not clear to me if the compression of data
> > transiting through the streaming protocol should be addressed by this
> > patch set or by another one. Thought ?
> >
>
> I'd stick to only compressing the data spilled to disk. It might be
> useful to compress the streamed data too, but why shouldn't we compress
> the regular (non-streamed) transactions too? Yeah, it's more efficient
> to compress larger chunks, but we can fit quite large transactions into
> logical_decoding_work_mem without spilling.
>
> FWIW I'd expect that to be handled at the libpq level - there's already
> a patch for that, but I haven't checked if it would handle this. But
> maybe more importantly, I think compressing streamed data might need to
> handle some sort of negotiation of the compression algorithm, which
> seems fairly complex.
>
> To conclude, I'd leave this out of scope for this patch.
>

Your point sounds reasonable to me. OTOH, if we want to support
compression for the spill case, then shouldn't we ask how often such
an option would be required? Users currently have an option to stream
large transactions for parallel apply or otherwise, in which case no
spilling is required. I feel sooner or later we will make such
behavior (streaming=parallel) the default, and then spilling should
happen in very few cases. Is it worth adding this new option and GUC
if that is true?

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Tomas Vondra
Date:

On 7/16/24 14:52, Amit Kapila wrote:
> On Tue, Jul 16, 2024 at 12:58 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>>
>> On 7/15/24 20:50, Julien Tachoires wrote:
>>> Hi,
>>>
> >>> On Fri, Jun 7, 2024 at 6:18 AM Julien Tachoires <julmon@gmail.com> wrote:
>>>>
> >>>> On Fri, Jun 7, 2024 at 5:59 AM Tomas Vondra
> >>>> <tomas.vondra@enterprisedb.com> wrote:
>>>>>
>>>>> On 6/6/24 12:58, Julien Tachoires wrote:
>>>>>> ...
>>>>>>
>>>>>> When compiled with LZ4 support (--with-lz4), this patch enables data
>>>>>> compression/decompression of these temporary files. Each transaction
>>>>>> change that must be written on disk (ReorderBufferDiskChange) is now
>>>>>> compressed and encapsulated in a new structure.
>>>>>>
>>>>>
>>>>> I'm a bit confused, but why tie this to having lz4? Why shouldn't this
>>>>> be supported even for pglz, or whatever algorithms we add in the future?
>>>>
>>>> That's right, reworking this patch in that sense.
>>>
>>> Please find a new version of this patch adding support for LZ4, pglz
>>> and ZSTD. It introduces the new GUC logical_decoding_spill_compression
>>> which is used to set the compression method. In order to stay aligned
>>> with the other server side GUCs related to compression methods
>>> (wal_compression, default_toast_compression), the compression level is
>>> not exposed to users.
>>>
>>
>> Sounds reasonable. I wonder if it might be useful to allow specifying
>> the compression level in those places, but that's clearly not something
>> this patch needs to do.
>>
>>> The last patch of this set is still in WIP, it adds the machinery
>>> required for setting the compression methods as a subscription option:
>>> CREATE SUBSCRIPTION ... WITH (spill_compression = ...);
>>> I think there is a major problem with this approach: the logical
>>> decoding context is tied to one replication slot, but multiple
>>> subscriptions can use the same replication slot. How should this work
>>> if 2 subscriptions want to use the same replication slot but different
>>> compression methods?
>>>
>>
>> Do we really support multiple subscriptions sharing the same slot? I
>> don't think we do, but maybe I'm missing something.
>>
>>> At this point, compression is only available for the changes spilled
>>> on disk. It is still not clear to me if the compression of data
>>> transiting through the streaming protocol should be addressed by this
>>> patch set or by another one. Thought ?
>>>
>>
>> I'd stick to only compressing the data spilled to disk. It might be
>> useful to compress the streamed data too, but why shouldn't we compress
>> the regular (non-streamed) transactions too? Yeah, it's more efficient
>> to compress larger chunks, but we can fit quite large transactions into
>> logical_decoding_work_mem without spilling.
>>
>> FWIW I'd expect that to be handled at the libpq level - there's already
>> a patch for that, but I haven't checked if it would handle this. But
>> maybe more importantly, I think compressing streamed data might need to
>> handle some sort of negotiation of the compression algorithm, which
>> seems fairly complex.
>>
>> To conclude, I'd leave this out of scope for this patch.
>>
> 
> Your point sounds reasonable to me. OTOH, if we want to support
> compression for spill case then shouldn't there be a question how
> frequent such an option would be required? Users currently have an
> option to stream large transactions for parallel apply or otherwise in
> which case no spilling is required. I feel sooner or later we will
> make such behavior (streaming=parallel) as default, and then spilling
> should happen in very few cases. Is it worth adding this new option
> and GUC if that is true?
> 

I don't know, but streaming is 'off' by default, and I'm not aware of
any proposals to change this, so when you suggest "sooner or later"
we'll change this, I'd probably bet on "later or never".

I haven't been following the discussions about parallel apply very
closely, but my impression from dealing with similar stuff in other
tools is that it's rather easy to run into issues with some workloads,
which just makes me more skeptical about "streaming=parallel" by default.
But as I said, I'm out of the loop so I may be wrong ...

As for whether the GUC is needed, I don't know. I guess we might do the
same thing we do for streaming - we don't have a GUC to enable this, but
we default to 'off' and the client has to request that when opening the
replication connection. So it'd be specified at the subscription level,
more or less.

But then how would we specify compression for cases that invoke decoding
directly by pg_logical_slot_get_changes()? Through options?

BTW if we specify this at subscription level, will it be possible to
change the compression method?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compress ReorderBuffer spill files using LZ4

From
Amit Kapila
Date:
On Tue, Jul 16, 2024 at 7:31 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 7/16/24 14:52, Amit Kapila wrote:
> > On Tue, Jul 16, 2024 at 12:58 AM Tomas Vondra
> > <tomas.vondra@enterprisedb.com> wrote:
> >>
> >> FWIW I'd expect that to be handled at the libpq level - there's already
> >> a patch for that, but I haven't checked if it would handle this. But
> >> maybe more importantly, I think compressing streamed data might need to
> >> handle some sort of negotiation of the compression algorithm, which
> >> seems fairly complex.
> >>
> >> To conclude, I'd leave this out of scope for this patch.
> >>
> >
> > Your point sounds reasonable to me. OTOH, if we want to support
> > compression for spill case then shouldn't there be a question how
> > frequent such an option would be required? Users currently have an
> > option to stream large transactions for parallel apply or otherwise in
> > which case no spilling is required. I feel sooner or later we will
> > make such behavior (streaming=parallel) as default, and then spilling
> > should happen in very few cases. Is it worth adding this new option
> > and GUC if that is true?
> >
>
> I don't know, but streaming is 'off' by default, and I'm not aware of
> any proposals to change this, so when you suggest "sooner or later"
> we'll change this, I'd probably bet on "later or never".
>
> I haven't been following the discussions about parallel apply very
> closely, but my impression from dealing with similar stuff in other
> tools is that it's rather easy to run into issues with some workloads,
> which just makes me more skeptical about "streaming=parallel" by default.
> But as I said, I'm out of the loop so I may be wrong ...
>

It is difficult to say whether enabling it by default will have issues
or not, but so far we haven't seen many reports for the streaming =
'parallel' option. That could be because not many people enable it in
their workloads. We can probably find out by enabling it by default.

> As for whether the GUC is needed, I don't know. I guess we might do the
> same thing we do for streaming - we don't have a GUC to enable this, but
> we default to 'off' and the client has to request that when opening the
> replication connection. So it'd be specified at the subscription level,
> more or less.
>
> But then how would we specify compression for cases that invoke decoding
> directly by pg_logical_slot_get_changes()? Through options?
>

If we decide to go with this then yeah, that is one way; another
possibility is to make it a slot's property, so we could allow a new
parameter in pg_create_logical_replication_slot(). We can even think
of inventing a new API to alter the slot's properties if we decide to
go this route.

> BTW if we specify this at subscription level, will it be possible to
> change the compression method?
>

This needs analysis but offhand I can't see the problems with it.

--
With Regards,
Amit Kapila.



Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
On Wed, Jul 17, 2024 at 2:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jul 16, 2024 at 7:31 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > On 7/16/24 14:52, Amit Kapila wrote:
> > > On Tue, Jul 16, 2024 at 12:58 AM Tomas Vondra
> > > <tomas.vondra@enterprisedb.com> wrote:
> > >>
> > >> FWIW I'd expect that to be handled at the libpq level - there's already
> > >> a patch for that, but I haven't checked if it would handle this. But
> > >> maybe more importantly, I think compressing streamed data might need to
> > >> handle some sort of negotiation of the compression algorithm, which
> > >> seems fairly complex.
> > >>
> > >> To conclude, I'd leave this out of scope for this patch.
> > >>
> > >
> > > Your point sounds reasonable to me. OTOH, if we want to support
> > > compression for spill case then shouldn't there be a question how
> > > frequent such an option would be required? Users currently have an
> > > option to stream large transactions for parallel apply or otherwise in
> > > which case no spilling is required. I feel sooner or later we will
> > > make such behavior (streaming=parallel) as default, and then spilling
> > > should happen in very few cases. Is it worth adding this new option
> > > and GUC if that is true?
> > >
> >
> > I don't know, but streaming is 'off' by default, and I'm not aware of
> > any proposals to change this, so when you suggest "sooner or later"
> > we'll change this, I'd probably bet on "later or never".
> >
> > I haven't been following the discussions about parallel apply very
> > closely, but my impression from dealing with similar stuff in other
> > tools is that it's rather easy to run into issues with some workloads,
> > which just makes me more skeptical about "streaming=parallel" by default.
> > But as I said, I'm out of the loop so I may be wrong ...
> >
>
> It is difficult to say whether enabling it by default will have issues
> or not but till now we haven't seen many reports for the streaming =
> 'parallel' option. It could be due to the reason that not many people
> enable it in their workloads. We can probably find out by enabling it
> by default.
>
> > As for whether the GUC is needed, I don't know. I guess we might do the
> > same thing we do for streaming - we don't have a GUC to enable this, but
> > we default to 'off' and the client has to request that when opening the
> > replication connection. So it'd be specified at the subscription level,
> > more or less.
> >
> > But then how would we specify compression for cases that invoke decoding
> > directly by pg_logical_slot_get_changes()? Through options?
> >
>
> If we decide to go with this then yeah that is one way, another
> possibility is to make it a slot's property, so we can allow to take a
> new parameter in pg_create_logical_replication_slot(). We can even
> think of inventing a new API to alter the slot's properties if we
> decide to go this route.

Please find a new version of this patch set. The compression method is
now set at the subscription level via CREATE SUBSCRIPTION or ALTER
SUBSCRIPTION and can be passed to
pg_logical_slot_get_changes()/pg_logical_slot_get_binary_changes()
through the spill_compression option.

> > BTW if we specify this at subscription level, will it be possible to
> > change the compression method?
> >
>
> This needs analysis but offhand I can't see the problems with it.

I didn't notice any issue: the compression method can be changed even
while decoding is in progress; in this case, the replication worker
restarts due to the parameter change.

JT

Attachment

Re: Compress ReorderBuffer spill files using LZ4

From
Julien Tachoires
Date:
Hi Tomas,

On Mon, Sep 23, 2024 at 6:13 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi,
>
> I've spent a bit more time on this, mostly running tests to get a better
> idea of the practical benefits.

Thank you for your code review and testing!

> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
> for pglz_compress() to return -1. This can happen if the data is not
> compressible, and would not fit into the output buffer. The code can't
> just do elog(ERROR) in this case, it needs to handle that by storing the
> raw data. The attached fixup patch makes this work for me - I'm not
> claiming this is the best way to handle this, but it works.
>
> FWIW I find it strange the tests included in the patch did not trigger
> this. That probably means the tests are not quite sufficient.
>
>
> Now, to the testing. Attached are two scripts, testing different cases:
>
> test-columns.sh - Table with a variable number of 'float8' columns.
>
> test-toast.sh - Table with a single text column.
>
> The script always sets up a publication/subscription on two instances,
> generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
> waits for it to be replicated to the replica, and measures how much data
> was spilled to disk with the different compression methods (off, pglz
> and lz4). There's a couple more metrics, but that's irrelevant here.

It would be interesting to run the same tests with zstd: in my early
testing I found that zstd was able to provide a better compression
ratio than lz4, but seemed to use more CPU resources and to be slower.

> For the "column" test, it looks like this (this is in MB):
>
>     rows    columns    distribution    off     pglz    lz4
>   ========================================================
>   100000       1000    compressible    778      20       9
>                              random    778     778      16
>   --------------------------------------------------------
>   1000000       100    compressible    916     116      62
>                              random    916     916      67
>
> It's very clear that for the "compressible" data (which just copies the
> same value into all columns), both pglz and lz4 can significantly reduce
> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
> columns it's a bit less efficient, but still good.
>
> For the "random" data (where every column gets a random value, but rows
> are copied), it's a very different story - pglz does not help at all,
> while lz4 still massively reduces the amount of spilled data.
>
> I think the explanation is very simple - for pglz, we compress each row
> on it's own, there's no concept of streaming/context. If a row is
> compressible, it works fine, but when the row gets random, pglz can't
> compress it at all. For lz4, this does not matter, because with the
> streaming mode it still sees that rows are just repeated, and so can
> compress them efficiently.

That's correct.

> For TOAST test, the results look like this:
>
>   distribution     repeats        toast       off    pglz     lz4
>   ===============================================================
>   compressible       10000        lz4          14       2       1
>                                   pglz         40       4       3
>                       1000        lz4          32      16       9
>                                   pglz         54      17      10
>         ---------------------------------------------------------
>         random       10000        lz4        3305    3305    3157
>                                   pglz       3305    3305    3157
>                       1000        lz4        3166    3162    1580
>                                   pglz       3334    3326    1745
>        ----------------------------------------------------------
>        random2       10000        lz4        3305    3305    3157
>                                   pglz       3305    3305    3158
>                       1000        lz4        3160    3156    3010
>                                   pglz       3334    3326    3172
>
> The "repeats" value means how long the string is - it's the number of
> "md5" hashes added to the string. The number of rows is calculated to
> keep the total amount of data the same. The "toast" column tracks what
> compression was used for TOAST, I was wondering if it matters.
>
> This time there are three data distributions - compressible means that
> each TOAST value is nicely compressible, "random" means each value is
> random (not compressible), but the rows are just copy of the same value
> (so on the whole there's a lot of redundancy). And "random2" means each
> row is random and unique (so not compressible at all).
>
> The table shows that with compressible TOAST values, compressing the
> spill file is rather useless. The reason is that ReorderBufferCompress
> is handling raw TOAST data, which is already compressed. Yes, it may
> further reduce the amount of data, but it's negligible when compared to
> the original amount of data.
>
> For the random cases, the spill compression is rather pointless. Yes,
> lz4 can reduce it to 1/2 for the shorter strings, but other than that
> it's not very useful.

It's still interesting to confirm that already-compressed data or
random data cannot be significantly compressed.

> For a while I was thinking this approach is flawed, because it only sees
> and compressed changes one by one, and that seeing a batch of changes
> would improve this (e.g. we'd see the copied rows). But I realized lz4
> already does that (in the streaming mode at least), and yet it does not
> help very much. Presumably that depends on how large the context is. If
> the random string is long enough, it won't help.
>
> So maybe this approach is fine, and doing the compression at a lower
> layer (for the whole file), would not really improve this. Even then
> we'd only see a limited amount of data.
>
> Maybe the right answer to this is that compression does not help cases
> where most of the replicated data is TOAST, and that it can help cases
> with wide (and redundant) rows, or repeated rows. And that lz4 is a
> clearly superior choice. (This also raises the question if we want to
> support REORDER_BUFFER_STRAT_LZ4_REGULAR. I haven't looked into this,
> but doesn't that behave more like pglz, i.e. no context?)

I'm working on a new version of this patch set that will include the
changes you suggested in your review. About using the LZ4 regular API,
the goal was to use it when we cannot use the streaming API because
the raw data is larger than the LZ4 ring buffer. But this is something
I'm going to drop in the new version because I'm planning to use an
approach similar to what we do in astreamer_lz4.c: using frames, not
blocks. The LZ4 frame API looks very similar to ZSTD's streaming API.
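
For what it's worth, chaining the frame API looks roughly like this (a
simplified standalone sketch that compresses a few chunks into a
single frame; buffer sizing, names and error handling are kept to a
minimum and are not the patch's code):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <lz4frame.h>

int
main(void)
{
    const char *chunks[] = {"change one", "change two", "change three"};
    LZ4F_cctx  *cctx;
    char       *out;
    size_t      cap,
                off = 0,
                rc;
    int         i;

    if (LZ4F_isError(LZ4F_createCompressionContext(&cctx, LZ4F_VERSION)))
        return 1;

    /* Generous capacity: covers the frame header, three tiny updates
     * and the frame footer for these small inputs. */
    cap = LZ4F_compressBound(64, NULL) * 4;
    out = malloc(cap);

    rc = LZ4F_compressBegin(cctx, out, cap, NULL);      /* frame header */
    if (LZ4F_isError(rc))
        return 1;
    off += rc;

    for (i = 0; i < 3; i++)
    {
        rc = LZ4F_compressUpdate(cctx, out + off, cap - off,
                                 chunks[i], strlen(chunks[i]), NULL);
        if (LZ4F_isError(rc))
            return 1;
        off += rc;
    }

    rc = LZ4F_compressEnd(cctx, out + off, cap - off, NULL); /* footer */
    if (LZ4F_isError(rc))
        return 1;
    off += rc;

    printf("frame size: %zu bytes\n", off);
    LZ4F_freeCompressionContext(cctx);
    free(out);
    return 0;
}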

> FWIW when doing these tests, it made me realize how useful would it be
> to track both the "raw" and "spilled" amounts. That is before/after
> compression. It'd make calculating compression ratio much easier.

Yes, that's why I tried to "fix" the spill_bytes counter.

Regards,

JT



Re: Compress ReorderBuffer spill files using LZ4

From
Tomas Vondra
Date:

On 9/23/24 21:58, Julien Tachoires wrote:
> Hi Tomas,
> 
> On Mon, Sep 23, 2024 at 6:13 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi,
>>
>> I've spent a bit more time on this, mostly running tests to get a better
>> idea of the practical benefits.
> 
> Thank you for your code review and testing!
> 
>> Firstly, I think there's a bug in ReorderBufferCompress() - it's legal
>> for pglz_compress() to return -1. This can happen if the data is not
>> compressible, and would not fit into the output buffer. The code can't
>> just do elog(ERROR) in this case, it needs to handle that by storing the
>> raw data. The attached fixup patch makes this work for me - I'm not
>> claiming this is the best way to handle this, but it works.
>>
>> FWIW I find it strange the tests included in the patch did not trigger
>> this. That probably means the tests are not quite sufficient.
>>
>>
>> Now, to the testing. Attached are two scripts, testing different cases:
>>
>> test-columns.sh - Table with a variable number of 'float8' columns.
>>
>> test-toast.sh - Table with a single text column.
>>
>> The script always sets up a publication/subscription on two instances,
>> generates certain amount of data (~1GB for columns, ~3.2GB for TOAST),
>> waits for it to be replicated to the replica, and measures how much data
>> was spilled to disk with the different compression methods (off, pglz
>> and lz4). There's a couple more metrics, but that's irrelevant here.
> 
> It would be interesting to run the same tests with zstd: in my early
> testing I found that zstd was able to provide a better compression
> ratio than lz4, but seemed to use more CPU resources/is slower.
> 

Oh, I completely forgot about zstd. I don't think it'd substantially
change the conclusions, though. It might compress better/worse for some
cases, but the overall behavior would remain the same.

I can't test this right now, the test machine is busy with some other
stuff. But it should not be difficult to update the test scripts I
attached and get results yourself. There are a couple of hard-coded
paths that need to be updated, of course.

>> For the "column" test, it looks like this (this is in MB):
>>
>>     rows    columns    distribution    off     pglz    lz4
>>   ========================================================
>>   100000       1000    compressible    778      20       9
>>                              random    778     778      16
>>   --------------------------------------------------------
>>   1000000       100    compressible    916     116      62
>>                              random    916     916      67
>>
>> It's very clear that for the "compressible" data (which just copies the
>> same value into all columns), both pglz and lz4 can significantly reduce
>> the amount of data. For 1000 columns it's 780MB -> 20MB/9MB, for 100
>> columns it's a bit less efficient, but still good.
>>
>> For the "random" data (where every column gets a random value, but rows
>> are copied), it's a very different story - pglz does not help at all,
>> while lz4 still massively reduces the amount of spilled data.
>>
>> I think the explanation is very simple - for pglz, we compress each row
>> on it's own, there's no concept of streaming/context. If a row is
>> compressible, it works fine, but when the row gets random, pglz can't
>> compress it at all. For lz4, this does not matter, because with the
>> streaming mode it still sees that rows are just repeated, and so can
>> compress them efficiently.
> 
> That's correct.
> 
>> For TOAST test, the results look like this:
>>
>>   distribution     repeats        toast       off    pglz     lz4
>>   ===============================================================
>>   compressible       10000        lz4          14       2       1
>>                                   pglz         40       4       3
>>                       1000        lz4          32      16       9
>>                                   pglz         54      17      10
>>         ---------------------------------------------------------
>>         random       10000        lz4        3305    3305    3157
>>                                   pglz       3305    3305    3157
>>                       1000        lz4        3166    3162    1580
>>                                   pglz       3334    3326    1745
>>        ----------------------------------------------------------
>>        random2       10000        lz4        3305    3305    3157
>>                                   pglz       3305    3305    3158
>>                       1000        lz4        3160    3156    3010
>>                                   pglz       3334    3326    3172
>>
>> The "repeats" value means how long the string is - it's the number of
>> "md5" hashes added to the string. The number of rows is calculated to
>> keep the total amount of data the same. The "toast" column tracks what
>> compression was used for TOAST, I was wondering if it matters.
>>
>> This time there are three data distributions - compressible means that
>> each TOAST value is nicely compressible, "random" means each value is
>> random (not compressible), but the rows are just copy of the same value
>> (so on the whole there's a lot of redundancy). And "random2" means each
>> row is random and unique (so not compressible at all).
>>
>> The table shows that with compressible TOAST values, compressing the
>> spill file is rather useless. The reason is that ReorderBufferCompress
>> is handling raw TOAST data, which is already compressed. Yes, it may
>> further reduce the amount of data, but it's negligible when compared to
>> the original amount of data.
>>
>> For the random cases, the spill compression is rather pointless. Yes,
>> lz4 can reduce it to 1/2 for the shorter strings, but other than that
>> it's not very useful.
> 
> It's still interesting to confirm that data already compressed or
> random data cannot be significantly compressed.
> 
>> For a while I was thinking this approach is flawed, because it only sees
>> and compressed changes one by one, and that seeing a batch of changes
>> would improve this (e.g. we'd see the copied rows). But I realized lz4
>> already does that (in the streaming mode at least), and yet it does not
>> help very much. Presumably that depends on how large the context is. If
>> the random string is long enough, it won't help.
>>
>> So maybe this approach is fine, and doing the compression at a lower
>> layer (for the whole file), would not really improve this. Even then
>> we'd only see a limited amount of data.
>>
>> Maybe the right answer to this is that compression does not help cases
>> where most of the replicated data is TOAST, and that it can help cases
>> with wide (and redundant) rows, or repeated rows. And that lz4 is a
>> clearly superior choice. (This also raises the question if we want to
>> support REORDER_BUFFER_STRAT_LZ4_REGULAR. I haven't looked into this,
>> but doesn't that behave more like pglz, i.e. no context?)
> 
> I'm working on a new version of this patch set that will include the
> changes you suggested in your review. About using LZ4 regular API, the
> goal was to use it when we cannot use the streaming API due to raw
> data larger than LZ4 ring buffer. But this is something I'm going to
> delete in the new version because I'm planning to use a similar
> approach as we do in astreamer_lz4.c: using frames, not blocks. LZ4
> frame API looks very similar to ZSTD's streaming API.
> 
>> FWIW when doing these tests, it made me realize how useful would it be
>> to track both the "raw" and "spilled" amounts. That is before/after
>> compression. It'd make calculating compression ratio much easier.
> 
> Yes, that's why I tried to "fix" the spill_bytes counter.
> 

But I think the 'fixed' counter only tracks the data after the new
compression, right? I'm suggesting having two counters - one for the
"raw" data (before compression) and one for the "compressed" data
(after compression).


regards

-- 
Tomas Vondra