Thread: Re: Proposal: Adding compression of temporary files

Re: Proposal: Adding compression of temporary files

From: Tomas Vondra
Hi,

On 11/18/24 22:58, Filip Janus wrote:
> ...
>     Hi all,
>     PostgreSQL supports data compression nowadays, but compression of
>     temporary files has not been implemented yet. Huge queries can
>     produce a significant amount of temporary data that needs to be
>     stored on disk, causing many expensive I/O operations.
>     I am attaching a proposal patch to enable temporary file compression,
>     for hash joins for now. Initially, I've chosen the LZ4 compression
>     algorithm. It would probably make better sense to start with pglz,
>     but I realized that late.
> 

Thanks for the idea & patch. I agree this might be quite useful for
workloads generating a lot of temporary files for stuff like sorts etc.
I think it will be interesting to think about the trade-offs, i.e. how
to pick the compression level - at some point the compression ratio
stops improving while paying more and more CPU time. Not sure what the
right choice is, so using the default seems fine.

I agree it'd be better to start with pglz, and only then add lz4 etc.
Firstly, pglz is simply the built-in compression, supported everywhere.
And it's also simpler to implement, I think.

>     # Future possible improvements
>     Reducing the number of memory allocations within the dumping and
>     loading of the buffer. I have two ideas for solving this problem: I
>     would either add a buffer into struct BufFile, or provide the buffer
>     as an argument from the caller. For the sequential execution, I
>     would prefer the second option.
> 

Yes, this would be good. Doing a palloc+pfree for each compression is
going to be expensive, especially because these buffers are going to be
large - likely larger than 8kB. Which means it's not cached in the
memory context, etc.

Adding it to the BufFile is not going to fly, because that doubles the
amount of memory per file. And we already have major issues with hash
joins consuming massive amounts of memory. But at the same time the
buffer is only needed during compression, and there's only one at a
time. So I agree with passing a single buffer as an argument.
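To make the "caller lends a buffer" idea concrete, here is a minimal sketch (not the patch's actual code; `compress_into` is a stand-in that merely copies bytes where the real code would call LZ4_compress_default() or pglz_compress, and the function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in "compression": verbatim copy.  The real patch would call an
 * actual compressor here; only the buffer-ownership pattern matters.
 * Returns the number of bytes produced, or 0 if dst is too small.
 */
static size_t
compress_into(const char *src, size_t srclen, char *dst, size_t dstlen)
{
    if (srclen > dstlen)
        return 0;
    memcpy(dst, src, srclen);
    return srclen;
}

/*
 * Dump one buffer, reusing scratch space owned by the caller instead of
 * doing palloc+pfree on every call.  Returns the number of bytes that
 * would go to the temp file.
 */
static size_t
dump_buffer(const char *data, size_t len, char *scratch, size_t scratchlen)
{
    size_t      written = compress_into(data, len, scratch, scratchlen);

    /* ... real code would write scratch[0..written) to the BufFile ... */
    return written;
}
```

The point is that the scratch buffer is allocated once per hash-join batch (or once per backend) rather than once per dumped buffer.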

>     # Future plan/open questions
>     In the future, I would like to add support for pglz and zstd.
>     Further, I plan to extend the support of temporary file compression
>     also to sorting, GiST index creation, etc.
> 
>     Experimenting with the stream mode of compression algorithms: the
>     compression ratio of LZ4 in block mode seems satisfying, but the
>     stream mode could produce a better ratio. However, it would consume
>     more memory due to the requirement to store the context for LZ4
>     stream compression.
> 

One thing I realized is that this only enables temp file compression for
a single place - hash join spill files. AFAIK this is because compressed
files don't support random access, and the other places might need that.

Is that correct? The patch does not explain this anywhere. If that's
correct, the patch probably should mention this in a comment for the
'compress' argument added to BufFileCreateTemp(), so that it's clear
when it's legal to set compress=true.

Which other places might compress temp files? Surely hash joins are not
the only place that could benefit from this, right?

Another thing is testing. If I run regression tests, it won't use
compression at all, because the GUC has "none" by default, right? But we
need some testing, so how would we do that? One option would be to add a
regression test that explicitly sets the GUC and does a hash join, but
that won't work with lz4 (because that may not be enabled).

Another option might be to add a PG_TEST_xxx environment variable that
determines compression to use. Something like PG_TEST_USE_UNIX_SOCKETS.
But perhaps there's a simpler way.

>     # Benchmark
>     I prepared three different databases to check expectations. Each
>     dataset is described below. My testing demonstrates that my patch
>     improves the execution time of huge hash joins.
>     Also, my implementation should not negatively affect performance
>     within smaller queries.
>     The usage of memory needed for temporary files was reduced in every
>     execution without a significant impact on execution time.
> 
>     ## Dataset A:
>     Tables:
>     table_a(bigint id, text data_text, integer data_number) - 10000000 rows
>     table_b(bigint id, integer ref_id, numeric data_value, bytea data_blob) - 10000000 rows
>     Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
> 
>     The tables contain highly compressible data.
>     The query demonstrated a reduction in temporary file usage from
>     ~20GB to ~3GB; this reduction also cut the execution time of the
>     query by about 10s.
> 
> 
>     ## Dataset B:
>     Tables:
>     table_a(integer id, text data_blob) - 1110000 rows
>     table_b(integer id, text data_blob) - 10000000 rows
>     Query: SELECT * FROM table_a a JOIN table_b b ON a.id = b.id;
> 
>     The tables contain less compressible data; data_blob was generated
>     by a pseudo-random generator.
>     In this case, the data reduction was only ~50%. Also, the execution
>     time was reduced only slightly with compression enabled.
> 
>     The second scenario demonstrates no overhead in the case of enabled
>     compression and extended work_mem to avoid temp file usage.
> 
>     ## Dataset C:
>     Tables:
>     customers (integer,text,text,text,text)
>     order_items(integer,integer,integer,integer,numeric(10,2))
>     orders(integer,integer,timestamp,numeric(10,2))
>     products(integer,text,text,numeric(10,2),integer)
> 
>     Query: SELECT p.product_id, p.name, p.price,
>     SUM(oi.quantity) AS total_quantity, AVG(oi.price) AS avg_item_price
>     FROM eshop.products p
>     JOIN eshop.order_items oi ON p.product_id = oi.product_id
>     JOIN eshop.orders o ON oi.order_id = o.order_id
>     WHERE o.order_date > '2020-01-01' AND p.price > 50
>     GROUP BY p.product_id, p.name, p.price
>     HAVING SUM(oi.quantity) > 1000
>     ORDER BY total_quantity DESC LIMIT 100;
> 
>     This scenario should demonstrate a more realistic usage of the database.
>     Enabled compression slightly reduced the temporary file usage, but
>     the execution time wasn't affected by compression.
> 
> 
>     +---------+-------------+------------+----------------+-----------------+
>     | Dataset | Compression | temp_bytes | Execution Time |                 |
>     +---------+-------------+------------+----------------+-----------------+
>     | A       | Yes         | 3.09 GiB   | 22s586ms       | work_mem = 4MB  |
>     |         | No          | 21.89 GiB  | 35s            | work_mem = 4MB  |
>     +---------+-------------+------------+----------------+-----------------+
>     | B       | Yes         | 333 MB     | 1815.545 ms    | work_mem = 4MB  |
>     |         | No          | 146 MB     | 1500.460 ms    | work_mem = 4MB  |
>     |         | Yes         | 0 MB       | 3262.305 ms    | work_mem = 80MB |
>     |         | No          | 0 MB       | 3174.725 ms    | work_mem = 80MB |
>     +---------+-------------+------------+----------------+-----------------+
>     | C       | Yes         | 40 MB      | 1011.020 ms    | work_mem = 1MB  |
>     |         | No          | 53 MB      | 1034.142 ms    | work_mem = 1MB  |
>     +---------+-------------+------------+----------------+-----------------+
> 
> 

Thanks. I'll try to do some benchmarks on my own.

Are these results from a single run, or an average of multiple runs? Do
you maybe have a script to reproduce this, including the data generation?

Also, can you share some information about the machine used for this? I
expect the impact to strongly depend on memory pressure - if the temp
file fits into the page cache (and stays there), it may not benefit from
the compression, right?


regards

-- 
Tomas Vondra




Re: Proposal: Adding compression of temporary files

From: Alexander Korotkov
On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:
>
> I apologize for multiple messages, but I found a small bug in the previous version.
>
>     -Filip-

Great, thank you for your work.

I think the patches could use a pgindent run.

I don't see a reason why the temp file compression method should be
different from the wal compression methods, which we already have
in-tree.  Perhaps it would be nice to have a 0001 patch, which would
abstract the compression methods we now have for wal into a separate
file containing GUC option values and functions for
compress/decompress. Then, 0002 would apply this to temporary file
compression.

------
Regards,
Alexander Korotkov
Supabase



Re: Proposal: Adding compression of temporary files

From: Tomas Vondra
On 3/15/25 11:40, Alexander Korotkov wrote:
> On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:
>>
>> I apologize for multiple messages, but I found a small bug in the previous version.
>>
>>     -Filip-
> 
> Great, thank you for your work.
> 
> I think the patches could use a pgindent run.
> 
> I don't see a reason why the temp file compression method should be
> different from the wal compression methods, which we already have
> in-tree.  Perhaps it would be nice to have a 0001 patch, which would
> abstract the compression methods we now have for wal into a separate
> file containing GUC option values and functions for
> compress/decompress. Then, 0002 would apply this to temporary file
> compression.
> 

Not sure I understand the design you're proposing ...

AFAIK the WAL compression is not compressing the file data directly,
it's compressing backup blocks one by one, which then get written to WAL
as one piece of a record. So it's dealing with individual blocks, not
files, and we already have API to compress blocks (well, it's pretty
much the APIs for each compression method).

You're proposing abstracting that into a separate file - what would be
in that file? How would you abstract this to make it also useful for
file compression?

I can imagine a function CompressBuffer(method, dst, src, ...) wrapping
the various compression methods, unifying the error handling, etc. I can
imagine that, but that API is also limiting - e.g. how would that work
with stream compression, which seems irrelevant for WAL, but might be
very useful for tempfile compression.

IIRC this is mostly why we didn't try to do such generic API for pg_dump
compression, there's a local pg_dump-specific abstraction.
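A minimal sketch of what such a dispatching wrapper could look like (hypothetical names throughout; only the "none" method is implemented here as a plain copy, while the pglz/lz4 branches just mark where the real libraries would be called):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical method enum, mirroring the wal_compression GUC values. */
typedef enum
{
    TEMP_COMPRESSION_NONE,
    TEMP_COMPRESSION_PGLZ,
    TEMP_COMPRESSION_LZ4
} TempCompression;

/*
 * Sketch of a CompressBuffer(method, dst, src, ...) wrapper unifying the
 * return convention: compressed size on success, -1 on failure (unknown
 * method, destination too small, incompressible data, ...).  Only
 * TEMP_COMPRESSION_NONE actually does anything in this sketch.
 */
static int
CompressBuffer(TempCompression method, char *dst, size_t dstlen,
               const char *src, size_t srclen)
{
    switch (method)
    {
        case TEMP_COMPRESSION_NONE:
            if (srclen > dstlen)
                return -1;
            memcpy(dst, src, srclen);
            return (int) srclen;

        case TEMP_COMPRESSION_PGLZ:
        case TEMP_COMPRESSION_LZ4:
            /* would dispatch to pglz / LZ4 here in a real version */
            return -1;

        default:
            return -1;
    }
}
```

As noted above, such a block-oriented signature cannot express streaming compression, which keeps per-file compressor state between calls, so it may be the wrong abstraction for temp files even if it fits WAL.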


FWIW looking at the patch, I still don't quite understand why it needs
to correct the offset like this:

+    if (!file->compress)
+        file->curOffset -= (file->nbytes - file->pos);
+    else
+        if (nbytesOriginal - file->pos != 0)
+            /* curOffset must be corrected also if compression is
+             * enabled, nbytes was changed by compression but we
+             * have to use the original value of nbytes
+             */
+            file->curOffset -= bytestowrite;

It's not something introduced by the compression patch - the first part
is what we used to do before. But I find it a bit confusing - isn't it
mixing the correction of "logical file position" adjustment we did
before, and also the adjustment possibly needed due to compression?

In fact, isn't it going to fail if the code gets multiple loops in

    while (wpos < file->nbytes)
    {
       ...
    }

because bytestowrite will be the value from the last loop? I haven't
tried, but I guess writing wide tuples (more than 8k) might fail.


regards

-- 
Tomas Vondra




Re: Proposal: Adding compression of temporary files

From: Alexander Korotkov
On Tue, Mar 18, 2025 at 12:13 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 3/15/25 11:40, Alexander Korotkov wrote:
> > On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:
> >>
> >> I apologize for multiple messages, but I found a small bug in the previous version.
> >>
> >>     -Filip-
> >
> > Great, thank you for your work.
> >
> > I think the patches could use a pgindent run.
> >
> > I don't see a reason why the temp file compression method should be
> > different from the wal compression methods, which we already have
> > in-tree.  Perhaps it would be nice to have a 0001 patch, which would
> > abstract the compression methods we now have for wal into a separate
> > file containing GUC option values and functions for
> > compress/decompress. Then, 0002 would apply this to temporary file
> > compression.
> >
>
> Not sure I understand the design you're proposing ...
>
> AFAIK the WAL compression is not compressing the file data directly,
> it's compressing backup blocks one by one, which then get written to WAL
> as one piece of a record. So it's dealing with individual blocks, not
> files, and we already have API to compress blocks (well, it's pretty
> much the APIs for each compression method).
>
> You're proposing abstracting that into a separate file - what would be
> in that file? How would you abstract this to make it also useful for
> file compression?
>
> I can imagine a function CompressBuffer(method, dst, src, ...) wrapping
> the various compression methods, unifying the error handling, etc. I can
> imagine that, but that API is also limiting - e.g. how would that work
> with stream compression, which seems irrelevant for WAL, but might be
> very useful for tempfile compression.

Yes, I was thinking about some generic API that provides a safe way to
compress a data chunk with a given compression method.  It seems it
should suit WAL, TOAST, and temp files in the current implementation.
But, yes, if we implement streaming compression in the future, that
would require another API.

------
Regards,
Alexander Korotkov
Supabase



Re: Proposal: Adding compression of temporary files

From: Filip Janus




On Mon, Mar 17, 2025 at 11:13 PM Tomas Vondra <tomas@vondra.me> wrote:
> On 3/15/25 11:40, Alexander Korotkov wrote:
> > On Sun, Jan 5, 2025 at 1:43 AM Filip Janus <fjanus@redhat.com> wrote:
> >>
> >> I apologize for multiple messages, but I found a small bug in the previous version.
> >>
> >>     -Filip-
> >
> > Great, thank you for your work.
> >
> > I think the patches could use a pgindent run.
> >
> > I don't see a reason why the temp file compression method should be
> > different from the wal compression methods, which we already have
> > in-tree.  Perhaps it would be nice to have a 0001 patch, which would
> > abstract the compression methods we now have for wal into a separate
> > file containing GUC option values and functions for
> > compress/decompress. Then, 0002 would apply this to temporary file
> > compression.
> >
>
> Not sure I understand the design you're proposing ...
>
> AFAIK the WAL compression is not compressing the file data directly,
> it's compressing backup blocks one by one, which then get written to WAL
> as one piece of a record. So it's dealing with individual blocks, not
> files, and we already have API to compress blocks (well, it's pretty
> much the APIs for each compression method).
>
> You're proposing abstracting that into a separate file - what would be
> in that file? How would you abstract this to make it also useful for
> file compression?
>
> I can imagine a function CompressBuffer(method, dst, src, ...) wrapping
> the various compression methods, unifying the error handling, etc. I can
> imagine that, but that API is also limiting - e.g. how would that work
> with stream compression, which seems irrelevant for WAL, but might be
> very useful for tempfile compression.
>
> IIRC this is mostly why we didn't try to do such generic API for pg_dump
> compression, there's a local pg_dump-specific abstraction.
>
>
> FWIW looking at the patch, I still don't quite understand why it needs
> to correct the offset like this:
>
> +    if (!file->compress)
> +        file->curOffset -= (file->nbytes - file->pos);

This line of code is really confusing to me, and I wasn't able to fully
understand why it must be done. I experimented with it, though, and if I
remember correctly, it is triggered (the result differs from 0) mainly in
the last call of the BufFileDumpBuffer function for a single data chunk.

> +    else
> +        if (nbytesOriginal - file->pos != 0)
> +            /* curOffset must be corrected also if compression is
> +             * enabled, nbytes was changed by compression but we
> +             * have to use the original value of nbytes
> +             */
> +            file->curOffset -= bytestowrite;
>
> It's not something introduced by the compression patch - the first part
> is what we used to do before. But I find it a bit confusing - isn't it
> mixing the correction of "logical file position" adjustment we did
> before, and also the adjustment possibly needed due to compression?
>
> In fact, isn't it going to fail if the code gets multiple loops in
>
>     while (wpos < file->nbytes)
>     {
>        ...
>     }
>
> because bytestowrite will be the value from the last loop? I haven't
> tried, but I guess writing wide tuples (more than 8k) might fail.

I will definitely test it with tuples larger than 8 kB.

Maybe I don't understand it correctly. The adjustment is performed when
file->nbytes and file->pos differ, so it must remain even when we are
working with compressed data. The problem is that the data stored
compressed on disk has a different size than the incoming uncompressed
data, so what should the correction value be? By debugging, I realized
that the correction should correspond to the size of bytestowrite from
the last iteration of the loop.


regards

--
Tomas Vondra