Thread: [Proposal] Page Compression for OLTP

[Proposal] Page Compression for OLTP

From
chenhj
Date:
Hello hackers!

This email proposes implementing OLTP-oriented page compression in PostgreSQL.

There are already alternative ways to achieve compression, such as using a file system that supports it.
However, that approach depends on a specific file system, is not suitable for all deployment environments, and increases the complexity of deployment and maintenance.

I hope this compression work can meet the following goals:
1. In most scenarios, the compressed table is less than 50% of the original table's size
2. It is mainly oriented to OLTP scenarios, and the performance impact on workloads with frequent reads and writes is relatively small
3. It does not rely on special external software or hardware that is difficult to obtain
4. It is friendly to application developers and database administrators
5. The changes required to PostgreSQL are small


I have noticed some previous discussion and work related to compression, but it does not meet the above goals.
For example:
1. Use sparse files [1]
   This is also how MySQL 5.7 implements transparent page compression. However, sparse files may generate a lot of fragmentation inside the file system, and the "compressed" data files will be restored to their original size after a physical backup and restore, unless the backup and recovery tools also support sparse files.
2. Use the table access method interface (TAM) (pg_cryogen, zedstore) [2] [3]
   Neither storage engine is geared towards OLTP scenarios. It is preferable to make relatively small modifications to the existing heap engine code (mainly by modifying md.c and fd.c).


The methods proposed by Postgres Pro Enterprise CFS [4] and by Nikolay P [5] are close to my needs.
However, I would like to propose a slightly different implementation, which does not require space reclamation. Any suggestions are welcome.

# Assumptions
1. Most pages in a compressed table can be compressed to less than 50% of their original size
   As long as an algorithm with a relatively high compression ratio (such as zlib or zstd) is used, this should be easy to meet, unless the table stores already-compressed data such as images.
2. The compression ratios of most pages in the same table are relatively close


# Page storage

Each segment of each main fork gets 3 files for storing compressed data. The main fork segment file (for example: 123456.2) still exists, but its size is 0.

- Compressed data file (for example: 123456.2.cd)
   Stores the compressed pages. The block size of this file is configurable at the table level, but it can only be 1/2, 1/4 or 1/8 of BLCKSZ.
- Compress overflow address file (for example: 123456.2.coa)
   When a page cannot be compressed to less than the compressed block size, this file stores the addresses of its overflow blocks.
- Compress overflow data file (for example: 123456.2.cod)
   When a page cannot be compressed to less than the compressed block size, this file stores its overflow blocks.
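
To make the mapping concrete, the address arithmetic implied by this layout looks roughly as follows. This is only an illustrative sketch: the function names are mine, and the size of the .coa file header is an assumed placeholder.

#include <stdint.h>
#include <sys/types.h>

#define COA_HEADER_SIZE 512        /* assumed size of the .coa header, for illustration */

typedef uint32_t BlockNumber;

/* Offset of a block's (compressed or raw) data inside 123456.N.cd */
static inline off_t
cd_offset(BlockNumber blkno, uint32_t compressed_block_size)
{
    return (off_t) blkno * compressed_block_size;
}

/* Offset of a block's entry inside 123456.N.coa */
static inline off_t
coa_entry_offset(BlockNumber blkno, uint32_t entry_size)
{
    return COA_HEADER_SIZE + (off_t) blkno * entry_size;
}

/* Offset of overflow block number 'overflow_blkno' (1-based, 0 = none) inside 123456.N.cod */
static inline off_t
cod_offset(uint32_t overflow_blkno, uint32_t compressed_block_size)
{
    return (off_t) (overflow_blkno - 1) * compressed_block_size;
}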

The following is an example when the compressed block size is 4K, which is 1/2 of BLCKSZ.

## Scenario 1: The compressed size of the original page (including the header of the compressed page) is less than or equal to the compressed block size (4KB)

Compressed data files(123456.2.cd):
 0       1       2
 +=======+=======+=======+
 | data0 | data1 | data2 |  
 +=======+=======+=======+
->|   4K  |<-


## Scenario 2: The compressed size of the original page (including the header of the compressed page) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, it will not be compressed. The first 4KB of the original page is stored in the compressed data file, and the last 4KB is stored in the compress overflow data file.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 4K  |<-

Compress overflow address file(123456.2.coa):
For each block, the compress overflow address file stores the block number of the compress overflow block assigned to it, plus 1 (so 0 means no overflow block).
The compressed block size and the number of blocks the compress overflow data file has been extended to are stored in the header of the compress overflow address file.

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       |   1   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file:          |
      _______________________________|
     |
 0   |    1       2       3
 +===|=====+=========+==========+=========+
 | data3_2 |         |          |         |
 +=========+=========+==========+=========+
->| 2nd 4K  |<-


If the compressed block size is 1/4 or 1/8 of BLCKSZ, a block that cannot be compressed enough may require multiple compress overflow blocks.
The following is an example when the compressed block size is 2K, which is 1/4 of BLCKSZ.

## Scenario 3: The compressed size of the original page (including the header of the compressed page) is larger than 2KB (the compressed block size) but less than 6KB (BLCKSZ - compressed block size)

In this case, the data files store compressed data, and at least 2KB of storage space is saved.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 |          |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  |          |         |


## Scenario 4: The compressed size of the original page (including the header of the compressed page) is larger than 6KB (BLCKSZ - compressed block size)

In this case, the data files store the original 8KB page, as in Scenario 2.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2,3 |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 | data3_4  |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  | 4th 2K   |         |


# How to distinguish between compressed and uncompressed blocks in compressed data files?

PostgreSQL heap pages have a uniform header. At first I considered adding compression-related flags to that header.
However, there is a problem: when the compressed size of a page changes so that it switches from compressed format to uncompressed format, or the other way around,
the header of the original page would have to be modified, which means not only recalculating the checksum but also updating the buffer.

However, I noticed that the first 4 bytes of each page are the high part of pd_lsn.
Therefore, an oversized LSN value that cannot appear in a real environment can be used as the magic number of a compressed page.
The envisaged compressed page header is defined as follows:

typedef struct
{
    uint32    magic;        /* compress page magic number, must be 0xfffffffe */
    uint8     algorithm;    /* 1=pglz, 2=zstd ... */
    uint8     flag;         /* reserved */
    uint16    size;         /* size after compression */
} PageCompressHead;
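
As a standalone sketch of how a reader could then tell the two formats apart (this is illustrative code, not the final implementation):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_COMPRESS_MAGIC 0xfffffffe

/*
 * A block read from the compressed data file starts either with a normal
 * page header, whose first 4 bytes are the high half of pd_lsn, or with
 * the compressed-page header above.  Since an LSN whose high 32 bits are
 * 0xfffffffe cannot occur in practice, the first 4 bytes are enough to
 * disambiguate.
 */
static bool
block_is_compressed(const char *block)
{
    uint32_t first4;

    memcpy(&first4, block, sizeof(first4));
    return first4 == PAGE_COMPRESS_MAGIC;
}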


# How to manage block space in compress overflow data files?

Once overflow block x in the compress overflow data file has been allocated to block a, it always belongs to block a, even if block a later compresses to a smaller size and overflow block x is no longer used.

This approach simplifies the space management of compress overflow blocks, but fragmentation may occur, especially when the compressed block size is 1/4 or 1/8 of BLCKSZ.
However, fragmentation only appears when the compressed size of the same block changes greatly and frequently.

Consider the following situation: if only one record has been inserted into a page when it is written to disk, the compression ratio will be very high and only one compressed block is required.
As new rows are written later, the number of compressed blocks required grows to 2, 3, 4, ... These overflow blocks are not allocated at once, so they are likely not sequential in the compress overflow data file, resulting in more fragmentation.

We can avoid this problem with a table-level setting for the number of pre-allocated compressed blocks.
When the number of compressed blocks required by a compressed page is less than this value, space is allocated according to the pre-allocation setting.

And no matter how severe the fragmentation, the total space occupied by the compressed table cannot exceed that of the original uncompressed table.
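
A minimal sketch of this allocation rule (the function and parameter names are mine and purely illustrative):

#include <stdint.h>

#define BLCKSZ 8192

/*
 * Number of compressed blocks a page will occupy, honouring the table-level
 * pre-allocation setting described above.  compressed_size includes the
 * compressed-page header; the first block lives in the .cd file and any
 * further blocks are overflow blocks in the .cod file.
 */
static uint32_t
blocks_to_allocate(uint32_t compressed_size,
                   uint32_t compressed_block_size,
                   uint32_t prealloc_blocks)
{
    uint32_t needed;

    /* pages that compress badly are stored uncompressed (Scenarios 2 and 4) */
    if (compressed_size > BLCKSZ - compressed_block_size)
        compressed_size = BLCKSZ;

    /* round up to whole compressed blocks */
    needed = (compressed_size + compressed_block_size - 1) / compressed_block_size;

    /* never allocate fewer blocks than the configured pre-allocation */
    if (needed < prealloc_blocks)
        needed = prealloc_blocks;

    return needed;
}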

# Impact on other parts of PostgreSQL?
1. pg_basebackup / pg_checksum needs to handle checksum verification according to the new compression format
2. pg_rewind needs to handle the copying of data blocks according to the new compression format

# Problems
This solution simplifies compressed storage space management, but it also has the following drawbacks:
1. The space saved by compression is limited by the compressed block size.
  For example, when the compressed block size is set to 4KB, at most 50% of the space can be saved.
  For insert-only tables, the compressed block size can be set to BLCKSZ / 8 to alleviate this; but for write-heavy scenarios, this easily generates fragmentation and increases the number of IOs.
2. Accessing a page that fits into one compressed block requires only one IO, but accessing a page that does not requires multiple IOs.
  Generally it is 3 IOs; since the address file is very small and should stay almost entirely in memory, it is 2 IOs if the address file is not counted.


I think the above issues are a necessary trade-off. Any suggestions are welcome.

# references
[1] https://www.postgresql.org/message-id/flat/op.ux8if71gcigqcu%40soyouz
[2] https://www.postgresql.org/message-id/CAONYFtNDNghfatnrhOJMOT=BXNbEiobHFABt2sx_cn2=5t=1_Q@mail.gmail.com
[3] https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com
[4] https://postgrespro.com/docs/enterprise/9.6/cfs
[5] https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net


Best Regards
Chen Huajun

Re: [Proposal] Page Compression for OLTP

From
Fabien COELHO
Date:
Hello,

My 0.02€, some of which may just show some misunderstanding on my part:

  - you have clearly given quite a few thoughts about the what and how…
    which makes your message an interesting read.

  - Could this be proposed as some kind of extension, provided that enough
    hooks are available? ISTM that foreign tables and/or alternative
    storage engine (aka ACCESS METHOD) provide convenient APIs which could
    fit the need for these? Or are they not appropriate? You seem to
    suggest that there are not.

    If not, what could be done to improve API to allow what you are seeking
    to do? Maybe you need a somehow lower-level programmable API which does
    not exist already, or at least is not exported already, but could be
    specified and implemented with limited effort? Basically you would like
    to read/write pg pages to somewhere, and then there is the syncing
    issue to consider. Maybe such a "page storage" API could provide
    benefit for some specialized hardware, eg persistent memory stores,
    so there would be more reason to define it anyway? I think it might
    be valuable to give it some thoughts.

  - Could you maybe elaborate on how your plan differs from [4] and [5]?

  - Have you considered keeping page headers and compressing tuple data
    only?

  - I'm not sure there is a point in going below the underlying file
    system blocksize, quite often 4 KiB? Or maybe yes? Or is there
    a benefit to aim at 1/4 even if most pages overflow?

  - ISTM that your approach entails 3 "files". Could it be done with 2?
    I'd suggest that the possible overflow pointers (coa) could be part of
    the headers so that when reading the 3.1 page, then the header would
    tell where to find the overflow 3.2, without requiring an additional
    independent structure with very small data in it, most of it zeros.
    Possibly this is not possible, because it would require some available
    space in standard headers when the page is not compressible, and
    there is not enough. Maybe creating a little room for that in
    existing headers (4 bytes could be enough?) would be a good compromise.
    Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
    but not for other target ratios, but I think it could be made to work
    if the pointer can entail several blocks in the overflow table.

  - If one page is split in 3 parts, could it create problems on syncing,
    if 1/3 or 2/3 of the parts get written? But maybe that is manageable
    with WAL, as it would note that the page was not synced and that is
    enough for replay.

  - I'm unclear how you would manage the 2 representations of a page in
    memory. I'm afraid that a 8 KiB page compressed to 4 KiB would
    basically take 12 KiB, i.e. reduce the available memory for caching
    purposes. Hmmm. The current status is that a written page probably
    takes 16 KiB, once in shared buffers and once in the system caches,
    so it would be an improvement anyway.

  - Maybe the compressed and overflow table could become bloated somehow,
    which would require a vacuuming implementation and add to the
    complexity of the implementation?

  - External tools should be available to allow page inspection, eg for
    debugging purposes.

-- 
Fabien.


Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:

At 2020-05-21 15:04:55, "Fabien COELHO" <coelho@cri.ensmp.fr> wrote:

>
>Hello,
>
>My 0.02, some of which may just show some misunderstanding on my part:
>
>  - Could this be proposed as some kind of extension, provided that enough
>    hooks are available? ISTM that foreign tables and/or alternative
>    storage engine (aka ACCESS METHOD) provide convenient APIs which could
>    fit the need for these? Or are they not appropriate? You seem to
>    suggest that there are not.
>
>    If not, what could be done to improve API to allow what you are seeking
>    to do? Maybe you need a somehow lower-level programmable API which does
>    not exist already, or at least is not exported already, but could be
>    specified and implemented with limited effort? Basically you would like
>    to read/write pg pages to somewhere, and then there is the syncing
>    issue to consider. Maybe such a "page storage" API could provide
>    benefit for some specialized hardware, eg persistent memory stores,
>    so there would be more reason to define it anyway? I think it might
>    be valuable to give it some thoughts.

Thank you for giving so many comments.
In my opinion, developing a foreign table or a new storage engine requires a lot of extra work beyond compression itself.
A similar explanation was mentioned in Nikolay P's email.

The "page storage" API may be a good choice, and I will consider it, but I have not yet figured out how to implement it.

>  - Could you maybe elaborate on how your plan differs from [4] and [5]?

My solution is similar to CFS: it is also embedded in the file access layer (fd.c, md.c) to implement the mapping from block number to the file and location where the compressed data is stored.

However, the most important difference is that I hope to avoid the need for GC through the design of the page layout.

https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net

>> The most difficult thing in CFS development is certainly
>> defragmentation. In CFS it is done using background garbage collection,
>> by one or one
>> GC worker processes. The main challenges were to minimize its
>> interaction with normal work of the system, make it fault tolerant and
>> prevent unlimited growth of data segments.

>> CFS is not introducing its own storage manager, it is mostly embedded in
>> existed Postgres file access layer (fd.c, md.c). It allows to reused
>> code responsible for mapping relations and file descriptors cache. As it
>> was recently discussed in hackers, it may be good idea to separate the
>> questions "how to map blocks to filenames and offsets" and "how to
>> actually perform IO". In this it will be easier to implement compressed
>> storage manager.


>  - Have you consider keeping page headers and compressing tuple data
>    only?

In that case, we must add some additional information to the page header to identify whether the page is compressed or uncompressed.
When a compressed page becomes an uncompressed page, or vice versa, the original page header must be modified.
This is unacceptable because it requires modifying the shared buffer and recalculating the checksum.

However, it should be feasible to put this flag in the compressed address file.
The problem with this is that even if a page occupies only one compressed block, the address file still needs to be read, going from 1 IO to 2 IOs.
Since the address file is very small and is basically a memory access, this cost may not be as large as I had imagined.

>  - I'm not sure there is a point in going below the underlying file
>    system blocksize, quite often 4 KiB? Or maybe yes? Or is there
>    a benefit to aim at 1/4 even if most pages overflow?

My solution is mainly optimized for the scenario where the original page can be compressed into a single compressed block.
Storing an original page in multiple compressed blocks suits workloads that are not particularly performance sensitive but care more about the compression ratio, such as cold data.

In addition, users can also choose to compile PostgreSQL with a 16KB or 32KB BLCKSZ.

>  - ISTM that your approach entails 3 "files". Could it be done with 2?
>    I'd suggest that the possible overflow pointers (coa) could be part of
>    the headers so that when reading the 3.1 page, then the header would
>    tell where to find the overflow 3.2, without requiring an additional
>    independent structure with very small data in it, most of it zeros.
>    Possibly this is not possible, because it would require some available
>    space in standard headers when the is page is not compressible, and
>    there is not enough. Maybe creating a little room for that in
>    existing headers (4 bytes could be enough?) would be a good compromise.
>    Hmmm. Maybe the approach I suggest would only work for 1/2 compression,
>    but not for other target ratios, but I think it could be made to work
>    if the pointer can entail several blocks in the overflow table.

My solution is optimized for the scenario where the original page can be compressed into a single compressed block.
In this scenario, only 1 IO is required for reading or writing, and there is no need to access the overflow address file or the overflow data file.

Your suggestion reminded me that the performance difference may not be as big as I thought (testing and comparison are required). If I give up the pursuit of "only one IO", the file layout can be simplified.

For example, it could be simplified to the following form with only two files (the example below uses a compressed block size of 4KB).

# Page storage(Plan B)

Use the compress address file to store the compressed block pointers, and the compress data file to store the compressed block data.

compress address file:
 
        0       1       2       3
+=======+=======+=======+=======+=======+
| head  |  1    |    2  | 3,4   |   5   |
+=======+=======+=======+=======+=======+

The compress address file saves the following information for each page:

- Compressed size (a size of 0 means the page is stored uncompressed)
- Block numbers occupied in the compress data file

By the way, I want to access the compress address file through mmap, just like snapfs
https://github.com/postgrespro/snapfs/blob/pg_snap/src/backend/storage/file/snapfs.c

Compress data file:

0         1         2          3         4
+=========+=========+==========+=========+=========+
| data1   | data2   | data3_1  | data3_2 | data4   |
+=========+=========+==========+=========+=========+
|    4K   |


# Page storage(Plan C)

Further, since the size of the compress address file is fixed, the above address file and data file can also be combined into one file

        0       1       2     131071    0         1         2
+=======+=======+=======+     +=======+=========+=========+
| head  |  1    |    2  | ... |       | data1   | data2   | ...  
+=======+=======+=======+     +=======+=========+=========+
  head  |              address        |          data          |

If the difference in performance is so negligible, maybe Plan C is a better solution. (Are there any other problems?)
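
To make Plan C concrete, the offset arithmetic for the combined file could look roughly like this. The header size and address-entry size below are assumptions for illustration, while 131072 is simply the number of blocks in a 1GB segment with the default 8KB BLCKSZ.

#include <stdint.h>
#include <sys/types.h>

#define RELSEG_BLOCKS      131072   /* blocks per 1GB segment with 8KB BLCKSZ */
#define PC_HEADER_SIZE     512      /* assumed size of the file header */
#define PC_ADDR_ENTRY_SIZE 32       /* assumed size of one address entry */

/* Offset of the address entry for block 'blkno'; the address area follows the header */
static inline off_t
plan_c_addr_offset(uint32_t blkno)
{
    return PC_HEADER_SIZE + (off_t) blkno * PC_ADDR_ENTRY_SIZE;
}

/*
 * Offset of compressed block number 'datablockno' (1-based) in the data area.
 * The data area starts at a fixed position because a segment holds at most
 * RELSEG_BLOCKS blocks, so the address area has a fixed size.
 */
static inline off_t
plan_c_data_offset(uint32_t datablockno, uint32_t compressed_block_size)
{
    off_t data_start = PC_HEADER_SIZE + (off_t) RELSEG_BLOCKS * PC_ADDR_ENTRY_SIZE;

    return data_start + (off_t) (datablockno - 1) * compressed_block_size;
}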

>
>  - Maybe the compressed and overflow table could become bloated somehow,
>    which would require the vaccuuming implementation and add to the
>    complexity of the implementation?
>

Vacuuming is what I try to avoid.

As I explained in the first email, even without vacuum, bloating should not become a serious problem.

>>However, the fragment will only appear in the scene where the size of the same block is frequently changed greatly after compression.
>>...
>>And no matter how severe the fragmentation, the total space occupied by the compressed table cannot be larger than the original table before compression.

Best Regards
Chen Huajun

Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:
Hi hackers,


> # Page storage(Plan C)
>
> Further, since the size of the compress address file is fixed, the above address file and data file can also be combined into one file
>
>         0       1       2     131071    0         1         2
> +=======+=======+=======+     +=======+=========+=========+
> | head  |  1    |    2  | ... |       | data1   | data2   | ...  
> +=======+=======+=======+     +=======+=========+=========+
>   head  |              address        |          data          |

I made a prototype according to the above storage method. Any suggestions are welcome.

# Page compress file storage related definitions

/*
 * layout of Page Compress file:
 *
 * - PageCompressHeader
 * - PageCompressAddr[]
 * - chunks of PageCompressData
 *
 */
typedef struct PageCompressHeader
{
    pg_atomic_uint32  nblocks;           /* number of total blocks in this segment */
    pg_atomic_uint32  allocated_chunks;  /* number of total allocated chunks in data area */
    uint16            chunk_size;        /* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
    uint8             algorithm;         /* compress algorithm, 1=pglz, 2=lz4 */
} PageCompressHeader;

typedef struct PageCompressAddr
{
    uint8             nchunks;           /* number of chunks for this block */
    uint8             allocated_chunks;  /* number of allocated chunks for this block */

    /* variable-length field: 1-based chunk numbers for this block; array size must be 2, 4 or 8 */
    pc_chunk_number_t chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

typedef struct PageCompressData
{
    char    page_header[SizeOfPageHeaderData]; /* page header */
    uint32  size;                              /* size of compressed data */
    char    data[FLEXIBLE_ARRAY_MEMBER];       /* compressed page, except for the page header */
} PageCompressData;
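
As an illustration of the read path these structures imply — this is a sketch rather than code from the prototype; it assumes the structs above plus PostgreSQL's BLCKSZ and SizeOfPageHeaderData, and uses the standard libzstd API:

#include <stdbool.h>
#include <string.h>
#include <zstd.h>

/*
 * Rebuild a BLCKSZ heap page from its PageCompressData representation
 * (chunks already read and assembled contiguously).  Only the zstd case
 * is sketched; pglz would use pglz_decompress instead.
 */
static bool
decompress_page(const PageCompressData *pcdata, char *dst)
{
    size_t decompressed;

    /* the original page header is stored uncompressed */
    memcpy(dst, pcdata->page_header, SizeOfPageHeaderData);

    /* the rest of the page is compressed as one blob of 'size' bytes */
    decompressed = ZSTD_decompress(dst + SizeOfPageHeaderData,
                                   BLCKSZ - SizeOfPageHeaderData,
                                   pcdata->data,
                                   pcdata->size);
    if (ZSTD_isError(decompressed))
        return false;

    return decompressed == BLCKSZ - SizeOfPageHeaderData;
}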


# Usage

Whether to use compression is set through the storage parameters of tables and indexes:

- compress_type
  Sets whether to compress and which compression algorithm is used; supported values: none, pglz, zstd

- compress_chunk_size

  A chunk is the smallest unit of storage space allocated for compressed pages.
  The chunk size can only be 1/2, 1/4 or 1/8 of BLCKSZ

- compress_prealloc_chunks

  The number of chunks pre-allocated for each page. The maximum value allowed is BLCKSZ/compress_chunk_size - 1.
  If the number of chunks required for a compressed page is less than `compress_prealloc_chunks`,
  `compress_prealloc_chunks` chunks are allocated, to avoid future storage fragmentation when the page needs more storage space.


# Sample

## requirement

- zstd

## build

./configure --with-zstd
make
make install

## create compressed table and index

create table tb1(id int,c1 text);
create table tb1_zstd(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024);
create table tb1_zstd_4(id int,c1 text) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_id on tb1(id);
create index tb1_idx_id_zstd on tb1(id) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_id_zstd_4 on tb1(id) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

create index tb1_idx_c1 on tb1(c1);
create index tb1_idx_c1_zstd on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024);
create index tb1_idx_c1_zstd_4 on tb1(c1) with(compress_type=zstd,compress_chunk_size=1024,compress_prealloc_chunks=4);

insert into tb1 select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd select generate_series(1,1000000),md5(random()::text);
insert into tb1_zstd_4 select generate_series(1,1000000),md5(random()::text);

## show size of table and index

postgres=# \d+
                            List of relations
Schema |    Name    | Type  |  Owner   | Persistence | Size  | Description
--------+------------+-------+----------+-------------+-------+-------------
public | tb1        | table | postgres | permanent   | 65 MB |
public | tb1_zstd   | table | postgres | permanent   | 37 MB |
public | tb1_zstd_4 | table | postgres | permanent   | 37 MB |
(3 rows)

postgres=# \di+
                                    List of relations
Schema |       Name        | Type  |  Owner   | Table | Persistence | Size  | Description
--------+-------------------+-------+----------+-------+-------------+-------+-------------
public | tb1_idx_c1        | index | postgres | tb1   | permanent   | 73 MB |
public | tb1_idx_c1_zstd   | index | postgres | tb1   | permanent   | 36 MB |
public | tb1_idx_c1_zstd_4 | index | postgres | tb1   | permanent   | 41 MB |
public | tb1_idx_id        | index | postgres | tb1   | permanent   | 21 MB |
public | tb1_idx_id_zstd   | index | postgres | tb1   | permanent   | 13 MB |
public | tb1_idx_id_zstd_4 | index | postgres | tb1   | permanent   | 15 MB |
(6 rows)


# pgbench performance testing(TPC-B)

Compress the pgbench_accounts table and its primary key index.
The compression parameters are (compress_type=zstd, compress_chunk_size=1024).
Then compare the performance difference between the original table and the compressed table.

test command:

   pgbench -i -s 1000
   pgbench -n -T 300 -c 16 -j 16 db1

tps comparison:

   original table  :20081
   compressed table:19984


Comparison of storage space:

                       original    compressed(before benchmark)   compressed(after benchmark*)
pgbench_accounts        13 GB       1660 MB                        1711 MB
pgbench_accounts_pkey   2142 MB     738 MB                         816 MB

*note: After the benchmark, some compressed pages need 2 chunks to store their data


# TODO list

1. support setting of compress level
2. support ALTER TABLE/INDEX xx set(...)
3. support checksum in pg_basebackup, pg_checksum and replication
4. support pg_rewind
5. information output for compressed pages' metadata


# Problem

When compress_chunk_size=1024, about 4MB of space is needed to store the addresses,
which causes small files to become larger after compression.

The solution considered is as follows:
split the addresses and data of compressed pages into two files, and extend the address file on demand as well, allocating at least one BLCKSZ per extension.


Best Regards
Chen Huajun
Attachment

Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:
Hi hackers,

I have further improved this patch, adjusted some of the design, and added related modifications
(pg_rewind, replication, checksum, backup) and basic tests. Any suggestions are welcome.

this patch can also be obtained from here
https://github.com/ChenHuajun/postgres/tree/page_compress_14

# 1. Page storage

The compressed data block is stored in one or more chunks of the compressed data file,
and the size of each chunk is 1/8, 1/4, or 1/2 of the block size.
The storage location of each compressed data block is represented by an array of chunk numbers
stored in the compressed address file.

## 1.1 page compressed address file(_pca)

     blk0       1       2       3
+=======+=======+=======+=======+=======+
| head  |  1    |    2  | 3,4   |   5   |
+=======+=======+=======+=======+=======+

## 1.2 page compressed data file(_pcd)

chunk1    2         3          4         5
+=========+=========+==========+=========+=========+
| blk0    | blk1    | blk2_1   | blk2_2  | blk3    |
+=========+=========+==========+=========+=========+
|    4K   |
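
For illustration, the read path implied by this layout is roughly the following. This is a sketch rather than the patch code: it reuses the PageCompressAddr struct from my previous mail, assumes the pcd file stores chunks back to back from offset 0, and ignores error handling and locking.

#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read all chunks of one block from the pcd file into a contiguous buffer,
 * using the chunk list looked up in the pca file.  The caller then
 * decompresses the buffer if the block was stored compressed.
 */
static bool
read_compressed_block(int pcd_fd, const PageCompressAddr *addr,
                      uint16_t chunk_size, char *buf)
{
    for (uint8_t i = 0; i < addr->nchunks; i++)
    {
        /* chunk numbers are 1-based; chunks are assumed to start at offset 0 */
        off_t offset = (off_t) (addr->chunknos[i] - 1) * chunk_size;

        if (pread(pcd_fd, buf + (size_t) i * chunk_size, chunk_size, offset) != (ssize_t) chunk_size)
            return false;
    }
    return true;
}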


# 2. Usage

## 2.1 Set whether to use compression through storage parameters of tables and indexes

- compresstype
  Sets whether to compress and which compression algorithm is used; supported values: none, pglz, zstd

- compresslevel
  Sets the compression level (only zstd supports it)
 
- compress_chunk_size

  A chunk is the smallest unit of storage space allocated for compressed pages.
  The chunk size can only be 1/2, 1/4 or 1/8 of BLCKSZ

- compress_prealloc_chunks

  The number of chunks pre-allocated for each page. The maximum value allowed is BLCKSZ/compress_chunk_size - 1.
  If the number of chunks required for a compressed page is less than `compress_prealloc_chunks`,
  `compress_prealloc_chunks` chunks are allocated, to avoid future storage fragmentation when the page needs more storage space.

example:
CREATE TABLE tbl_pc(id int, c1 text) WITH(compresstype=zstd, compresslevel=0, compress_chunk_size=1024, compress_prealloc_chunks=2);
CREATE INDEX tbl_pc_idx1 on tbl_pc(c1) WITH(compresstype=zstd, compresslevel=1, compress_chunk_size=4096, compress_prealloc_chunks=0);


## 2.2 Set default compression option when create table in specified tablespace

- default_compresstype
- default_compresslevel
- default_compress_chunk_size
- default_compress_prealloc_chunks

note: temp tables and unlogged tables are not affected by the above 4 parameters

example:
ALTER TABLESPACE pg_default SET(default_compresstype=zstd, default_compresslevel=2, default_compress_chunk_size=1024, default_compress_prealloc_chunks=2);


## 2.3 View the storage location of each block of the compressed table

Some functions are added to pageinspect to inspect compressed relations:

- get_compress_address_header(relname text, segno integer)
- get_compress_address_items(relname text, segno integer)

example:
SELECT nblocks, allocated_chunks, chunk_size, algorithm FROM get_compress_address_header('test_compressed',0);
 nblocks | allocated_chunks | chunk_size | algorithm 
---------+------------------+------------+-----------
       1 |               20 |       1024 |         1
(1 row)

SELECT * FROM get_compress_address_items('test_compressed',0);
 blkno | nchunks | allocated_chunks |   chunknos    
-------+---------+------------------+---------------
     0 |       0 |                4 | {1,2,3,4}
     1 |       0 |                4 | {5,6,7,8}
     2 |       0 |                4 | {9,10,11,12}
     3 |       0 |                4 | {13,14,15,16}
     4 |       0 |                4 | {17,18,19,20}
(5 rows)

## 2.4 Compare the compression ratio of different compression algorithms and compression levels

A new function in pageinspect can compare the compression ratios of different compression algorithms and compression levels.
This helps determine which compression parameters to use.

- page_compress(page bytea, algorithm text, level integer)

example:
postgres=# SELECT blk,octet_length(page_compress(get_raw_page('test_compressed', 'main', blk), 'pglz', 0)) compressed_size from generate_series(0,4) blk;
 blk | compressed_size
-----+-----------------
   0 |            3234
   1 |            3516
   2 |            3515
   3 |            3515
   4 |            1571
(5 rows)

postgres=# SELECT blk,octet_length(page_compress(get_raw_page('test_compressed', 'main', blk), 'zstd', 0)) compressed_size from generate_series(0,4) blk;
 blk | compressed_size
-----+-----------------
   0 |            1640
   1 |            1771
   2 |            1801
   3 |            1813
   4 |             806
(5 rows)


# 3. How to ensure crash safety

For convenience of implementation, WAL is not written when chunk space is allocated in the compressed address file.
Therefore, if postgres crashes during space allocation, incomplete data may remain in the compressed address file.

In order to ensure the data consistency of the compressed address file, the following measures have been taken:

1. Divide the compressed address file into several 512-byte areas. The address data of each data block is stored in only one area
   and does not cross an area boundary, preventing half of an address from being persisted while the other half is not.
2. When allocating chunk space, write the address information to the address file in a fixed order, so that no inconsistent state is visible midway
   (a sketch of this write order appears after this list). The details are as follows:

   - Accumulate the total number of allocated chunks in the header (PageCompressHeader.allocated_chunks)
   - Write the chunkno array in the address entry of the data block (PageCompressAddr.chunknos)
   - Write the number of allocated chunks in the address entry of the data block (PageCompressAddr.nchunks)
   - Update the global number of blocks in the header (PageCompressHeader.nblocks)

typedef struct PageCompressHeader
{
    pg_atomic_uint32  nblocks;                       /* number of total blocks in this segment */
    pg_atomic_uint32  allocated_chunks;              /* number of total allocated chunks in data area */
    uint16            chunk_size;                    /* size of each chunk, must be 1/2, 1/4 or 1/8 of BLCKSZ */
    uint8             algorithm;                     /* compress algorithm, 1=pglz, 2=lz4 */
    pg_atomic_uint32  last_synced_nblocks;           /* last synced nblocks */
    pg_atomic_uint32  last_synced_allocated_chunks;  /* last synced allocated_chunks */
    TimestampTz       last_recovery_start_time;      /* postmaster start time of last recovery */
} PageCompressHeader;

typedef struct PageCompressAddr
{
    volatile uint8    nchunks;            /* number of chunks for this block */
    volatile uint8    allocated_chunks;   /* number of allocated chunks for this block */

    /* variable-length field: 1-based chunk numbers for this block; array size must be 2, 4 or 8 */
    pc_chunk_number_t chunknos[FLEXIBLE_ARRAY_MEMBER];
} PageCompressAddr;

3. Once a chunk is allocated, it always belongs to a specific data block until the relation is truncated (or its tail blocks are vacuumed),
   avoiding frequent changes of address information.
4. When replaying WAL in the recovery phase after a postgres crash, check the address file of every compressed relation when it is first opened,
   and repair it if its data is inconsistent (refer to the check_and_repair_compress_address function).
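
For illustration, the fixed write order in point 2 could be sketched as follows, using the structs above and PostgreSQL's atomics. The function is simplified (no locking, flushing or prealloc handling) and is not the code from the attached patch; the point is only the ordering, which ensures that a crash at any step never leaves an address entry claiming chunks that were not reserved in the header.

static void
allocate_chunks_for_block(PageCompressHeader *header, PageCompressAddr *addr,
                          BlockNumber blkno, int chunks_needed)
{
    uint32  first_chunk;

    /* 1. reserve the chunks globally by bumping the header counter */
    first_chunk = pg_atomic_fetch_add_u32(&header->allocated_chunks, chunks_needed) + 1;

    /* 2. record the chunk numbers in this block's address entry */
    for (int i = 0; i < chunks_needed; i++)
        addr->chunknos[addr->allocated_chunks + i] = first_chunk + i;

    /* 3. only now publish the per-block chunk counts (here they are set equal
     *    for simplicity; with pre-allocation, nchunks may be smaller than
     *    allocated_chunks) */
    addr->allocated_chunks += chunks_needed;
    addr->nchunks = addr->allocated_chunks;

    /* 4. finally update the global block count if the relation grew */
    if (blkno + 1 > pg_atomic_read_u32(&header->nblocks))
        pg_atomic_write_u32(&header->nblocks, blkno + 1);
}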


# 4. Problem

- When compress_chunk_size=1024, about 4MB of space is needed to store the addresses,
  which causes small files to become larger after compression.
  Therefore, enabling compression for small tables should be avoided.
- The zstd library needs to be installed separately. Could the zstd source code be copied into postgres?


# 5. TODO list

1. docs
2. optimize code style, error message and so on 
3. more test

BTW:
If anyone thinks this Patch is valuable, hope to improve it together.


Best Regards
Chen Huajun

Attachment

Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:
Hi, hackers

I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to continue working on this patch.
If you have any suggestions, please give me your feedback.

Best Regards
Chen Huajun

Re: [Proposal] Page Compression for OLTP

From
Daniel Gustafsson
Date:
> On 16 Feb 2021, at 15:45, chenhj <chjischj@163.com> wrote:

> I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to
> continue working for this Patch.
> If you have any suggestions, please feedback to me.

It doesn't seem like the patch has been registered in the commitfest app so it
may have been forgotten about, the number of proposed patches often outnumber
the code review bandwidth.  Please register it at:

    https://commitfest.postgresql.org/32/

..to make sure it doesn't get lost.

--
Daniel Gustafsson        https://vmware.com/




Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:

At 2021-02-16 21:51:14, "Daniel Gustafsson" <daniel@yesql.se> wrote:

>> On 16 Feb 2021, at 15:45, chenhj <chjischj@163.com> wrote:
>
>> I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to continue working for this Patch. 
>> If you have any suggestions, please feedback to me.
>
>It doesn't seem like the patch has been registered in the commitfest app so it
>may have been forgotten about, the number of proposed patches often outnumber
>the code review bandwidth.  Please register it at:
>
>	https://commitfest.postgresql.org/32/
>
>..to make sure it doesn't get lost.
>
>--
>Daniel Gustafsson https://vmware.com/

Thanks, I will complete this patch and register it later.
Chen Huajun

Re: [Proposal] Page Compression for OLTP

From
David Fetter
Date:
On Tue, Feb 16, 2021 at 11:15:36PM +0800, chenhj wrote:
> At 2021-02-16 21:51:14, "Daniel Gustafsson" <daniel@yesql.se> wrote:
> 
> >> On 16 Feb 2021, at 15:45, chenhj <chjischj@163.com> wrote:
> >
> >> I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to
> >> continue working for this Patch.
> >> If you have any suggestions, please feedback to me.
> >
> >It doesn't seem like the patch has been registered in the commitfest app so it
> >may have been forgotten about, the number of proposed patches often outnumber
> >the code review bandwidth.  Please register it at:
> >
> >    https://commitfest.postgresql.org/32/
> >
> >..to make sure it doesn't get lost.
> >
> >--
> 
> >Daniel Gustafsson        https://vmware.com/
> 
> 
> Thanks, I will complete this patch and registered it later.
> Chen Huajun

The simplest way forward is to register it now so it doesn't miss the
window for the upcoming commitfest (CF), which closes at the end of
this month. That way, everybody has the entire time between now and
the end of the CF to review the patch, work on it, etc, and the CF bot
will be testing it against the changing code base to ensure people
know if such a change causes it to need a rebase.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [Proposal] Page Compression for OLTP

From
chenhj
Date:

Hi hackers,

I have rebased this patch and made some improvements.


1. A header is added to each chunk in the pcd file, which records which block the chunk belongs to and the checksum of the chunk
   (a sketch of such a header appears after this list).

  Accordingly, all pages in a compressed relation are stored in compressed format, even if the compressed page is larger than BLCKSZ.

  The maximum space occupied by a compressed page is BLCKSZ + chunk_size (exceeding this range will report an error when writing the page).

2. Repair the pca file through the information recorded in the pcd when recovering from a crash

3. For compressed relations, do not release the free blocks at the end of the relation (just like what old_snapshot_threshold does), reducing the risk of data inconsistency between the pcd and pca files.

4. During backup, only check the checksum in the chunk header for the pcd file, and avoid assembling and decompressing chunks into the original page.

5. bugfix, doc, code style and so on
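
Purely to illustrate point 1 above, a per-chunk header along these lines would let crash recovery rebuild the pca file (point 2) and let backups verify chunks without decompressing them (point 4). The names and field widths below are mine, not the struct from the attached patch.

#include <stdint.h>

/*
 * Hypothetical per-chunk header stored at the start of every chunk in the
 * pcd file: it records which block the chunk belongs to, the chunk's
 * position within that block, and a checksum of the chunk.  Illustrative
 * only; the real layout is defined in the attached patch.
 */
typedef struct PageCompressChunkHeader
{
    uint32_t blockno;   /* heap block this chunk belongs to */
    uint8_t  seqno;     /* position of this chunk within the block's chunk list */
    uint8_t  nchunks;   /* total number of chunks of the block, for reassembly */
    uint16_t checksum;  /* checksum of the chunk contents */
} PageCompressChunkHeader;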


See src/backend/storage/smgr/README.compression for details.


Other

1. Removed support for default compression options on tablespaces; I'm not sure about the necessity of this feature, so it is not supported for now.

2. pg_rewind currently does not support copying only the changed blocks from the pcd file. This feature is relatively independent and could be implemented later.


Best Regards

Chen Huajun

At 2021-02-18 23:12:57, "David Fetter" <david@fetter.org> wrote:
>On Tue, Feb 16, 2021 at 11:15:36PM +0800, chenhj wrote:
>> At 2021-02-16 21:51:14, "Daniel Gustafsson" <daniel@yesql.se> wrote:
>> 
>> >> On 16 Feb 2021, at 15:45, chenhj <chjischj@163.com> wrote:
>> >
>> >> I want to know whether this patch can be accepted by the community, that is, whether it is necessary for me to continue working for this Patch. 
>> >> If you have any suggestions, please feedback to me.
>> >
>> >It doesn't seem like the patch has been registered in the commitfest app so it
>> >may have been forgotten about, the number of proposed patches often outnumber
>> >the code review bandwidth.  Please register it at:
>> >
>> >	https://commitfest.postgresql.org/32/
>> >
>> >..to make sure it doesn't get lost.
>> >
>> >--
>> 
>> >Daniel Gustafsson		https://vmware.com/
>> 
>> 
>> Thanks, I will complete this patch and registered it later.
>> Chen Huajun
>
>The simplest way forward is to register it now so it doesn't miss the
>window for the upcoming commitfest (CF), which closes at the end of
>this month. That way, everybody has the entire time between now and
>the end of the CF to review the patch, work on it, etc, and the CF bot
>will be testing it against the changing code base to ensure people
>know if such a change causes it to need a rebase.
>
>Best,
>David.
>-- 
>David Fetter <david(at)fetter(dot)org> http://fetter.org/
>Phone: +1 415 235 3778
>
>Remember to vote!
>Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment

Re: [Proposal] Page Compression for OLTP

From
Ian Lawrence Barwick
Date:
2022年7月27日(水) 2:47 chenhj <chjischj@163.com>:
>
> Hi hackers,
>
> I have rebase this patch and made some improvements.
>
>
> 1. A header is added to each chunk in the pcd file, which records the chunk of which block the chunk belongs to, and
> the checksum of the chunk.
>
>   Accordingly, all pages in a compressed relation are stored in compressed format, even if the compressed page is
> larger than BLCKSZ.
>
>   The maximum space occupied by a compressed page is BLCKSZ + chunk_size (exceeding this range will report an error
> when writing the page).
>
> 2. Repair the pca file through the information recorded in the pcd when recovering from a crash
>
> 3. For compressed relation, do not release the free blocks at the end of the relation (just like what
> old_snapshot_threshold does), reducing the risk of data inconsistency between pcd and pca file.
>
> 4. During backup, only check the checksum in the chunk header for the pcd file, and avoid assembling and
> decompressing chunks into the original page.
>
> 5. bugfix, doc, code style and so on
>
>
> And see src/backend/storage/smgr/README.compression for detail
>
>
> Other
>
> 1. remove support of default compression option in tablespace, I'm not sure about the necessity of this feature, so
> don't support it for now.
>
> 2. pg_rewind currently does not support copying only changed blocks from pcd file. This feature is relatively
> independent and could be implemented later.

Hi

cfbot reports the patch no longer applies.  As CommitFest 2022-11 is
currently underway, this would be an excellent time to update the patch.

Thanks

Ian Barwick



Re: [Proposal] Page Compression for OLTP

From
vignesh C
Date:
On Fri, 4 Nov 2022 at 07:02, Ian Lawrence Barwick <barwick@gmail.com> wrote:
>
> 2022年7月27日(水) 2:47 chenhj <chjischj@163.com>:
> >
> > Hi hackers,
> >
> > I have rebase this patch and made some improvements.
> >
> >
> > 1. A header is added to each chunk in the pcd file, which records the chunk of which block the chunk belongs to,
> > and the checksum of the chunk.
> >
> >   Accordingly, all pages in a compressed relation are stored in compressed format, even if the compressed page is
> > larger than BLCKSZ.
> >
> >   The maximum space occupied by a compressed page is BLCKSZ + chunk_size (exceeding this range will report an error
> > when writing the page).
> >
> > 2. Repair the pca file through the information recorded in the pcd when recovering from a crash
> >
> > 3. For compressed relation, do not release the free blocks at the end of the relation (just like what
> > old_snapshot_threshold does), reducing the risk of data inconsistency between pcd and pca file.
> >
> > 4. During backup, only check the checksum in the chunk header for the pcd file, and avoid assembling and
> > decompressing chunks into the original page.
> >
> > 5. bugfix, doc, code style and so on
> >
> >
> > And see src/backend/storage/smgr/README.compression for detail
> >
> >
> > Other
> >
> > 1. remove support of default compression option in tablespace, I'm not sure about the necessity of this feature, so
> > don't support it for now.
> >
> > 2. pg_rewind currently does not support copying only changed blocks from pcd file. This feature is relatively
> > independent and could be implemented later.
>
> Hi
>
> cfbot reports the patch no longer applies.  As CommitFest 2022-11 is
> currently underway, this would be an excellent time to update the patch.

There have been no updates on this thread for some time, so it has
been marked as Returned with Feedback. Feel free to open it in the
next commitfest if you plan to continue working on this.

Regards,
Vignesh