33.2. How compression is integrated in Postgres Pro Enterprise
To improve the efficiency of disk I/O, Postgres Pro Enterprise works with files through the buffer manager, which pins the most frequently used pages in memory. Each page has a fixed size (8kB by default). But once a page is compressed, its size depends on its content, so an updated page may require more (or less) space than the original one. This means an in-place update of the page is not always possible; instead, we have to find new space for the page and somehow release the old space. There are two main approaches to solving this problem:
- Memory allocator
We could implement our own allocator of file space. Usually, to reduce fragmentation, a fixed-size block allocator is used: space is allocated in multiples of some fixed quantum. For example, if the compressed page size is 932 bytes, then 1024 bytes are allocated for it in the file (a rounding sketch follows this list).
- Garbage collector
We can always allocate space for pages sequentially at the end of the file and periodically compact (defragment) the file, moving all used pages to its beginning. Such garbage collection can be performed in the background. As explained in the previous section, writing flushed pages sequentially can significantly increase I/O speed and thus improve performance. This is why this approach was chosen for CFS.
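The following minimal sketch shows the rounding such a fixed-quantum allocator would perform. The quantize_size function and the 1024-byte ALLOC_QUANTUM constant are hypothetical, chosen only to reproduce the example above; they are not part of CFS.

```c
#include <stdio.h>

#define ALLOC_QUANTUM 1024  /* hypothetical allocation quantum, not a CFS constant */

/* Round a compressed page size up to the nearest multiple of the quantum. */
static size_t
quantize_size(size_t compressed_size)
{
    return (compressed_size + ALLOC_QUANTUM - 1) / ALLOC_QUANTUM * ALLOC_QUANTUM;
}

int
main(void)
{
    /* The 932-byte compressed page from the text occupies one 1024-byte block. */
    printf("%zu -> %zu\n", (size_t) 932, quantize_size(932));
    return 0;
}
```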
Since the page location is not fixed and a page can be moved, we can no longer access a page directly by its address; an extra level of indirection is needed to map the logical address of the page to its physical location on disk. This is done using mapping files. In most cases the mapping is kept in memory (the map is about 1000 times smaller than the data file), so address translation adds almost no overhead to page access time. But these extra files have to be maintained: flushed during checkpoints, removed when the table is dropped, included in backups, and so on.
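The sketch below illustrates the idea of this address translation, assuming a hypothetical map layout with one 64-bit physical offset per logical page of a 2GB segment; the actual CFS map format is not reproduced here.

```c
#include <stdint.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEGMENT_PAGES 262144           /* 2GB segment / 8kB pages */

/* Hypothetical page map: one physical offset per logical page of a segment.
 * A zero offset is used here to mean "page not allocated". */
typedef struct PageMap
{
    uint64_t physical_offset[SEGMENT_PAGES];
} PageMap;

/* Map the page map file into memory so that lookups are plain array reads. */
static PageMap *
map_page_map(const char *map_path)
{
    int   fd = open(map_path, O_RDWR);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, sizeof(PageMap), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping stays valid after close */
    return (p == MAP_FAILED) ? NULL : (PageMap *) p;
}

/* Translate a logical page number into a physical offset in the data file. */
static uint64_t
page_offset(const PageMap *map, uint32_t logical_pageno)
{
    return map->physical_offset[logical_pageno];
}
```

With such a layout the map for a 2GB segment occupies about 2MB, which matches the roughly 1000:1 size ratio mentioned above.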
Postgres Pro Enterprise stores each relation in a set of files, each of which does not exceed 2GB in size. A separate page map is constructed for each file. Garbage collection in CFS is done by several background workers; the number of these workers and the pauses in their work can be configured by the database administrator. The workers split the work based on the inode hash, so they do not conflict with each other. Each file is processed separately: the file is locked for access during garbage collection, but the relation as a whole is not.

To ensure data consistency, GC writes the new version of the data and the new page map into separate backup files. Once they are flushed to disk, the new version of the data file is atomically renamed to the original file name. Then the new page map data is copied into the memory-mapped map file, and the page map backup file is removed. During recovery after a crash, we first check whether a backup of the data file exists. If it does, the original data file has not yet been replaced, so the backup files can be safely removed. If it does not, we check for a page map backup; if one is present, defragmentation of this file was interrupted by the crash, and we complete the operation by copying the map from the backup file.
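Below is a simplified sketch of the recovery side of this copy-then-rename protocol. The function and parameter names (recover_segment, data_bck, map_bck, copy_file) are assumptions made for illustration and do not mirror the actual CFS code or file naming; only the ordering of checks follows the description above.

```c
#include <stdio.h>
#include <unistd.h>

/* Copy the contents of src into dst (illustrative helper, not CFS code). */
static int
copy_file(const char *src, const char *dst)
{
    FILE  *in = fopen(src, "rb");
    FILE  *out = in ? fopen(dst, "wb") : NULL;
    char   buf[8192];
    size_t n;

    if (out == NULL)
    {
        if (in)
            fclose(in);
        return -1;
    }
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

/* Hypothetical recovery check for one segment after a crash. data_bck and
 * map_bck are the backup files that GC builds before the atomic rename. */
static void
recover_segment(const char *data_file, const char *map_file,
                const char *data_bck, const char *map_bck)
{
    (void) data_file;                  /* kept only for symmetry with the text */

    if (access(data_bck, F_OK) == 0)
    {
        /* The new data file was never renamed over the original, so the
         * interrupted GC pass can simply be discarded. */
        unlink(data_bck);
        unlink(map_bck);
    }
    else if (access(map_bck, F_OK) == 0)
    {
        /* The data file was already replaced, but the map update did not
         * finish: complete it by copying the new map from its backup. */
        if (copy_file(map_bck, map_file) == 0)
            unlink(map_bck);
    }
    /* Otherwise the previous GC pass completed normally; nothing to do. */
}
```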
By default, CFS uses the zstd compression library on Linux and zlib on Windows. For safety reasons, CFS checks that the compression algorithm used in a tablespace corresponds to this library.
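As a rough illustration only (not the CFS code path), compressing one 8kB page with the public zstd API could look like the sketch below; the choice of compression level and the assumption that an incompressible page would be stored as is are made purely for the example.

```c
#include <stdio.h>
#include <string.h>
#include <zstd.h>

#define PAGE_SIZE 8192

int
main(void)
{
    char   page[PAGE_SIZE];
    char   compressed[PAGE_SIZE];      /* only useful if the page shrinks */
    size_t rc;

    memset(page, 'x', sizeof(page));   /* stand-in for a real heap page */

    rc = ZSTD_compress(compressed, sizeof(compressed),
                       page, sizeof(page), 1 /* compression level */);
    if (ZSTD_isError(rc) || rc >= PAGE_SIZE)
        printf("page kept uncompressed\n");
    else
        printf("page compressed to %zu bytes\n", rc);
    return 0;
}
```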