32.1. Why database compression may be useful #
Databases are used to store larger number of text and duplicated information. This is why compression of most of databases can be quite efficient and reduce used storage size 3..5 times. Postgres Pro Enterprise performs compression of TOAST data, but small text fields which fits in the page are not compressed. Also not only heap pages can be compressed, indexes on text keys or indexes with larger number of duplicate values are also good candidates for compression.
Postgres Pro Enterprise is working with disk data through buffer pool which accumulates most frequently used buffers. Interface between buffer manager and file system is the most natural place for performing compression. Buffers are stored on the disk in compressed form for reducing disk usage and minimizing amount of data to be read. And in-memory buffer pool contains uncompressed buffers, providing access to the records at the same speed as without compression. As far as modern server have large enough size of RAM, substantial part of the database can be cached in memory and accessed without any compression overhead penalty.
Except obvious advantage: saving disk space, compression can also improve system performance. There are two main reasons for it:
- Reducing amount of disk IO
Compression helps to reduce size of data which should be written to the disk or read from it. Compression ratio 3 actually means that you need to read 3 times less data or same number of records can be fetched 3 times faster.
- Improving locality
When modified buffers are flushed from buffer pool to the disk, they are written to the random locations on the disk. Postgres Pro Enterprise cache replacement algorithm makes a decision about throwing away buffer from the pool based on its access frequency and ignoring its location on the disk. So two subsequently written buffers can be located in completely different parts of the disk. For HDD seek time is quite large - about 10msec, which corresponds to 100 random writes per second. And speed of sequential write can be about 100Mb/sec, which corresponds to 10000 buffers per second (100 times faster). For SSD gap between sequential and random write speed is smaller, but still sequential writers are more efficient. How it relates to data compression? Size of buffer in Postgres Pro Enterprise is fixed (8kb by default). Size of compressed buffer depends on the content of the buffer. So updated buffer can not always fit in its old location on the disk. This is why we can not access pages directly by its address. Instead of it we have to use map which translates logical address of the page to its physical location on the disk. Definitely this extra level of indirection adds overhead. But in most cases this map can fit in memory, so page lookup is nothing more than just accessing array element. But presence of this map also have positive effect: we can now write updated pages sequentially, just updating their map entries. Postgres Pro Enterprise is doing much to avoid "write storm" intensive flushing of data to the disk when buffer pool space is exhausted. Compression allows you to significantly reduce disk load.