Thread: Some questions about PostgreSQL’s design.
I’ve recently started exploring the PostgreSQL implementation. I used
to be a MySQL InnoDB developer, and I find the PostgreSQL community
feels a bit strange.

There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.

For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean. But there are some
clear areas that could be improved but haven’t been addressed, like
the double memory problem where the buffer pool and page cache store
the same page, or using full-page writes to deal with torn page
writes instead of something like InnoDB’s double write buffer.

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach. Can anyone
explain why?

However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, this discussion on whether lock coupling is needed for
B-link trees:

https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com

--
---
Blog: https://baotiao.github.io/
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
On Tue, Aug 20, 2024 at 8:55 PM 陈宗志 <baotiao@gmail.com> wrote:
> It seems like these issues have clear solutions, such as using
> DirectIO like InnoDB instead of buffered IO,

For this part: we recently added an experimental option to use direct
I/O (debug_io_direct). We are working on the infrastructure needed to
make it work efficiently before removing the "debug_" prefix:
prediction of future I/O through a "stream" abstraction, which we have
some early pieces of already; I/O combining (see the new
io_combine_limit setting); and asynchronous I/O (work in progress,
basically I/O worker processes, io_uring, or other OS-specific APIs).
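As a rough illustration of the I/O combining idea behind the
io_combine_limit setting mentioned above (this is a conceptual sketch,
not PostgreSQL code; the function name, block numbers, and the limit
of 4 blocks are all made up for illustration):

```python
def combine_reads(block_numbers, combine_limit=4):
    """Merge runs of consecutive block numbers into (start, count)
    I/Os, never exceeding combine_limit blocks per combined I/O."""
    ios = []
    for blk in sorted(block_numbers):
        if ios:
            start, count = ios[-1]
            # Extend the previous I/O if this block is adjacent and
            # the combined size stays within the limit.
            if blk == start + count and count < combine_limit:
                ios[-1] = (start, count + 1)
                continue
        ios.append((blk, 1))
    return ios

print(combine_reads([7, 3, 4, 5, 6, 10]))  # [(3, 4), (7, 1), (10, 1)]
```

Six single-block reads become three system calls here; with direct
I/O, where every syscall really hits the device, that kind of merging
has to happen above the kernel.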
On 20/08/2024 11:46, 陈宗志 wrote:
> I’ve recently started exploring the PostgreSQL implementation. I used
> to be a MySQL InnoDB developer, and I find the PostgreSQL community
> feels a bit strange.
>
> There are some areas where they’ve done really well, but there are
> also some obvious issues that haven’t been improved.
>
> For example, the B-link tree implementation in PostgreSQL is
> particularly elegant, and the code is very clean.
> But there are some clear areas that could be improved but haven’t
> been addressed, like the double memory problem where the buffer pool
> and page cache store the same page, or using full-page writes to deal
> with torn page writes instead of something like InnoDB’s double write
> buffer.
>
> It seems like these issues have clear solutions, such as using
> DirectIO like InnoDB instead of buffered IO, or using a double write
> buffer instead of relying on the full-page write approach.
> Can anyone explain why?

There are pros and cons. With direct I/O, you cannot take advantage of
the kernel page cache anymore, so it becomes important to tune
shared_buffers more precisely. That's a downside: the system requires
more tuning. For many applications, squeezing out the last ounce of
performance just isn't that important. There are also scaling issues
with the Postgres buffer cache, which might need to be addressed first.

With double write buffering, there are also pros and cons. It also
requires careful tuning. And replaying WAL that contains full-page
images can be much faster, because you can write new page images
"blindly" without reading the old pages first. We have WAL prefetching
now, which alleviates that, but it's no panacea.

In summary, those are good solutions, but they're not obviously better
in all circumstances.

> However, the PostgreSQL community’s mailing list is truly a treasure
> trove, where you can find really interesting discussions. For
> instance, this discussion on whether lock coupling is needed for
> B-link trees, etc.
> https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com

Yep, there are old threads and patches for double write buffers and
direct IO too :-).

--
Heikki Linnakangas
Neon (https://neon.tech)
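To make the "blind write" point above concrete, here is a toy model of
recovery (deliberately simplified; it is not how PostgreSQL's xlog
replay is actually structured, and all names are illustrative): a WAL
record carrying a full-page image simply overwrites the on-disk page,
so even a torn page never needs to be read, while an incremental
record must read-modify-write.

```python
PAGE_SIZE = 8  # toy page size, in bytes

def recover(disk, wal):
    """Apply WAL records to a dict of page-number -> list-of-bytes.
    A record with a full-page image replaces the on-disk page without
    reading it first; an incremental record must read the old page."""
    for rec in wal:
        if 'full_page_image' in rec:
            disk[rec['page']] = rec['full_page_image'][:]  # blind write
        else:
            page = disk[rec['page']]           # must read the old page
            page[rec['offset']] = rec['byte']
    return disk

# A crash mid-write left page 0 torn: first half old data, second half
# garbage. Recovery restores it from the full-page image, then applies
# a later incremental change on top.
disk = {0: [0xAA] * 4 + [0x00] * 4}
wal = [{'page': 0, 'full_page_image': [0xAA] * PAGE_SIZE},
       {'page': 0, 'offset': 1, 'byte': 0xBB}]
print(recover(disk, wal)[0])  # torn page fully repaired, then updated
```

The doublewrite-buffer alternative would instead verify and repair the
torn page from a separate staging area before replaying ordinary,
non-full-page WAL records against it.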
On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:
> There are pros and cons. With direct I/O, you cannot take advantage of the
> kernel page cache anymore, so it becomes important to tune shared_buffers
> more precisely. That's a downside: the system requires more tuning. For many
> applications, squeezing the last ounce of performance just isn't that
> important. There are also scaling issues with the Postgres buffer cache,
> which might need to be addressed first.
>
> With double write buffering, there are also pros and cons. It also requires
> careful tuning. And replaying WAL that contains full-page images can be much
> faster, because you can write new page images "blindly" without reading the
> old pages first. We have WAL prefetching now, which alleviates that, but
> it's no panacea.

陈宗志, you might find this blog post helpful:

https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017

--
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

Only you can decide what is important to you.
For other design choices, such as whether to manage shared_buffers
with an LRU list or with a clock sweep, both methods have their pros
and cons. But for these two issues, there is a clearly better
solution. For example, using DirectIO avoids double-copying data, and
the OS’s page cache LRU is optimized for general workloads, while the
database kernel should use its own eviction algorithm.

Regarding the other issue, full-page writes don’t actually reduce the
number of page reads; it’s just a matter of whether those reads come
from the data files or from the redo log, so the amount of data read
is essentially the same. The problem they introduce, however, is
significant write amplification on the critical write path, which
severely impacts performance. As a result, PostgreSQL has to minimize
the frequency of checkpoints as much as possible.

I think someone could write a demo to show this.

On Tue, Aug 20, 2024 at 9:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 20/08/2024 11:46, 陈宗志 wrote:
> > I’ve recently started exploring the PostgreSQL implementation. I used
> > to be a MySQL InnoDB developer, and I find the PostgreSQL community
> > feels a bit strange.
> >
> > There are some areas where they’ve done really well, but there are
> > also some obvious issues that haven’t been improved.
> >
> > For example, the B-link tree implementation in PostgreSQL is
> > particularly elegant, and the code is very clean.
> > But there are some clear areas that could be improved but haven’t
> > been addressed, like the double memory problem where the buffer pool
> > and page cache store the same page, or using full-page writes to
> > deal with torn page writes instead of something like InnoDB’s double
> > write buffer.
> >
> > It seems like these issues have clear solutions, such as using
> > DirectIO like InnoDB instead of buffered IO, or using a double write
> > buffer instead of relying on the full-page write approach.
> > Can anyone explain why?
>
> There are pros and cons. With direct I/O, you cannot take advantage of
> the kernel page cache anymore, so it becomes important to tune
> shared_buffers more precisely. That's a downside: the system requires
> more tuning. For many applications, squeezing the last ounce of
> performance just isn't that important. There are also scaling issues
> with the Postgres buffer cache, which might need to be addressed first.
>
> With double write buffering, there are also pros and cons. It also
> requires careful tuning. And replaying WAL that contains full-page
> images can be much faster, because you can write new page images
> "blindly" without reading the old pages first. We have WAL prefetching
> now, which alleviates that, but it's no panacea.
>
> In summary, those are good solutions but they're not obviously better in
> all circumstances.
>
> > However, the PostgreSQL community’s mailing list is truly a treasure
> > trove, where you can find really interesting discussions. For
> > instance, this discussion on whether lock coupling is needed for
> > B-link trees, etc.
> > https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com
>
> Yep, there are old threads and patches for double write buffers and
> direct IO too :-).
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
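For readers unfamiliar with the clock sweep mentioned above, the core
of the algorithm can be sketched in a few lines (a conceptual toy, not
PostgreSQL's actual freelist.c; the pool size and usage counts are
made up): each buffer carries a usage count, and a "hand" circles the
pool, decrementing counts and evicting the first buffer it finds at
zero.

```python
def clock_sweep(usage_counts, hand):
    """Return (victim_index, new_hand) for a buffer pool described by
    per-buffer usage counts. Decrements counts as the hand passes,
    giving recently used buffers a second chance."""
    n = len(usage_counts)
    while True:
        if usage_counts[hand] == 0:
            return hand, (hand + 1) % n      # evict this buffer
        usage_counts[hand] -= 1              # second chance
        hand = (hand + 1) % n

counts = [2, 0, 1, 3]
victim, hand = clock_sweep(counts, 0)
print(victim)  # 1: the first buffer found with a usage count of zero
```

Unlike a strict LRU list, no shared list has to be reordered on every
buffer hit, which is why the trade-off between the two is contested
rather than one-sided.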
I disagree with the point made in the article. The article mentions
that it ‘prevents the kernel from reordering reads and writes to
optimize performance’, which might be referring to the file system’s
IO scheduling and merging. However, this can be handled within the
database itself, where IO scheduling and merging can be done even
better.

Regarding ‘does not allow free memory to be used as kernel cache’, I
believe the database itself should manage memory well: most of the
memory should be managed by the database rather than handed over to
the operating system, and the database’s use of the page cache should
be restricted.

On Wed, Aug 21, 2024 at 12:55 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:
> > There are pros and cons. With direct I/O, you cannot take advantage of the
> > kernel page cache anymore, so it becomes important to tune shared_buffers
> > more precisely. That's a downside: the system requires more tuning. For many
> > applications, squeezing the last ounce of performance just isn't that
> > important. There are also scaling issues with the Postgres buffer cache,
> > which might need to be addressed first.
> >
> > With double write buffering, there are also pros and cons. It also requires
> > careful tuning. And replaying WAL that contains full-page images can be much
> > faster, because you can write new page images "blindly" without reading the
> > old pages first. We have WAL prefetching now, which alleviates that, but
> > it's no panacea.
>
> 陈宗志, you might find this blog post helpful:
>
> https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017
>
> --
> Bruce Momjian  <bruce@momjian.us>        https://momjian.us
> EDB                                      https://enterprisedb.com
>
> Only you can decide what is important to you.

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
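One simple form of the database-side write scheduling argued for
above: sort dirty buffers by file and block number before flushing, so
a checkpoint issues writes sequentially even when the kernel no longer
reorders them. This is only a sketch of the idea under made-up names,
not a PostgreSQL API:

```python
def schedule_writes(dirty_buffers):
    """Order dirty buffers by (file, block) so each file is written
    front to back, approximating an elevator pass over the disk."""
    return sorted(dirty_buffers, key=lambda b: (b['file'], b['block']))

dirty = [{'file': 'rel_a', 'block': 9},
         {'file': 'rel_b', 'block': 1},
         {'file': 'rel_a', 'block': 2}]
order = schedule_writes(dirty)
print([(b['file'], b['block']) for b in order])
# [('rel_a', 2), ('rel_a', 9), ('rel_b', 1)]
```

The database can go further than the kernel here because it also knows
which writes are latency-critical (WAL flushes) and which are
background work (checkpointer, bgwriter), information the block layer
never sees.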
On 8/22/24 10:50 AM, 陈宗志 wrote:
> I disagree with the point made in the article. The article mentions
> that it ‘prevents the kernel from reordering reads and writes to
> optimize performance’, which might be referring to the file system’s
> IO scheduling and merging. However, this can be handled within the
> database itself, where IO scheduling and merging can be done even
> better.

The database does not have all the information that the OS has, but
that said, I suspect that the advantages of direct IO outweigh the
disadvantages in this regard. But the only way to know for sure would
be for someone to provide a benchmark.

> Regarding ‘does not allow free memory to be used as kernel cache’, I
> believe the database itself should manage memory well: most of the
> memory should be managed by the database rather than handed over to
> the operating system, and the database’s use of the page cache should
> be restricted.

That all depends on your use case. If the database is running alone,
or almost alone, on a machine, direct IO is likely the optimal
strategy, but if more services are running on the same machine (e.g.
if you run PostgreSQL on your personal laptop) you want to use
buffered IO.

But as far as I know, the long-term plan of the async IO project is to
support both direct and buffered IO, so people can pick the right
choice for their workload.

Andreas