Thread: Some questions about PostgreSQL’s design.
I’ve recently started exploring the PostgreSQL implementation. I used
to be a MySQL InnoDB developer, and I find the PostgreSQL community
feels a bit strange.

There are some areas where they’ve done really well, but there are
also some obvious issues that haven’t been improved.

For example, the B-link tree implementation in PostgreSQL is
particularly elegant, and the code is very clean. But there are some
clear areas that could be improved but haven’t been addressed, like
the double memory problem where the buffer pool and page cache store
the same page, or using full-page writes to deal with torn page
writes instead of something like InnoDB’s double write buffer.

It seems like these issues have clear solutions, such as using
DirectIO like InnoDB instead of buffered IO, or using a double write
buffer instead of relying on the full-page write approach. Can anyone
explain why?

However, the PostgreSQL community’s mailing list is truly a treasure
trove, where you can find really interesting discussions. For
instance, this discussion on whether lock coupling is needed for
B-link trees:

https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com

--
---
Blog: https://baotiao.github.io/
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
On Tue, Aug 20, 2024 at 8:55 PM 陈宗志 <baotiao@gmail.com> wrote:
> It seems like these issues have clear solutions, such as using
> DirectIO like InnoDB instead of buffered IO,

For this part: we recently added an experimental option to use direct
I/O (debug_io_direct). We are working on the infrastructure needed to
make it work efficiently before removing the "debug_" prefix:
prediction of future I/O through a "stream" abstraction, which we have
some early pieces of already; I/O combining (see the new
io_combine_limit setting); and asynchronous I/O (work in progress,
basically I/O worker processes, io_uring, or other OS-specific APIs).
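As a rough illustration of the I/O combining idea behind the
io_combine_limit setting mentioned above (this is a conceptual sketch,
not PostgreSQL code; the function name, block numbers, and the limit
of 4 blocks are all made up for illustration):

```python
def combine_reads(block_numbers, combine_limit=4):
    """Merge runs of consecutive block numbers into (start, count)
    I/Os, never exceeding combine_limit blocks per combined I/O."""
    ios = []
    for blk in sorted(block_numbers):
        if ios:
            start, count = ios[-1]
            # Extend the previous I/O if this block is adjacent and
            # the combined size stays within the limit.
            if blk == start + count and count < combine_limit:
                ios[-1] = (start, count + 1)
                continue
        ios.append((blk, 1))
    return ios

print(combine_reads([7, 3, 4, 5, 6, 10]))  # [(3, 4), (7, 1), (10, 1)]
```

Six single-block reads become three system calls here; with direct
I/O, where every syscall really hits the device, that kind of merging
has to happen above the kernel.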
On 20/08/2024 11:46, 陈宗志 wrote:
> I’ve recently started exploring the PostgreSQL implementation. I used
> to be a MySQL InnoDB developer, and I find the PostgreSQL community
> feels a bit strange.
>
> There are some areas where they’ve done really well, but there are
> also some obvious issues that haven’t been improved.
>
> For example, the B-link tree implementation in PostgreSQL is
> particularly elegant, and the code is very clean.
> But there are some clear areas that could be improved but haven’t
> been addressed, like the double memory problem where the buffer pool
> and page cache store the same page, or using full-page writes to deal
> with torn page writes instead of something like InnoDB’s double write
> buffer.
>
> It seems like these issues have clear solutions, such as using
> DirectIO like InnoDB instead of buffered IO, or using a double write
> buffer instead of relying on the full-page write approach.
> Can anyone explain why?

There are pros and cons. With direct I/O, you cannot take advantage of
the kernel page cache anymore, so it becomes important to tune
shared_buffers more precisely. That's a downside: the system requires
more tuning. For many applications, squeezing out the last ounce of
performance just isn't that important. There are also scaling issues
with the Postgres buffer cache, which might need to be addressed first.

With double write buffering, there are also pros and cons. It also
requires careful tuning. And replaying WAL that contains full-page
images can be much faster, because you can write new page images
"blindly" without reading the old pages first. We have WAL prefetching
now, which alleviates that, but it's no panacea.

In summary, those are good solutions, but they're not obviously better
in all circumstances.

> However, the PostgreSQL community’s mailing list is truly a treasure
> trove, where you can find really interesting discussions. For
> instance, this discussion on whether lock coupling is needed for
> B-link trees, etc.
> https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com

Yep, there are old threads and patches for double write buffers and
direct IO too :-).

--
Heikki Linnakangas
Neon (https://neon.tech)
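To make the "blind write" point above concrete, here is a toy model of
recovery (deliberately simplified; it is not how PostgreSQL's xlog
replay is actually structured, and all names are illustrative): a WAL
record carrying a full-page image simply overwrites the on-disk page,
so even a torn page never needs to be read, while an incremental
record must read-modify-write.

```python
PAGE_SIZE = 8  # toy page size, in bytes

def recover(disk, wal):
    """Apply WAL records to a dict of page-number -> list-of-bytes.
    A record with a full-page image replaces the on-disk page without
    reading it first; an incremental record must read the old page."""
    for rec in wal:
        if 'full_page_image' in rec:
            disk[rec['page']] = rec['full_page_image'][:]  # blind write
        else:
            page = disk[rec['page']]           # must read the old page
            page[rec['offset']] = rec['byte']
    return disk

# A crash mid-write left page 0 torn: first half old data, second half
# garbage. Recovery restores it from the full-page image, then applies
# a later incremental change on top.
disk = {0: [0xAA] * 4 + [0x00] * 4}
wal = [{'page': 0, 'full_page_image': [0xAA] * PAGE_SIZE},
       {'page': 0, 'offset': 1, 'byte': 0xBB}]
print(recover(disk, wal)[0])  # torn page fully repaired, then updated
```

The doublewrite-buffer alternative would instead verify and repair the
torn page from a separate staging area before replaying ordinary,
non-full-page WAL records against it.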
On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:
> There are pros and cons. With direct I/O, you cannot take advantage of the
> kernel page cache anymore, so it becomes important to tune shared_buffers
> more precisely. That's a downside: the system requires more tuning. For many
> applications, squeezing the last ounce of performance just isn't that
> important. There are also scaling issues with the Postgres buffer cache,
> which might need to be addressed first.
>
> With double write buffering, there are also pros and cons. It also requires
> careful tuning. And replaying WAL that contains full-page images can be much
> faster, because you can write new page images "blindly" without reading the
> old pages first. We have WAL prefetching now, which alleviates that, but
> it's no panacea.

陈宗志, you might find this blog post helpful:

https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017

--
Bruce Momjian  <bruce@momjian.us>        https://momjian.us
EDB                                      https://enterprisedb.com

Only you can decide what is important to you.
For other design choices, such as whether to manage shared_buffers
with an LRU list or with a clock sweep, both methods have their pros
and cons. But for these two issues, there is a clearly better
solution. For example, using DirectIO avoids double-copying data, and
the OS’s page cache LRU is optimized for general workloads, while the
database kernel should use its own eviction algorithm.

Regarding the other issue, full-page writes don’t actually reduce the
number of page reads; it’s just a matter of whether those reads come
from the data files or from the redo log, so the amount of data read
is essentially the same. The problem they introduce, however, is
significant write amplification on the critical write path, which
severely impacts performance. As a result, PostgreSQL has to minimize
the frequency of checkpoints as much as possible.

I think someone could write a demo to show this.

On Tue, Aug 20, 2024 at 9:46 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 20/08/2024 11:46, 陈宗志 wrote:
> > I’ve recently started exploring the PostgreSQL implementation. I used
> > to be a MySQL InnoDB developer, and I find the PostgreSQL community
> > feels a bit strange.
> >
> > There are some areas where they’ve done really well, but there are
> > also some obvious issues that haven’t been improved.
> >
> > For example, the B-link tree implementation in PostgreSQL is
> > particularly elegant, and the code is very clean.
> > But there are some clear areas that could be improved but haven’t
> > been addressed, like the double memory problem where the buffer pool
> > and page cache store the same page, or using full-page writes to
> > deal with torn page writes instead of something like InnoDB’s double
> > write buffer.
> >
> > It seems like these issues have clear solutions, such as using
> > DirectIO like InnoDB instead of buffered IO, or using a double write
> > buffer instead of relying on the full-page write approach.
> > Can anyone explain why?
>
> There are pros and cons. With direct I/O, you cannot take advantage of
> the kernel page cache anymore, so it becomes important to tune
> shared_buffers more precisely. That's a downside: the system requires
> more tuning. For many applications, squeezing the last ounce of
> performance just isn't that important. There are also scaling issues
> with the Postgres buffer cache, which might need to be addressed first.
>
> With double write buffering, there are also pros and cons. It also
> requires careful tuning. And replaying WAL that contains full-page
> images can be much faster, because you can write new page images
> "blindly" without reading the old pages first. We have WAL prefetching
> now, which alleviates that, but it's no panacea.
>
> In summary, those are good solutions but they're not obviously better in
> all circumstances.
>
> > However, the PostgreSQL community’s mailing list is truly a treasure
> > trove, where you can find really interesting discussions. For
> > instance, this discussion on whether lock coupling is needed for
> > B-link trees, etc.
> > https://www.postgresql.org/message-id/flat/CALJbhHPiudj4usf6JF7wuCB81fB7SbNAeyG616k%2Bm9G0vffrYw%40mail.gmail.com
>
> Yep, there are old threads and patches for double write buffers and
> direct IO too :-).
>
> --
> Heikki Linnakangas
> Neon (https://neon.tech)
>

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
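For readers unfamiliar with the clock sweep mentioned above, the core
of the algorithm can be sketched in a few lines (a conceptual toy, not
PostgreSQL's actual freelist.c; the pool size and usage counts are
made up): each buffer carries a usage count, and a "hand" circles the
pool, decrementing counts and evicting the first buffer it finds at
zero.

```python
def clock_sweep(usage_counts, hand):
    """Return (victim_index, new_hand) for a buffer pool described by
    per-buffer usage counts. Decrements counts as the hand passes,
    giving recently used buffers a second chance."""
    n = len(usage_counts)
    while True:
        if usage_counts[hand] == 0:
            return hand, (hand + 1) % n      # evict this buffer
        usage_counts[hand] -= 1              # second chance
        hand = (hand + 1) % n

counts = [2, 0, 1, 3]
victim, hand = clock_sweep(counts, 0)
print(victim)  # 1: the first buffer found with a usage count of zero
```

Unlike a strict LRU list, no shared list has to be reordered on every
buffer hit, which is why the trade-off between the two is contested
rather than one-sided.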
I disagree with the point made in the article. The article mentions
that it ‘prevents the kernel from reordering reads and writes to
optimize performance’, which might be referring to the file system’s
IO scheduling and merging. However, this can be handled within the
database itself, where IO scheduling and merging can be done even
better.

Regarding ‘does not allow free memory to be used as kernel cache’, I
believe the database itself should manage memory well: most of the
memory should be managed by the database rather than handed over to
the operating system, and the database’s use of the page cache should
be restricted.

On Wed, Aug 21, 2024 at 12:55 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Tue, Aug 20, 2024 at 04:46:54PM +0300, Heikki Linnakangas wrote:
> > There are pros and cons. With direct I/O, you cannot take advantage of the
> > kernel page cache anymore, so it becomes important to tune shared_buffers
> > more precisely. That's a downside: the system requires more tuning. For many
> > applications, squeezing the last ounce of performance just isn't that
> > important. There are also scaling issues with the Postgres buffer cache,
> > which might need to be addressed first.
> >
> > With double write buffering, there are also pros and cons. It also requires
> > careful tuning. And replaying WAL that contains full-page images can be much
> > faster, because you can write new page images "blindly" without reading the
> > old pages first. We have WAL prefetching now, which alleviates that, but
> > it's no panacea.
>
> 陈宗志, you might find this blog post helpful:
>
> https://momjian.us/main/blogs/pgblog/2017.html#June_5_2017
>
> --
> Bruce Momjian  <bruce@momjian.us>        https://momjian.us
> EDB                                      https://enterprisedb.com
>
> Only you can decide what is important to you.

--
---
Blog: http://www.chenzongzhi.info
Twitter: https://twitter.com/baotiao
Git: https://github.com/baotiao
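One simple form of the database-side write scheduling argued for
above: sort dirty buffers by file and block number before flushing, so
a checkpoint issues writes sequentially even when the kernel no longer
reorders them. This is only a sketch of the idea under made-up names,
not a PostgreSQL API:

```python
def schedule_writes(dirty_buffers):
    """Order dirty buffers by (file, block) so each file is written
    front to back, approximating an elevator pass over the disk."""
    return sorted(dirty_buffers, key=lambda b: (b['file'], b['block']))

dirty = [{'file': 'rel_a', 'block': 9},
         {'file': 'rel_b', 'block': 1},
         {'file': 'rel_a', 'block': 2}]
order = schedule_writes(dirty)
print([(b['file'], b['block']) for b in order])
# [('rel_a', 2), ('rel_a', 9), ('rel_b', 1)]
```

The database can go further than the kernel here because it also knows
which writes are latency-critical (WAL flushes) and which are
background work (checkpointer, bgwriter), information the block layer
never sees.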
On 8/22/24 10:50 AM, 陈宗志 wrote:
> I disagree with the point made in the article. The article mentions
> that it ‘prevents the kernel from reordering reads and writes to
> optimize performance’, which might be referring to the file system’s
> IO scheduling and merging. However, this can be handled within the
> database itself, where IO scheduling and merging can be done even
> better.

The database does not have all the information that the OS has, but
that said, I suspect that the advantages of direct IO outweigh the
disadvantages in this regard. But the only way to know for sure would
be for someone to provide a benchmark.

> Regarding ‘does not allow free memory to be used as kernel cache’, I
> believe the database itself should manage memory well: most of the
> memory should be managed by the database rather than handed over to
> the operating system, and the database’s use of the page cache should
> be restricted.

That all depends on your use case. If the database is running alone,
or almost alone, on a machine, direct IO is likely the optimal
strategy, but if more services are running on the same machine (e.g.
if you run PostgreSQL on your personal laptop) you want to use
buffered IO.

But as far as I know, the long-term plan of the async IO project is to
support both direct and buffered IO, so people can pick the right
choice for their workload.

Andreas