Re: [PoC] Non-volatile WAL buffer - Mailing list pgsql-hackers

From Takashi Menjo
Subject Re: [PoC] Non-volatile WAL buffer
Date
Msg-id CAOwnP3OnbzYUyC5QJCHDEGPsM8zx=CiVWzHgbxgCQKM8dTH9Qg@mail.gmail.com
In response to RE: [PoC] Non-volatile WAL buffer  ("tsunakawa.takay@fujitsu.com" <tsunakawa.takay@fujitsu.com>)
List pgsql-hackers
Hi Tomas,

I'll answer your questions. (Not all of them for now, sorry.)


> Do I understand correctly that the patch removes "regular" WAL buffers and instead writes the data into the non-volatile PMEM buffer, without writing that to the WAL segments at all (unless in archiving mode)?
> Firstly, I guess many (most?) instances will have to write the WAL segments anyway because of PITR/backups, so I'm not sure we can save much here.

Mostly yes. My "non-volatile WAL buffer" patchset removes the regular volatile WAL buffers and replaces them with non-volatile ones. All WAL goes into the non-volatile buffers and persists there; no write-out from the buffers to WAL segment files is required. However, in archiving mode, or when the buffers fill up (described later), both the non-volatile buffers and the segment files are used.

In archiving mode with my patchset, each time one segment (16MB by default) is completed on the non-volatile buffers, that segment is written to a segment file asynchronously (by XLogBackgroundFlush) and then archived by the existing archiving functionality.


> But more importantly - doesn't that mean the nvwal_size value is essentially a hard limit? With max_wal_size, it's a soft limit i.e. we're allowed to temporarily use more WAL when needed. But with a pre-allocated file, that's clearly not possible. So what would happen in those cases?

Yes, nvwal_size is a hard limit, and I see it as a major weak point of my patchset.

When all the non-volatile WAL buffers are filled, the oldest segment on the buffers is written (by XLogWrite) to a regular WAL segment file, and then those buffers are cleared (by AdvanceXLInsertBuffer) for new records. All WAL record insertions into the buffers block until that write and clear complete, so all write transactions block as well.
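As a rough illustration of that behavior, consider the buffers as a fixed ring of segments where inserting into a slot still occupied by an unflushed old segment first forces a blocking flush. This is a toy model, not the patchset's actual structures; all names (`nvwal_ring`, `flush_oldest`, `insert_segment`) and sizes are illustrative:

```c
#include <assert.h>

/* Toy sketch of the buffer-full behavior described above: a fixed ring
 * of segments; inserting into a slot that still holds an unflushed old
 * segment forces a blocking "write to segment file" first.  All names
 * and sizes are illustrative, not the patchset's actual structures. */
#define NSEGS 4

struct nvwal_ring {
    int  seg_data[NSEGS];   /* stand-in for 16MB WAL segments          */
    long next;              /* next segment number to insert           */
    long flushed;           /* segment numbers below this are on disk  */
    long blocking_flushes;  /* how often insertion had to stall        */
};

static void flush_oldest(struct nvwal_ring *r)
{
    /* Stand-in for XLogWrite + AdvanceXLInsertBuffer: the oldest
     * resident segment goes to a segment file, freeing its ring slot. */
    r->flushed++;
    r->blocking_flushes++;
}

static void insert_segment(struct nvwal_ring *r, int data)
{
    if (r->next - r->flushed == NSEGS)  /* ring full: must block */
        flush_oldest(r);
    r->seg_data[r->next % NSEGS] = data;
    r->next++;
}
```

In this toy model, every insertion past the ring's capacity stalls on a flush, which is the same effect clients see as blocked write transactions.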

To make matters worse, if a checkpoint eventually occurs in such a buffer-full situation, record insertions would block for a certain time at the end of the checkpoint because a large portion of the non-volatile buffers will be cleared (see PreallocNonVolatileXlogBuffer). From a client's point of view, the postgres server would appear to freeze for a while.

Proper checkpointing would prevent such cases, but it could be hard to control. When I reproduced Gang's case reported in this thread, exactly such a buffer-full freeze occurred.


> Also, is it possible to change nvwal_size? I haven't tried, but I wonder what happens with the current contents of the file.

The value of nvwal_size should be equal to the actual size of the nvwal_path file when postgres starts up. If they are not equal, postgres will panic at MapNonVolatileXLogBuffer (see nv_xlog_buffer.c), and the WAL contents in the file will remain as they were. So, if an admin accidentally changes the nvwal_size value, postgres simply will not start.
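The startup invariant can be sketched like this. The real check lives in MapNonVolatileXLogBuffer in nv_xlog_buffer.c; `nvwal_size_matches` here is a hypothetical helper for illustration only:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

/* Illustrative sketch of the startup check described above: the
 * configured nvwal_size must match the on-disk size of the nvwal_path
 * file, or the server refuses to map it.  This helper is hypothetical,
 * not code from the patchset. */
static int nvwal_size_matches(const char *nvwal_path, uint64_t nvwal_size)
{
    struct stat st;

    if (stat(nvwal_path, &st) != 0)
        return 0;                       /* file missing: cannot start */
    return (uint64_t) st.st_size == nvwal_size;
}
```

If this check fails, nothing has been mapped or written yet, so the existing WAL contents are untouched; that is what makes an accidental nvwal_size change recoverable by simply restoring the old setting.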

The file size may be extended or shrunk offline by the truncate(1) command, but the WAL contents in the file would also have to be moved to their proper offsets, because the insertion/recovery offset is calculated by modulo, that is, the record's LSN % nvwal_size; otherwise we would lose WAL. An offline tool for such an operation might be required, but does not exist yet.
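The modulo mapping can be sketched as follows; `nvwal_offset` is a hypothetical helper for illustration, not a function from the patchset:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch (not the patchset's actual code): map a WAL
 * record's LSN to its offset within the fixed-size nvwal file, as
 * described above (offset = LSN % nvwal_size).  Because the mapping
 * depends on nvwal_size, resizing the file changes where every
 * existing LSN lands, so the contents would have to be relocated too. */
static uint64_t nvwal_offset(uint64_t lsn, uint64_t nvwal_size)
{
    return lsn % nvwal_size;
}
```

For example, with a 64MB file, an LSN one "lap" plus 512 bytes in maps to offset 512; after doubling the file size, the very same LSN maps far past that, which is why truncate(1) alone is not enough.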


> The way I understand the current design is that we're essentially switching from this architecture:
>
>    clients -> wal buffers (DRAM) -> wal segments (storage)
>
> to this
>
>    clients -> wal buffers (PMEM)
>
> (Assuming there we don't have to write segments because of archiving.)

Yes. Let me describe the current PostgreSQL design and how the patchsets and other work discussed in this thread change it, AFAIU:

  - Current PostgreSQL:
    clients -[memcpy]-> buffers (DRAM) -[write]-> segments (disk)

  - Patch "pmem-with-wal-buffers-master.patch" Tomas posted:
    clients -[memcpy]-> buffers (DRAM) -[pmem_memcpy]-> mmap-ed segments (PMEM)

  - My "non-volatile WAL buffer" patchset:
    clients -[pmem_memcpy(*)]-> buffers (PMEM)

  - My other patchset mmap-ing segments as buffers:
    clients -[pmem_memcpy(*)]-> mmap-ed segments as buffers (PMEM)

  - "Non-volatile Memory Logging" in PGcon 2016 [1][2][3]:
    clients -[memcpy]-> buffers (WC[4] DRAM as pseudo PMEM) -[async write]-> segments (disk)

  (* or memcpy + pmem_flush)
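The "mmap-ed segments as buffers" idea can be sketched portably as below. On real PMEM one would use libpmem's pmem_map_file() and pmem_memcpy() (or memcpy + pmem_flush); here plain POSIX mmap + memcpy + msync stands in so the sketch runs anywhere. `append_record` and all sizes are illustrative, not the patchset's code:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Portable sketch of inserting a WAL record directly into a mapped
 * segment file.  memcpy + msync approximates pmem_memcpy (or
 * memcpy + pmem_flush) on a non-PMEM system; all names are
 * illustrative. */
static int append_record(const char *path, off_t seg_size,
                         off_t offset, const void *rec, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, seg_size) < 0) { close(fd); return -1; }

    char *seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { close(fd); return -1; }

    memcpy(seg + offset, rec, len);            /* "insert" the record */

    /* msync needs a page-aligned address; assume 4K pages here. */
    off_t aligned = offset & ~(off_t) 4095;
    msync(seg + aligned, (size_t) (offset - aligned) + len, MS_SYNC);

    munmap(seg, seg_size);
    close(fd);
    return 0;
}
```

The essential difference from the current design is visible in the sketch: the memcpy into the mapped file *is* the durable write, so there is no separate write() of a DRAM buffer to a segment file.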

And I'd say that our previous work "Introducing PMDK into PostgreSQL", presented at PGCon 2018 [5], and its patchset [6 is the latest] are based on the same idea as Tomas's patch above.


That's all for this mail. Please bear with me until the next one.

Best regards,
Takashi

[1] https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[2] https://github.com/meistervonperf/postgresql-NVM-logging
[3] https://github.com/meistervonperf/pseudo-pram
[4] https://www.kernel.org/doc/html/latest/x86/pat.html
[5] https://pgcon.org/2018/schedule/events/1154.en.html
[6] https://www.postgresql.org/message-id/CAOwnP3ONd9uXPXKoc5AAfnpCnCyOna1ru6sU=eY_4WfMjaKG9A@mail.gmail.com

--
Takashi Menjo <takashi.menjo@gmail.com>
