This has surely been discussed already, but why doesn't PostgreSQL write its cache sequentially to a temporary file?
If you do that, later reads will have to traverse that temporary file to assemble their data. You'll make every later reader pay the random I/O penalty that's being avoided right now. Checkpoints already postpone these random writes as long as possible. You have to take care of them eventually, though.
No, the log file is only used at recovery time.
In the checkpoint code:
- Loop over the cache and mark dirty buffers with BM_CHECKPOINT_NEEDED, as in the current code.
- Other workers can't write and evict these marked buffers to disk; there's a race with fsync.
- The checkpoint does its fsync now, or after the next step.
- The checkpoint loops again and saves these buffers to the log, clearing BM_CHECKPOINT_NEEDED but *not* BM_DIRTY. Of course many buffers will be written again, as they are when a checkpoint isn't running.
- Checkpoint done.
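
To make the steps above concrete, here is a rough toy model in Python, not actual PostgreSQL code. The flag names come from the proposal; the numeric flag values, the `Buffer` class, the `checkpoint` function and the list standing in for the sequential log file are all made up for illustration, and the fsync step and the locking against other workers are left out.

```python
from dataclasses import dataclass

# Flag names are taken from the proposal; the numeric values are arbitrary.
BM_DIRTY = 0x1
BM_CHECKPOINT_NEEDED = 0x2


@dataclass
class Buffer:
    """Hypothetical stand-in for a shared buffer header plus its page."""
    page_id: int
    data: bytes
    flags: int = 0


def checkpoint(cache, log):
    """Toy version of the proposed checkpoint.

    `log` is a plain list standing in for the sequential log file.
    fsync and concurrency with other workers are not modeled.
    """
    # First pass: mark every dirty buffer as needing this checkpoint.
    for buf in cache:
        if buf.flags & BM_DIRTY:
            buf.flags |= BM_CHECKPOINT_NEEDED

    # Second pass: append marked buffers to the log in cache order.
    # Only BM_CHECKPOINT_NEEDED is cleared; BM_DIRTY stays set, so the
    # buffer still gets written to its real location later.
    written = 0
    for buf in cache:
        if buf.flags & BM_CHECKPOINT_NEEDED:
            log.append((buf.page_id, buf.data))
            buf.flags &= ~BM_CHECKPOINT_NEEDED
            written += 1
    return written


cache = [Buffer(1, b"a", BM_DIRTY), Buffer(2, b"b"), Buffer(3, b"c", BM_DIRTY)]
log = []
written = checkpoint(cache, log)
```

After this runs, the two dirty pages are in the log but are still flagged dirty in the cache, which is the point of the proposal: the sequential write doesn't replace the eventual random write, it only lets the checkpoint finish without waiting for it.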
During recovery you have to load the log into the cache before applying WAL.
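
The recovery ordering could look roughly like the sketch below. It is only an illustration under simplifying assumptions: the `recover` function is hypothetical, the cache is a plain dict keyed by page id, and WAL records are modeled as full-page overwrites, which real WAL records generally are not.

```python
def recover(log, wal):
    """Toy recovery: load the checkpoint log first, then replay WAL.

    `log` and `wal` are lists of (page_id, data) tuples. Treating a WAL
    record as a whole-page overwrite is a simplification for this sketch.
    """
    cache = {}

    # Step 1: load the sequential checkpoint log into the cache.
    # If a page appears more than once, the later entry wins.
    for page_id, data in log:
        cache[page_id] = data

    # Step 2: apply WAL on top of the pages restored from the log.
    for page_id, data in wal:
        cache[page_id] = data

    return cache


recovered = recover(log=[(1, b"old"), (2, b"x")], wal=[(1, b"new")])
```

The order matters: if WAL were applied first and the log loaded afterwards, the older page images from the log would overwrite the newer WAL changes.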