Re: Unlogged relations and WAL-logging - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Unlogged relations and WAL-logging |
Date | |
Msg-id | 0fc6970f-b24d-1f3f-0ebc-2b9a0b14a497@iki.fi Whole thread Raw |
In response to | Re: Unlogged relations and WAL-logging (Heikki Linnakangas <hlinnaka@iki.fi>) |
Responses |
Re: Unlogged relations and WAL-logging
|
List | pgsql-hackers |
On 07/07/2023 18:21, Heikki Linnakangas wrote: > On 28/01/2022 15:57, Robert Haas wrote: >>> 4. Also, the smgrwrite() calls are performed before WAL-logging the >>> pages, so the page that's written to disk has 0/0 as the LSN, not the >>> LSN of the WAL record. That's harmless too, but seems a bit sloppy. >> >> That is also a mistake on my part. > > I'm still sitting on these fixes. I think the patch I posted still makes > sense, but I got carried away with a more invasive approach that > introduces a whole new set of functions for bulk-creating a relation, > which would handle WAL-logging, smgrimmedsync() and all that (see > below). We have some repetitive, error-prone code in all the index build > functions for that. But that's not backpatchable, so I'll rebase the > original approach next week. Committed this fix to master and v16. Didn't seem worth backpatching further than that, given that there is no live user-visible issue here. >>> 5. In heapam_relation_set_new_filenode(), we do this: >>> >>>> >>>> /* >>>> * If required, set up an init fork for an unlogged table so that it can >>>> * be correctly reinitialized on restart. An immediate sync is required >>>> * even if the page has been logged, because the write did not go through >>>> * shared_buffers and therefore a concurrent checkpoint may have moved the >>>> * redo pointer past our xlog record. Recovery may as well remove it >>>> * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE >>>> * record. Therefore, logging is necessary even if wal_level=minimal. >>>> */ >>>> if (persistence == RELPERSISTENCE_UNLOGGED) >>>> { >>>> Assert(rel->rd_rel->relkind == RELKIND_RELATION || >>>> rel->rd_rel->relkind == RELKIND_MATVIEW || >>>> rel->rd_rel->relkind == RELKIND_TOASTVALUE); >>>> smgrcreate(srel, INIT_FORKNUM, false); >>>> log_smgrcreate(newrnode, INIT_FORKNUM); >>>> smgrimmedsync(srel, INIT_FORKNUM); >>>> } >>> >>> The comment doesn't make much sense, we haven't written nor WAL-logged >>> any page here, with nor without the buffer cache. It made more sense >>> before commit fa0f466d53. >> >> Well, it seems to me (and perhaps I am just confused) that complaining >> that there's no page written here might be a technicality. The point >> is that there's no synchronization between the work we're doing here >> -- which is creating a fork, not writing a page -- and any concurrent >> checkpoint. So we both need to log it, and also sync it immediately. > > I see. I pushed the fix from the other thread that makes smgrcreate() > call register_dirty_segment (commit 4b4798e13). I believe that makes > this smgrimmedsync() unnecessary. If a concurrent checkpoint happens > with a redo pointer greater than this WAL record, it must've received > the fsync request created by smgrcreate(). That depends on the fact that > we write the WAL record *after* smgrcreate(). Subtle.. > > Hmm, we have a similar smgrimmedsync() call after index build, because > we have written pages directly with smgrextend(skipFsync=true). If no > checkpoints have occurred during the index build, we could call > register_dirty_segment() instead of smgrimmedsync(). That would avoid > the fsync() latency when creating an index on an empty or small index. > > This is all very subtle to get right though. That's why I'd like to > invent a new bulk-creation facility that would handle this stuff, and > make the callers less error-prone. Having a more generic and less error-prone bulk-creation mechanism is still on my long TODO list.. -- Heikki Linnakangas Neon (https://neon.tech)
pgsql-hackers by date: