Re: Unlogged relations and WAL-logging - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Unlogged relations and WAL-logging
Date
Msg-id 0fc6970f-b24d-1f3f-0ebc-2b9a0b14a497@iki.fi
Whole thread Raw
In response to Re: Unlogged relations and WAL-logging  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Unlogged relations and WAL-logging
List pgsql-hackers
On 07/07/2023 18:21, Heikki Linnakangas wrote:
> On 28/01/2022 15:57, Robert Haas wrote:
>>> 4. Also, the smgrwrite() calls are performed before WAL-logging the
>>> pages, so the page that's written to disk has 0/0 as the LSN, not the
>>> LSN of the WAL record. That's harmless too, but seems a bit sloppy.
>>
>> That is also a mistake on my part.
> 
> I'm still sitting on these fixes. I think the patch I posted still makes
> sense, but I got carried away with a more invasive approach that
> introduces a whole new set of functions for bulk-creating a relation,
> which would handle WAL-logging, smgrimmedsync() and all that (see
> below). We have some repetitive, error-prone code in all the index build
> functions for that. But that's not backpatchable, so I'll rebase the
> original approach next week.

Committed this fix to master and v16. Didn't seem worth backpatching 
further than that, given that there is no live user-visible issue here.

>>> 5. In heapam_relation_set_new_filenode(), we do this:
>>>
>>>>
>>>>         /*
>>>>          * If required, set up an init fork for an unlogged table so that it can
>>>>          * be correctly reinitialized on restart.  An immediate sync is required
>>>>          * even if the page has been logged, because the write did not go through
>>>>          * shared_buffers and therefore a concurrent checkpoint may have moved the
>>>>          * redo pointer past our xlog record.  Recovery may as well remove it
>>>>          * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>>>>          * record. Therefore, logging is necessary even if wal_level=minimal.
>>>>          */
>>>>         if (persistence == RELPERSISTENCE_UNLOGGED)
>>>>         {
>>>>                 Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>>>                            rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>>>                            rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>>>>                 smgrcreate(srel, INIT_FORKNUM, false);
>>>>                 log_smgrcreate(newrnode, INIT_FORKNUM);
>>>>                 smgrimmedsync(srel, INIT_FORKNUM);
>>>>         }
>>>
>>> The comment doesn't make much sense, we haven't written nor WAL-logged
>>> any page here, with nor without the buffer cache. It made more sense
>>> before commit fa0f466d53.
>>
>> Well, it seems to me (and perhaps I am just confused) that complaining
>> that there's no page written here might be a technicality. The point
>> is that there's no synchronization between the work we're doing here
>> -- which is creating a fork, not writing a page -- and any concurrent
>> checkpoint. So we both need to log it, and also sync it immediately.
> 
> I see. I pushed the fix from the other thread that makes smgrcreate()
> call register_dirty_segment (commit 4b4798e13). I believe that makes
> this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
> with a redo pointer greater than this WAL record, it must've received
> the fsync request created by smgrcreate(). That depends on the fact that
> we write the WAL record *after* smgrcreate(). Subtle..
> 
> Hmm, we have a similar smgrimmedsync() call after index build, because
> we have written pages directly with smgrextend(skipFsync=true). If no
> checkpoints have occurred during the index build, we could call
> register_dirty_segment() instead of smgrimmedsync(). That would avoid
> the fsync() latency when creating an index on an empty or small index.
> 
> This is all very subtle to get right though. That's why I'd like to
> invent a new bulk-creation facility that would handle this stuff, and
> make the callers less error-prone.

Having a more generic and less error-prone bulk-creation mechanism is 
still on my long TODO list..

-- 
Heikki Linnakangas
Neon (https://neon.tech)




pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: Schema variables - new implementation for Postgres 15
Next
From: Alvaro Herrera
Date:
Subject: using SysCacheGetAttrNotNull in place of heap_getattr