From: Andres Freund <andres@anarazel.de>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: 20190730080220.bvvlezte5lkxthd6@alap3.anarazel.de
List: pgsql-hackers

Hi,

Amit, short note: The patches aren't attached in patch order. Obviously
a minuscule thing, but it'd still be nicer if they were.

Dilip, this also contains the start of a review for the undo record
interface further down.


On 2019-07-29 16:35:20 -0700, Andres Freund wrote:
> <more after some errands>

Here we go.


I'm a bit worried about expanding the use of
ReadBufferWithoutRelcache(). Not so much because of the relcache itself,
but because it requires doing separate smgropen() calls. While not
crazily expensive, it's also not free. Especially combined with closing
all such relations at transaction end (c.f. AtEOXact_SMgr).

I'm somewhat inclined to think that this requires a slightly bigger
refactoring than done in this patch. Imo at the very least the smgr
entries ought not to be unowned. But working towards not having to
re-open the smgr entry for every single trivial request ought to be part
of this too.


>  /*
> + * ForgetLocalBuffer - drop a buffer from local buffers
> + *
> + * This is similar to bufmgr.c's ForgetBuffer, except that we do not need
> + * to do any locking since this is all local.  As with that function, this
> + * must be used very carefully, since we'll cheerfully throw away dirty
> + * buffers without any attempt to write them.
> + */
> +void
> +ForgetLocalBuffer(SmgrId smgrid, RelFileNode rnode, ForkNumber forkNum,
> +                  BlockNumber blockNum)
> +{
> +    SMgrRelation smgr = smgropen(smgrid, rnode, BackendIdForTempRelations());
> +    BufferTag    tag;                    /* identity of target block */
> +    LocalBufferLookupEnt *hresult;
> +    BufferDesc *bufHdr;
> +    uint32        buf_state;
> +
> +    /*
> +     * If somehow this is the first request in the session, there's nothing to
> +     * do.  (This probably shouldn't happen, though.)
> +     */
> +    if (LocalBufHash == NULL)
> +        return;

Given that the call to ForgetLocalBuffer() currently is unconditional,
rather than checking the persistence of the undo log, I don't see why
this wouldn't happen?


> +    /* mark buffer invalid */
> +    bufHdr = GetLocalBufferDescriptor(hresult->id);
> +    CLEAR_BUFFERTAG(bufHdr->tag);
> +    buf_state = pg_atomic_read_u32(&bufHdr->state);
> +    buf_state &= ~(BM_VALID | BM_TAG_VALID | BM_DIRTY);
> +    pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);

Shouldn't this also clear out at least the usagecount? I'd probably just
use
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
like InvalidateBuffer() does.

I'd probably also add an assert ensuring that the refcount is zero.
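
I.e. something like (untested):

    bufHdr = GetLocalBufferDescriptor(hresult->id);
    buf_state = pg_atomic_read_u32(&bufHdr->state);

    /* there should be no pins left on the buffer */
    Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

    /* mark buffer invalid, and also reset flags and usagecount */
    CLEAR_BUFFERTAG(bufHdr->tag);
    buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
    pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);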


> @@ -97,7 +116,6 @@ static dlist_head unowned_relns;
>  /* local function prototypes */
>  static void smgrshutdown(int code, Datum arg);
>
> -
>  /*
>   *    smgrinit(), smgrshutdown() -- Initialize or shut down storage
>   *                                  managers.

spurious change.


> +/*
> + * While md.c expects random access and has a small number of huge
> + * segments, undofile.c manages a potentially very large number of smaller
> + * segments and has a less random access pattern.  Therefore, instead of
> + * keeping a potentially huge array of vfds we'll just keep the most
> + * recently accessed N.
> + *
> + * For now, N == 1, so we just need to hold onto one 'File' handle.
> + */
> +typedef struct UndoFileState
> +{
> +    int        mru_segno;
> +    File    mru_file;
> +} UndoFileState;

IMO N==1 gotta change before this is committable. There are too many
design issues that could creep in without fixing this (e.g. not being
careful enough about closing cached file handles after certain
operations), and those will be harder to fix later.


> +void
> +undofile_open(SMgrRelation reln)
> +{
> +    UndoFileState *state;
> +
> +    state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
> +    reln->private_data = state;
> +}

Hm, I don't quite like this 'private_data' design. Was that design
discussed anywhere?

Intuitively ISTM that it'd be better if SMgrRelation were embedded in a
per-SMGR type struct. Obviously that'd not quite work as things are set
up, because the size has to be constant due to SMgrRelationHash. But I
think it might be good anyway if that hash just stored a pointer to
the relevant SMgrRelation.
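
Roughly what I have in mind (untested sketch, type/field names invented):

    typedef struct UndoFileRelationData
    {
        SMgrRelationData reln;    /* generic part, must be first */

        /* undofile.c private state, replacing private_data */
        int        mru_segno;
        File    mru_file;
    } UndoFileRelationData;

with SMgrRelationHash then storing just an SMgrRelation pointer, so each
smgr implementation can allocate a struct of the size it actually needs.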


> +void
> +undofile_close(SMgrRelation reln, ForkNumber forknum)
> +{
> +}

Hm, aren't we leaking private_data right now?


> +void
> +undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
> +{
> +    /*
> +     * File creation is managed by undolog.c, but xlogutils.c likes to call
> +     * this just in case.  Ignore.
> +     */
> +}

Phew, this is not pretty.


> +bool
> +undofile_exists(SMgrRelation reln, ForkNumber forknum)
> +{
> +    elog(ERROR, "undofile_exists is not supported");
> +
> +    return false;        /* not reached */
> +}

This one I actually find bad. It seems pretty reasonable for
SMGR-kind-agnostic code to be able to know whether a file exists or
not.


> +void
> +undofile_extend(SMgrRelation reln, ForkNumber forknum,
> +                BlockNumber blocknum, char *buffer,
> +                bool skipFsync)
> +{
> +    elog(ERROR, "undofile_extend is not supported");
> +}

This one I have far fewer problems with.


> +void
> +undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
> +              char *buffer)
> +{
> +    File        file;
> +    off_t        seekpos;
> +    int            nbytes;
> +
> +    Assert(forknum == MAIN_FORKNUM);
> +    file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
> +    seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
> +    Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);

I'd not name this seekpos, given that we're not seeking...


> +BlockNumber
> +undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
> +{
> +    /*
> +     * xlogutils.c likes to call this to decide whether to read or extend; for
> +     * now we lie and say the relation is big as possible.
> +     */
> +    return UndoLogMaxSize / BLCKSZ;
> +}

That's imo not ok.




>  /*
> + *    check_for_live_undo_data()
> + *
> + *    Make sure there are no live undo records (aborted transactions that have
> + *    not been rolled back, or committed transactions whose undo data has not
> + *    yet been discarded).
> + */
> +static void
> +check_for_undo_data(ClusterInfo *cluster)
> +{
> +    PGresult   *res;
> +    PGconn       *conn = connectToServer(cluster, "template1");
> +
> +    if (GET_MAJOR_VERSION(old_cluster.major_version) < 1200)
> +        return;

Needs to be updated now.


> --- a/src/bin/pg_upgrade/exec.c
> +++ b/src/bin/pg_upgrade/exec.c
> @@ -351,6 +351,10 @@ check_data_dir(ClusterInfo *cluster)
>          check_single_dir(pg_data, "pg_clog");
>      else
>          check_single_dir(pg_data, "pg_xact");
> +
> +    /* pg_undo is new in v13 */
> +    if (GET_MAJOR_VERSION(cluster->major_version) >= 1200)
> +        check_single_dir(pg_data, "pg_undo");
>  }

The comment talks about v13, but code checks for v12?


> +++ b/src/bin/pg_upgrade/undo.c
> @@ -0,0 +1,292 @@
> +/*
> + *    undo.c
> + *
> + *    Support for upgrading undo logs.\
> + *    Copyright (c) 2019, PostgreSQL Global Development Group
> + *    src/bin/pg_upgrade/undo.c
> + */

A small design note here seems like a good idea.


> +/* Undo log statuses. */
> +typedef enum
> +{
> +    UNDO_LOG_STATUS_UNUSED = 0,
> +    UNDO_LOG_STATUS_ACTIVE,
> +    UNDO_LOG_STATUS_FULL,
> +    UNDO_LOG_STATUS_DISCARDED
> +} UndoLogStatus;

An explanation of what these mean would be good.


> +/*
> + * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
> + * enumerator.
> + */
> +#define UndoPersistenceForRelPersistence(rp)                        \
> +    ((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :            \
> +     (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
> +
> +/*
> + * Convert from UndoPersistence to a relpersistence value.
> + */
> +#define RelPersistenceForUndoPersistence(up)                \
> +    ((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :    \
> +     (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :        \
> +     RELPERSISTENCE_TEMP)

We shouldn't add macros with multiple evaluation hazards without need.
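
These'd imo be better as static inline functions, e.g. (untested):

    static inline UndoPersistence
    UndoPersistenceForRelPersistence(char rp)
    {
        switch (rp)
        {
            case RELPERSISTENCE_PERMANENT:
                return UNDO_PERMANENT;
            case RELPERSISTENCE_UNLOGGED:
                return UNDO_UNLOGGED;
            default:
                return UNDO_TEMP;
        }
    }

which evaluates the argument exactly once.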



> +/*
> + * Properties of an undo log that don't have explicit WAL records logging
> + * their changes, to reduce WAL volume.  Instead, they change incrementally
> + * whenever data is inserted as a result of other WAL records.  Since the
> + * values recorded in an online checkpoint may be out of the sync (ie not the
> + * correct values as at the redo LSN), these are backed up in buffer data on
> + * first change after each checkpoint.
> + */

s/on first/on the first/?


> +/*
> + * Instantiate fast inline hash table access functions.  We use an identity
> + * hash function for speed, since we already have integers and don't expect
> + * many collisions.
> + */
> +#define SH_PREFIX undologtable
> +#define SH_ELEMENT_TYPE UndoLogTableEntry
> +#define SH_KEY_TYPE UndoLogNumber
> +#define SH_KEY number
> +#define SH_HASH_KEY(tb, key) (key)
> +#define SH_EQUAL(tb, a, b) ((a) == (b))
> +#define SH_SCOPE static inline
> +#define SH_DECLARE
> +#define SH_DEFINE
> +#include "lib/simplehash.h"
> +
> +extern PGDLLIMPORT undologtable_hash *undologtable_cache;

Why isn't this defined in a .c file? I've a bit of a hard time believing
that making UndoLogGetTableEntry() an extern function would be a
meaningful overhead compared to the operations this is used for. Not
exposing those details seems nicer to me.
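
I.e. move the simplehash instantiation into a .c file and just expose
something like (signature guessed from the name):

    extern UndoLogTableEntry *UndoLogGetTableEntry(UndoLogNumber logno);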



> +/* Create a new undo log. */
> +typedef struct xl_undolog_create
> +{
> +    UndoLogNumber logno;
> +    Oid        tablespace;
> +    UndoPersistence persistence;
> +} xl_undolog_create;
> +
> +/* Extend an undo log by adding a new segment. */
> +typedef struct xl_undolog_extend
> +{
> +    UndoLogNumber logno;
> +    UndoLogOffset end;
> +} xl_undolog_extend;
> +
> +/* Discard space, and possibly destroy or recycle undo log segments. */
> +typedef struct xl_undolog_discard
> +{
> +    UndoLogNumber logno;
> +    UndoLogOffset discard;
> +    UndoLogOffset end;
> +    TransactionId latestxid;    /* latest xid whose undolog are discarded. */
> +    bool          entirely_discarded;
> +} xl_undolog_discard;
> +
> +/* Switch undo log. */
> +typedef struct xl_undolog_switch
> +{
> +    UndoLogNumber logno;
> +    UndoRecPtr prevlog_xact_start;
> +    UndoRecPtr prevlog_last_urp;
> +} xl_undolog_switch;

I'd add flags to these. Perhaps I'm overly cautious, but I found that
extremely valuable when having to fix bugs in already released
versions. And these aren't so frequent that that'd hurt.  Obviously
entirely_discarded would then be a flag.
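
E.g. something like (untested, flag name invented):

    typedef struct xl_undolog_discard
    {
        UndoLogNumber logno;
        UndoLogOffset discard;
        UndoLogOffset end;
        TransactionId latestxid;    /* latest xid whose undo is discarded */
        uint8          flags;
    } xl_undolog_discard;

    #define XLU_DISCARD_ENTIRELY    (1 << 0)    /* was entirely_discarded */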



> +extern void undofile_init(void);
> +extern void undofile_shutdown(void);
> +extern void undofile_open(SMgrRelation reln);
> +extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
> +extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
> +                            bool isRedo);
> +extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
> +extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
> +                            bool isRedo);
> +extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
> +         BlockNumber blocknum, char *buffer, bool skipFsync);
> +extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
> +           BlockNumber blocknum);
> +extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
> +                          BlockNumber blocknum, char *buffer);
> +extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
> +        BlockNumber blocknum, char *buffer, bool skipFsync);
> +extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
> +            BlockNumber blocknum, BlockNumber nblocks);
> +extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
> +extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
> +           BlockNumber nblocks);
> +extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
> +
> +/* Callbacks used by sync.c. */
> +extern int undofile_syncfiletag(const FileTag *tag, char *path);
> +extern bool undofile_filetagmatches(const FileTag *tag, const FileTag *candidate);
> +
> +/* Management of checkpointer requests. */
> +extern void undofile_request_sync(UndoLogNumber logno, BlockNumber segno,
> +                                  Oid tablespace);
> +extern void undofile_forget_sync(UndoLogNumber logno, BlockNumber segno,
> +                                 Oid tablespace);
> +extern void undofile_forget_sync_tablespace(Oid tablespace);
> +extern void undofile_request_sync_dir(Oid tablespace);

Istm that it'd be better to only have mdopen/... referenced from
smgrsw() and then have f_smgr be included (as a const f_smgr* const) as
part of SMgrRelation.  For one, that'll allow us to hide a lot more of
this into md.c/undofile.c. It's also a pathway to being more extensible.
I think performance ought to be at least as good (as we currently have
to read SMgrRelation->smgr_which, then read the callback from smgrsw
(which is probably combined into one op on many platforms), and then
call it).  Could then also just make the simple smgr* functions static
inline wrappers, as that doesn't require exposing f_smgr anymore.
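
I.e. roughly (untested sketch):

    typedef struct SMgrRelationData
    {
        /* ... */
        const struct f_smgr *const smgr;    /* callbacks for this smgr kind */
    } SMgrRelationData;

    static inline void
    smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
             char *buffer)
    {
        reln->smgr->smgr_read(reln, forknum, blocknum, buffer);
    }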



> From 880f25a543783f8dc3784a51ab1c29b72f6b5b27 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilip.kumar@enterprisedb.com>
> Date: Fri, 7 Jun 2019 15:03:37 +0530
> Subject: [PATCH 06/14] Defect and enhancement in multi-log support

That's imo not a good thing to have in a patch series intended to be
reviewed, especially relatively early in the series. At least the commit
message ought to include an explanation.


> Subject: [PATCH 07/14] Provide interfaces to store and fetch undo records.
>
> Add the capability to form undo records and store them in undo logs.  We
> also provide the capability to fetch the undo records.  This layer will use
> undo-log-storage to reserve the space for the undo records and buffer
> management routines to write and read the undo records.
>

> Undo records are stored in sequential order in the undo log.

Maybe "In each und log undo records are stored in sequential order."?



> +++ b/src/backend/access/undo/README.undointerface
> @@ -0,0 +1,29 @@
> +Undo record interface layer
> +---------------------------
> +This is the next layer which sits on top of the undo log storage, which will
> +provide an interface for prepare, insert, or fetch the undo records.  This
> +layer will use undo-log-storage to reserve the space for the undo records
> +and buffer management routine to write and read the undo records.

The reference to "undo log storage" kinda seems like a reference into
nothingness...


> +Writing an undo record
> +----------------------
> +To prepare an undo record, first, it will allocate required space using
> +undo log storage module.  Next, it will pin and lock the required buffers and
> +return an undo record pointer where it will insert the record.  Finally, it
> +calls the Insert routine for final insertion of prepared record.  Additionally,
> +there is a mechanism for multi-insert, wherein multiple records are prepared
> +and inserted at a time.

I'm not sure what this is telling me. Who is "it"?

To me the filename ("interface") and the title of this section suggest
this provides documentation on how to write code to insert undo
records. But I don't think it does.


> +Fetching and undo record
> +------------------------
> +To fetch an undo record, a caller must provide a valid undo record pointer.
> +Optionally, the caller can provide a callback function with the information of
> +the block and offset, which will help in faster retrieval of undo record,
> +otherwise, it has to traverse the undo-chain.

> +There is also an interface to bulk fetch the undo records.  Where the caller
> +can provide a TO and FROM undo record pointer and the memory limit for storing
> +the undo records.  This API will return all the undo record between FROM and TO
> +undo record pointers if they can fit into provided memory limit otherwise, it
> +return whatever can fit into the memory limit.  And, the caller can call it
> +repeatedly until it fetches all the records.

There's a lot of terminology in this file that's not been introduced. I
think this needs to be greatly expanded and restructured to allow people
unfamiliar with the code to benefit.


> +/*-------------------------------------------------------------------------
> + *
> + * undoaccess.c
> + *      entry points for inserting/fetching undo records

> + * NOTES:
> + * Undo record layout:
> + *
> + * Undo records are stored in sequential order in the undo log.  Each undo
> + * record consists of a variable length header, tuple data, and payload
> + * information.

Is that actually true? There are records without tuples, no?

> The first undo record of each transaction contains a
> + * transaction header that points to the next transaction's start
> header.

Seems like this needs to reference different persistence levels,
otherwise it seems misleading, given there can be multiple first records
in multiple undo logs?


> + * This allows us to discard the entire transaction's log at one-shot
> rather

s/at/in/

> + * than record-by-record.  The callers are not aware of transaction header,

s/of/of the/

> + * this is entirely maintained and used by undo record layer.   See

s/this/it/

> + * undorecord.h for detailed information about undo record header.

s/undo record/the undo record/


I think at the very least there's explanations missing for:
- what is the locking protocol for multiple buffers
- what are the contexts for insertion
- what phases an undo insertion happens in
- updating previous records in general
- what "packing" actually is


> +
> +/* Prototypes for static functions. */


Don't think we commonly include that...

> +static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
> +                 UndoRecPtr urp, RelFileNode rnode,
> +                 UndoPersistence persistence,
> +                 Buffer *prevbuf);
> +static int UndoRecordPrepareTransInfo(UndoRecordInsertContext *context,
> +                           UndoRecPtr xact_urp, int size, int offset);
> +static void UndoRecordUpdateTransInfo(UndoRecordInsertContext *context,
> +                          int idx);
> +static void UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context,
> +                                    UndoRecPtr urecptr, UndoRecPtr xact_urp);
> +static int UndoGetBufferSlot(UndoRecordInsertContext *context,
> +                  RelFileNode rnode, BlockNumber blk,
> +                  ReadBufferMode rbm);
> +static uint16 UndoGetPrevRecordLen(UndoRecPtr urp, Buffer input_buffer,
> +                     UndoPersistence upersistence);
> +
> +/*
> + * Structure to hold the prepared undo information.
> + */
> +struct PreparedUndoSpace
> +{
> +    UndoRecPtr    urp;            /* undo record pointer */
> +    UnpackedUndoRecord *urec;    /* undo record */
> +    uint16        size;            /* undo record size */
> +    int            undo_buffer_idx[MAX_BUFFER_PER_UNDO];    /* undo_buffer array
> +                                                         * index */
> +};
> +
> +/*
> + * This holds undo buffers information required for PreparedUndoSpace during
> + * prepare undo time.  Basically, during prepare time which is called outside
> + * the critical section we will acquire all necessary undo buffers pin and lock.
> + * Later, during insert phase we will write actual records into thse buffers.
> + */
> +struct PreparedUndoBuffer
> +{
> +    UndoLogNumber logno;        /* Undo log number */
> +    BlockNumber blk;            /* block number */
> +    Buffer        buf;            /* buffer allocated for the block */
> +    bool        zero;            /* new block full of zeroes */
> +};

Most files define datatypes before function prototypes, because
functions may reference the datatypes.


> +/*
> + * Prepare to update the transaction header
> + *
> + * It's a helper function for PrepareUpdateNext and
> + * PrepareUpdateUndoActionProgress

This doesn't really explain much.  PrepareUpdateUndoActionProgress
doesn't exist. I assume it's UndoRecordPrepareApplyProgress from 0012?


> + * xact_urp  - undo record pointer to be updated.
> + * size - number of bytes to be updated.
> + * offset - offset in undo record where to start update.
> + */

These comments seem redundant with the parameter names.


> +static int
> +UndoRecordPrepareTransInfo(UndoRecordInsertContext *context,
> +                           UndoRecPtr xact_urp, int size, int offset)
> +{
> +    BlockNumber cur_blk;
> +    RelFileNode rnode;
> +    int            starting_byte;
> +    int            bufidx;
> +    int            index = 0;
> +    int            remaining_bytes;
> +    XactUndoRecordInfo *xact_info;
> +
> +    xact_info = &context->xact_urec_info[context->nxact_urec_info];
> +
> +    UndoRecPtrAssignRelFileNode(rnode, xact_urp);
> +    cur_blk = UndoRecPtrGetBlockNum(xact_urp);
> +    starting_byte = UndoRecPtrGetPageOffset(xact_urp);
> +
> +    /* Remaining bytes on the current block. */
> +    remaining_bytes = BLCKSZ - starting_byte;
> +
> +    /*
> +     * Is there some byte of the urec_next on the current block, if not then
> +     * start from the next block.
> +     */

This comment needs rephrasing.


> +    /* Loop until we have fetched all the buffers in which we need to write. */
> +    while (size > 0)
> +    {
> +        bufidx = UndoGetBufferSlot(context, rnode, cur_blk, RBM_NORMAL);
> +        xact_info->idx_undo_buffers[index++] = bufidx;
> +        size -= (BLCKSZ - starting_byte);
> +        starting_byte = UndoLogBlockHeaderSize;
> +        cur_blk++;
> +    }

So, this locks a very large number of undo buffers at the same time, do
I see that correctly?  What guarantees that there are no deadlocks due
to multiple buffers locked at the same time (I guess the order inside
the log)? What guarantees that this is a small enough number that we can
even lock all of them at the same time?

Why do we need to lock all of them at the same time? That's not clear to
me.

Also, why do we need code to lock an unbounded number here? It seems
hard to imagine we'd ever want to update more than something around 8
bytes? Shouldn't that at the most require two buffers?


> +/*
> + * Prepare to update the previous transaction's next undo pointer.
> + *
> + * We want to update the next transaction pointer in the previous transaction's
> + * header (first undo record of the transaction).  In prepare phase we will
> + * unpack that record and lock the necessary buffers which we are going to
> + * overwrite and store the unpacked undo record in the context.  Later,
> + * UndoRecordUpdateTransInfo will overwrite the undo record.
> + *
> + * xact_urp - undo record pointer of the previous transaction's header
> + * urecptr - current transaction's undo record pointer which need to be set in
> + *             the previous transaction's header.
> + */
> +static void
> +UndoRecordPrepareUpdateNext(UndoRecordInsertContext *context,
> +                            UndoRecPtr urecptr, UndoRecPtr xact_urp)

That name imo is confusing - it's not clear that it's not actually about
the next record or such.


> +{
> +    UndoLogSlot *slot;
> +    int            index = 0;
> +    int            offset;
> +
> +    /*
> +     * The absence of previous transaction's undo indicate that this backend

*indicates


> +    /*
> +     * Acquire the discard lock before reading the undo record so that discard
> +     * worker doesn't remove the record while we are in process of reading it.
> +     */

*the discard worker


> +    LWLockAcquire(&slot->discard_update_lock, LW_SHARED);
> +    /* Check if it is already discarded. */
> +    if (UndoLogIsDiscarded(xact_urp))
> +    {
> +        /* Release lock and return. */
> +        LWLockRelease(&slot->discard_update_lock);
> +        return;
> +    }

Ho, hum. I don't quite remember what we decided in the discussion about
not having to use the discard lock for this purpose.


> +    /* Compute the offset of the uur_next in the undo record. */
> +    offset = SizeOfUndoRecordHeader +
> +                    offsetof(UndoRecordTransaction, urec_next);
> +
> +    index = UndoRecordPrepareTransInfo(context, xact_urp,
> +                                       sizeof(UndoRecPtr), offset);
> +    /*
> +     * Set the next pointer in xact_urec_info, this will be overwritten in
> +     * actual undo record during update phase.
> +     */
> +    context->xact_urec_info[index].next = urecptr;

What does "this will be overwritten mean"? It sounds like "context->xact_urec_info[index].next"
would be overwritten, but that can't be true.


> +    /* We can now release the discard lock as we have read the undo record. */
> +    LWLockRelease(&slot->discard_update_lock);
> +}

Hm. Because you expect it to be blocked behind the content lwlocks for
the buffers?


> +/*
> + * Overwrite the first undo record of the previous transaction to update its
> + * next pointer.
> + *
> + * This will insert the already prepared record by UndoRecordPrepareTransInfo.

It doesn't actually appear to insert any records. At least not a record
in the way the rest of the file uses that term?


> + * This must be called under the critical section.

s/under the/in a/

Think that should be asserted.
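
I.e. add something like:

    Assert(CritSectionCount > 0);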


> +    /*
> +     * Start writing directly from the write offset calculated during prepare
> +     * phase.  And, loop until we write required bytes.
> +     */

Why do we do offset calculations multiple times? Seems like all the
offsets, and the split, should be computed in exactly one place.


> +/*
> + * Find the block number in undo buffer array
> + *
> + * If it is present then just return its index otherwise search the buffer and
> + * insert an entry and lock the buffer in exclusive mode.
> + *
> + * Undo log insertions are append-only.  If the caller is writing new data
> + * that begins exactly at the beginning of a page, then there cannot be any
> + * useful data after that point.  In that case RBM_ZERO can be passed in as
> + * rbm so that we can skip a useless read of a disk block.  In all other
> + * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
> + * happen to be already in the buffer pool.
> + */
> +static int
> +UndoGetBufferSlot(UndoRecordInsertContext *context,
> +                  RelFileNode rnode,
> +                  BlockNumber blk,
> +                  ReadBufferMode rbm)
> +{
> +    int            i;
> +    Buffer        buffer;
> +    XLogRedoAction action = BLK_NEEDS_REDO;
> +    PreparedUndoBuffer *prepared_buffer;
> +    UndoPersistence persistence = context->alloc_context.persistence;
> +
> +    /* Don't do anything, if we already have a buffer pinned for the block. */

As the code stands, it's locked, not just pinned.


> +    for (i = 0; i < context->nprepared_undo_buffer; i++)
> +    {

How large do we expect this to get at most?


> +    /*
> +     * We did not find the block so allocate the buffer and insert into the
> +     * undo buffer array.
> +     */
> +    if (InRecovery)
> +        action = XLogReadBufferForRedoBlock(context->alloc_context.xlog_record,
> +                                            SMGR_UNDO,
> +                                            rnode,
> +                                            UndoLogForkNum,
> +                                            blk,
> +                                            rbm,
> +                                            false,
> +                                            &buffer);

Why is not locking the buffer correct here? Can't there be concurrent
reads during hot standby?


> +/*
> + * This function must be called before all the undo records which are going to
> + * get inserted under a single WAL record.

How can a function be called "before all the undo records"?


> + * nprepared - This defines the max number of undo records that can be
> + * prepared before inserting them.
> + */
> +void
> +BeginUndoRecordInsert(UndoRecordInsertContext *context,
> +                      UndoPersistence persistence,
> +                      int nprepared,
> +                      XLogReaderState *xlog_record)

There definitely needs to be an explanation about xlog_record. But also
about memory management etc. Looks like one e.g. can't call this from a
short-lived memory context.


> +/*
> + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
> + * intended to insert.  Upon return, the necessary undo buffers are pinned and
> + * locked.

Again, how is deadlocking / max number of buffers handled, and why do
they all need to be locked at the same time?


> +    /*
> +     * We don't yet know if this record needs a transaction header (ie is the
> +     * first undo record for a given transaction in a given undo log), because
> +     * you can only find out by allocating.  We'll resolve this circularity by
> +     * allocating enough space for a transaction header.  We'll only advance
> +     * by as many bytes as we turn out to need.
> +     */

Why can we only find this out by allocating?  This seems like an API
deficiency of the storage layer to me. The information is in the undo
log slot's metadata, no?


> +    urec->uur_next = InvalidUndoRecPtr;
> +    UndoRecordSetInfo(urec);
> +    urec->uur_info |= UREC_INFO_TRANSACTION;
> +    urec->uur_info |= UREC_INFO_LOGSWITCH;
> +    size = UndoRecordExpectedSize(urec);
> +
> +    /* Allocate space for the record. */
> +    if (InRecovery)
> +    {
> +        /*
> +         * We'll figure out where the space needs to be allocated by
> +         * inspecting the xlog_record.
> +         */
> +        Assert(context->alloc_context.persistence == UNDO_PERMANENT);
> +        urecptr = UndoLogAllocateInRecovery(&context->alloc_context,
> +                                            XidFromFullTransactionId(txid),
> +                                            size,
> +                                            &need_xact_header,
> +                                            &last_xact_start,
> +                                            &prevlog_xact_start,
> +                                            &prevlogurp);
> +    }
> +    else
> +    {
> +        /* Allocate space for writing the undo record. */

That's basically the same comment as before the if.


> +        urecptr = UndoLogAllocate(&context->alloc_context,
> +                                  size,
> +                                  &need_xact_header, &last_xact_start,
> +                                  &prevlog_xact_start, &prevlog_insert_urp);
> +
> +        /*
> +         * If prevlog_xact_start is a valid undo record pointer that means
> +         * this transaction's undo records are split across undo logs.
> +         */
> +        if (UndoRecPtrIsValid(prevlog_xact_start))
> +        {
> +            uint16        prevlen;
> +
> +            /*
> +             * If undo log is switch during transaction then we must get a

"is switch" is right.

> +/*
> + * Insert a previously-prepared undo records.

s/a//


More tomorrow.


Greetings,

Andres Freund


