
From: Andres Freund <andres@anarazel.de>
Subject: Re: POC: Cleaning up orphaned files using undo logs
Date: 2019-08-14
Msg-id: 20190814065745.2faw3hirvfhbrdwe@alap3.anarazel.de
In response to: Re: POC: Cleaning up orphaned files using undo logs (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2019-08-06 14:18:42 -0700, Andres Freund wrote:
> Here's the last section of my low-level review. Plan to write a higher
> level summary afterwards, now that I have a better picture of the code.

General comments:

- For a feature of this complexity, there's very little architectural
  documentation. Some individual parts have a bit, but there's basically
  nothing over-arching. That makes it extremely hard for anybody who is
  not already involved to understand the design constraints, and even
  for the people who are involved it's hard to understand.

  I think it's very important for this to have a document that first
  explains what the goals, and non-goals, of this feature are, and then
  explains the chosen architecture with reference to those
  constraints.  Once that's there it's a lot easier to review this
  patchset, to discuss the overall architecture, etc.


- There are too many different email threads and git branches. The same
  patches are discussed in different threads, and layers exist in
  slightly diverging versions in different git trees. That again makes
  it very hard for anybody not primarily focussing on undo to join the
  discussion.

  I think most of the older git branches should be renamed to something
  indicating their historic status. The remaining branches
  should be referenced from a wiki page (linked to in each submission of
  a new patch version), explaining what they're about.  I don't think
  it's realistic to have people infer meaning from the current branch
  names (undo, proposal-undo-log, undo-log-storage, undo-log-storage-v2,
  undo_interface_v1, undoprocessing).

  Given the size of the overall project it's quite possibly not
  realistic to manage the whole work in a single git branch. With
  separate git branches, as currently done, it's however hard to
  understand which version of what layer is being used.  I think at the
  very least higher layers need to indicate which version of the
  underlying layers is being used.  I suggest just adding a "v01: "
  style prefix to the subjects of all commits that a branch is rebased
  onto.

  It's also currently hard to understand what version of a layer is
  being discussed. I think submissions all need to include a version
  number (cf. git format-patch's -v option), and that version ought to
  be included in the subject line. Each new major version of a patch
  should be started as a reply to the first message of a thread, to
  keep the structure of a discussion in a manageable shape. New versions
  should include explicit explanations about the major changes compared
  to the last version.


- The code quality of pretty significant parts of the patchset is not
  even close to being good enough. There are areas with a lot of code
  duplication. There are very few higher level explanations for
  interfaces. There are a lot of "i++; /* increment i to increment it */"
  style comments, but not enough higher level comments. There are
  significant issues with parts of the code that aren't noted anywhere
  in comments, leading to reviewers having to repeatedly re-discover
  them (and wasting time on that).

  There are different naming styles in related code without a discernible
  pattern (e.g. UndoRecordSetInfo being followed by
  get_undo_rec_cid_offset). The word-ordering of different layers is
  confusing (e.g. BeginUndoRecordInsert vs UndoLogBeginInsert vs
  PrepareUndoInsert). Different important externally visible functions
  have names that don't allow one to determine which is supposed to do
  what (PrepareUndoInsert vs BeginUndoRecordInsert).


More specific comments:

- The whole sequencing of undo record insertion in combination with WAL
  logging does not appear to be right. It's a bit hard to say, because
  there's very little documentation on what the intended default
  sequence of operations is.

  My understanding is that the currently expected pattern is to:

  1) Collect information / perform work needed to perform the action
     that needs to be UNDO logged. E.g. perform visibility
     determinations, wait for lockers, compute new infomask, etc.

     This will likely end with the "content" page(s) (e.g. a table's
     page) being exclusively locked.

  2) Estimate space needed for all UNDO logging (BeginUndoRecordInsert)

  3) Prepare each undo record; this includes building the content for
     each record. PrepareUndoInsert(). This acquires, pins and locks
     buffers for undo.

  4) Begin a critical section.

  5) Modify the content page, mark the buffer dirty.

  6) Write UNDO, using InsertPreparedUndo().

  7) Associate the undo with the WAL record (RegisterUndoLogBuffers).

  8) Write WAL.

  9) End the critical section.

  But despite reading through the code, including READMEs, I'm doubtful
  that's quite the intended pattern. It REALLY can't be right that one
  needs to parse many function headers to figure out how the most basic
  use of undo could possibly work.  There needs to be very clear
  documentation about how to write undo records. Functions that sound
  like they're involved need actually useful comments, rather than just
  restatements of their function names (cf. RegisterUndoLogBuffers,
  UndoLogBuffersSetLSN, UndoLogRegister).
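
  To make that concrete: here is roughly what I'd expect a caller to
  look like under the pattern presumed above. This is only a sketch -
  the undo function names are the ones from the patchset, but their
  arguments, and the rmgr / record names, are guesses on my part,
  precisely because this isn't spelled out anywhere:

    /* 1) + 2): gather information, estimate number of undo records */
    BeginUndoRecordInsert(&context, UNDO_PERMANENT, nrecords, xlrec);

    /* 3): build each record; the undo layer pins/locks undo buffers */
    for (i = 0; i < nrecords; i++)
        PrepareUndoInsert(&context, &undorecord[i], fxid);

    /* 4) */
    START_CRIT_SECTION();

    /* 5): modify the content page */
    PageSetSomething(page);         /* placeholder for the real change */
    MarkBufferDirty(buffer);

    /* 6): copy the prepared undo records into the undo buffers */
    InsertPreparedUndo(&context);

    /* 7) + 8): register buffers, emit the WAL record */
    XLogBeginInsert();
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
    RegisterUndoLogBuffers(&context, 1);
    recptr = XLogInsert(RM_FOO_ID, XLOG_FOO_DO_STUFF);
    PageSetLSN(page, recptr);
    UndoLogBuffersSetLSN(&context, recptr);     /* presumably */

    /* 9) */
    END_CRIT_SECTION();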


- I think there are two fairly fundamental, and related, problems with
  the sequence outlined above:

  - We can't search for target buffers to store undo data, while holding
    the "table" page content locked. That can involve writing out
    multiple pages till we find a usable victim buffer. That can take a
    pretty long time. While that's happening the content page would
    currently be locked. Note how e.g. heapam.c is careful to not hold
    *any* content locks while potentially performing IO.  I think the
    current interface makes that hard.

    The easy way to solve this would be to require sites performing UNDO
    logging to acquire victim pages before even acquiring other content
    locks. Perhaps a better approach would be for the undo layer to
    hold onto a number of clean buffers, and to keep the last page in an
    already-written-to undo log pinned.

  - We can't search for victim buffers for further undo data while
    already holding other undo pages content locked. Doing so means that
    while we're e.g. doing IO to clean out the new page, old undo data
    on the previous page can't be read.

    This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
    pinning and locking buffers, it'd need to be split into two
    operations: one that acquires and pins buffers, and one that
    locks them.  I think it's quite possible that the locking operation
    could just be delayed until InsertPreparedUndo().  But if we solve
    the above problem, most of this might already be solved.
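
    As a purely hypothetical illustration of that split (the names below
    are invented for this sketch, they're not the patchset's):

      /* Phase 1, before any content locks are taken: may perform IO to
       * find and clean victim buffers, but only pins them. */
      UndoRecPtr PrepareUndoInsertPinned(UndoInsertContext *ctx,
                                         UnpackedUndoRecord *rec);

      /* Phase 2, just before (or inside) InsertPreparedUndo(), once the
       * content page is locked: cheap, no IO, just lock the buffers
       * that phase 1 already pinned. */
      void LockPreparedUndoBuffers(UndoInsertContext *ctx);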


- To me the current split between the packed and unpacked UNDO record
  formats makes very little sense: the reasoning behind having both is
  poorly documented, if at all; it results in extremely verbose code;
  and it isn't extensible.

  When preparing to insert an undo record the in-buffer size is computed
  with UndoRecordHeaderSize() (needs to know about all optional data)
  from within PrepareUndoInsert() (which has a bunch of additional
  knowledge about the record format). Then during insertion,
  InsertPreparedUndo() first copies the UnpackedUndoRecord into an
  UndoPackContext (again needing ...), and then, via InsertUndoData(),
  copies that in small increments into the respective buffers (again
  needing knowledge about the complete record format, two copies
  even). Besides the code duplication, that also means the memory copies
  are very inefficient, because they're all done in tiny increments,
  multiple times.

  When reading undo it's similar: UnpackUndoData(), again in small
  chunks, reads the buffer data into an UndoPackContext (another full
  copy of the unpacked record format). But then FinishUnpackUndo()
  *again* copies all that data into an actual UnpackedUndoRecord
  (again, with a copy of the record format, albeit slightly different
  looking).

  I'm not convinced by Heikki's argument that we shouldn't have
  structure within undo records. In my opinion that is a significant
  weakness of how WAL was initially designed, and even after Heikki's
  work, still is a problem.  But this isn't the right design either.

  Even if we were to stay with the current fixed record format, I think
  the current code needs a substantial redesign:

  - I think 'packing' during insertion needs to serialize into a char*
    allocation during PrepareUndoInsert, computing the size in parallel
    (or perhaps in InsertPreparedUndo, but probably not).  The size of
    the record should never be split across page boundaries
    (i.e. we'll leave that space unused if we otherwise would need to
    split the size).  The actual insertion would then be maximally sized
    memcpy()s (i.e. one memcpy per buffer the record spans, rather than
    one for each sub-type of a record).

    That allows removing most of the duplicated knowledge of the record
    format, and makes insertions faster (by doing only large memcpys
    while holding exclusive content locks).

  - When reading an undo record, the whole stage of UnpackUndoData()
    reading data into the UndoPackContext would be omitted, reading
    directly into the UnpackedUndoRecord. That removes one further copy
    of the record format.

  - To avoid having separate copies of the record format logic, I'd
    probably encode it into *one* array of metadata. If we had
    {offsetof(UnpackedUndoRecord, member),
     membersize(UnpackedUndoRecord, member),
     flag}
    we could fairly trivially remove most knowledge from the places
    currently knowing about the record format.
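
    Sketched out, that could look roughly as follows (illustrative only;
    the member names, flag names and helper macro below are invented,
    not taken from the patchset):

      typedef struct UndoRecordFieldDesc
      {
          Size    offset;         /* offsetof() of the member */
          Size    size;           /* size of that member */
          uint16  presence_flag;  /* flag bit saying the field is present */
      } UndoRecordFieldDesc;

      #define UREC_FIELD(member, flag) \
          {offsetof(UnpackedUndoRecord, member), \
           sizeof(((UnpackedUndoRecord *) 0)->member), (flag)}

      /* the one place that knows the record format */
      static const UndoRecordFieldDesc undo_record_fields[] =
      {
          UREC_FIELD(uur_cid, UREC_INFO_CID),
          UREC_FIELD(uur_fork, UREC_INFO_FORK),
          UREC_FIELD(uur_prevundo, UREC_INFO_PREVUNDO),
          /* ... */
      };

      /* packing becomes a single loop serializing into a char * buffer */
      static Size
      SerializeUndoRecord(const UnpackedUndoRecord *rec, uint16 flags,
                          char *out)
      {
          Size        off = 0;

          for (int i = 0; i < lengthof(undo_record_fields); i++)
          {
              const UndoRecordFieldDesc *f = &undo_record_fields[i];

              if ((flags & f->presence_flag) == 0)
                  continue;
              memcpy(out + off, (const char *) rec + f->offset, f->size);
              off += f->size;
          }
          return off;
      }

    Unpacking would then be the same loop with source and destination
    swapped, and the serialization into a char* from the previous point
    falls out of it as well.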


  I have some vague ideas for how to specify the format in a way that is
  more extensible, but with more structure than just a blob of data. But
  I don't think they're there yet.


- The interface to read undo also doesn't seem right to me. For one,
  there are various different ways to read records, with associated code
  duplication (batch mode, batch in "one page" mode - though I think
  that's being removed now - and single record mode).

  I think the batch mode is too restrictive. We might not need this
  during the first merged version, but I think before long we're going
  to want to be able to efficiently traverse all the undo chains we need
  to determine the visibility of all tuples on a page. Otherwise we'll
  cause a lot of additional synchronous read IO, and will repeatedly
  re-fetch information, especially during sequential scans for an older
  snapshot.  I think I briefly outlined this in an earlier email - my
  current thought is that the batch interface (which the non-batch
  interface should just be a tiny helper around) should basically be a
  queue of "to-be-fetched" undo records. When batch-reading an entire
  transaction, all blocks get put onto that queue. When traversing
  multiple chains, the chains are processed in a breadth-first fashion
  (by just looking at the queue, and pushing additional work to the
  end). That allows prefetch requests to be issued efficiently for
  blocks to be read in the near future.

  I think that batch reading should just copy the underlying data into a
  char* buffer. Only the records that currently are being used by
  higher layers should get exploded into an unpacked record. That will
  reduce memory usage quite noticeably (and I suspect it will also
  drastically reduce the overhead due to a large context with a lot of
  small allocations that then get individually freed).  That will make
  the sorting of undo a bit less CPU efficient, because individual
  records will need to be partially unpacked for comparison, but I think
  that's going to be a far smaller loss than the win.
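
  To sketch what I mean by that queue (purely illustrative again, all
  type and function names below are invented):

    typedef struct UndoFetchRequest
    {
        UndoRecPtr  urp;    /* undo record to fetch */
        void       *arg;    /* caller cookie, e.g. the tuple whose
                             * visibility is being resolved */
    } UndoFetchRequest;

    typedef struct UndoReadQueue
    {
        UndoFetchRequest *requests;  /* FIFO of to-be-fetched records */
        int         head;
        int         tail;
        char       *pages;           /* raw copies of the undo pages read;
                                      * exploded into UnpackedUndoRecords
                                      * only when actually consumed */
    } UndoReadQueue;

    /*
     * Append a record to the queue; if its block isn't queued yet, issue
     * a prefetch request, so the IO overlaps with processing the requests
     * ahead of it.  Chains are traversed breadth-first simply by pushing
     * the next record of a chain onto the tail while consuming the head.
     */
    extern void UndoReadQueuePush(UndoReadQueue *queue,
                                  UndoRecPtr urp, void *arg);

    /* Return the next record in queue order, unpacking only that one. */
    extern UnpackedUndoRecord *UndoReadQueueNext(UndoReadQueue *queue);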


- My reading of the current xact.c integration is that it's not workable
  as is: undo is executed outside of a valid transaction state,
  exceptions aren't properly undone, logic would need to be duplicated
  to a significant degree, and it introduces a new kind of critical
  section.


Greetings,

Andres Freund


