Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers

From Melanie Plageman
Subject Re: Avoiding smgrimmedsync() during nbtree index builds
Date
Msg-id CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax+ckjig@mail.gmail.com
Whole thread Raw
In response to Re: Avoiding smgrimmedsync() during nbtree index builds  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Avoiding smgrimmedsync() during nbtree index builds  (Andres Freund <andres@anarazel.de>)
Re: Avoiding smgrimmedsync() during nbtree index builds  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> Rebased to appease cfbot.
>
> I ran these paches under a branch which shows code coverage in cirrus.  It
> looks good to my eyes.
> https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html

thanks for doing that and for the rebase! since I made updates, the
attached version 6 is also rebased.

To Dmitry's question:

On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote:
>
> I don't see it in the discussion, so naturally curious -- why directmgr
> is not used for bloom index, e.g. in blbuildempty?

thanks for pointing this out. blbuildempty() is also included now. bloom
doesn't seem to use smgr* anywhere except blbuildempty(), so that is the
only place I made changes in bloom index build.

v6 has the following updates/changes:

- removed erroneous extra calls to unbuffered_prep() and
  unbuffered_finish() in GiST and btree index builds

- removed unnecessary includes

- some comments were updated to accurately reflect use of directmgr

- smgrwrite doesn't WAL-log anymore. This one I'm not sure about. I
  think it makes sense that unbuffered_extend() of non-empty pages of
  WAL-logged relations (or the init fork of unlogged relations) do
  log_newpage(), but I wasn't so sure about smgrwrite().

  Currently all callers of smgrwrite() do log_newpage() and anyone using
  mdwrite() will end up writing the whole page. However, it seems
  possible that a caller of the directmgr API might wish to do a write
  to a particular offset and, either because of that, or, for some other
  reason, require different logging than that done in log_newpage().

  I didn't want to make a separate wrapper function for WAL-logging in
  directmgr because it felt like one more step to forget.

- heapam_relation_set_new_filenode doesn't use directmgr API anymore for
  unlogged relations. It doesn't extend or write, so it seemed like a
  special case better left alone.

  Note that the ambuildempty() functions which also write to the init
  fork of an unlogged relation still use the directmgr API. It is a bit
  confusing because they pass do_wal=true to unbuffered_prep() even
  though they are unlogged relations because they must log and fsync.

- interface changes to unbuffered_prep()
  I removed the parameters to unbuffered_prep() which required the user
  to specify if fsync should be added to pendingOps or done with
  smgrimmedsync(). Understanding all of the combinations of these
  parameters and when they were needed was confusing and the interface
  felt like a foot gun. Special cases shouldn't use this interface.

  I prefer the idea that users of this API expect that
  1) empty pages won't be checksummed or WAL logged
  2) for relations that are WAL-logged, when the build is done, the
  relation will be fsync'd by the backend (unless the fsync optimization
  is being used)
  3) the only case in which fsync requests are added to the pendingOps
  queue is when the fsync optimization is being used (which saves the
  redo pointer and check it later to determine if it needs to fsync
  itself)

  I also added the parameter "do_wal" to unbuffered_prep() and the
  UnBufferedWriteState struct. This is used when extending the file to
  determine whether or not to log_newpage(). unbuffered_extend() and
  unbuffered_write() no longer take do_wal as a parameter.

  Note that callers need to pass do_wal=true/false to unbuffered_prep()
  based on whether or not they want log_newpage() called during
  unbuffered_extend()--not simply based on whether or not the relation
  in question is WAL-logged.

  do_wal is the only member of the UnBufferedWriteState struct in the
  first patch in the set, but I think it makes sense to keep the struct
  around since I anticipate that the patch containing the other members
  needed for the fsync optimization will be committed at the same time.

  One final note on unbuffered_prep() -- I am thinking of renaming
  "do_optimization" to "try_optimization" or maybe
  "request_fsync_optimization". The interface (of unbuffered_prep())
  would be better if we always tried to do the optimization when
  relevant (when the relation is WAL-logged).

- freespace map, visimap, and hash index don't use directmgr API anymore
  For most use cases writing/extending outside shared buffers, it isn't
  safe to rely on requesting fsync from checkpointer.

  visimap, fsm, and hash index all request fsync from checkpointer for
  the pages they write with smgrextend() and don't smgrimmedsync() when
  finished adding pages (even when the hash index is WAL-logged).

  Supporting these exceptions made the interface incoherent, so I cut
  them.

- added unbuffered_extend_range()
  This one is a bit unfortunate. Because GiST index build uses
  log_newpages() to log a whole page range but calls smgrextend() for
  each of those pages individually, I couldn't use the
  unbuffered_extend() function easily.

  So, I thought it might be useful in other contexts as well to have a
  function which calls smgrextend() for a range of pages and then calls
  log_newpages(). I've added this.

  However, there are two parts of GiST index build flush ready pages
  that didn't work with this either.

  The first is that it does an error check on the block numbers one at a
  time while looping through them to write the pages. To retain this
  check, I loop through the ready pages an extra time before calling
  unbuffered_extend(), which is probably not acceptable.

  Also, GiST needs to use a custom LSN for the pages. To achieve this, I
  added a "custom_lsn" parameter to unbuffered_extend_range(). This
  isn't great either. If this was a more common case, I could pass the
  custom LSN to unbuffered_prep().

- Melanie

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b
Next
From: Andres Freund
Date:
Subject: Re: Regression tests failures on Windows Server 2019 - on master at commit # d816f366b