Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: Improve WALRead() to suck data directly from WAL buffers when possible
Date
Msg-id 4132fe48f831ed6f73a9eb191af5fe475384969c.camel@j-davis.com
Whole thread Raw
In response to Re: Improve WALRead() to suck data directly from WAL buffers when possible  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Improve WALRead() to suck data directly from WAL buffers when possible
List pgsql-hackers
On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote:
>
> Almost all the available backend
> page_read callbacks read_local_xlog_page_no_wait,
> read_local_xlog_page, logical_read_xlog_page except XLogPageRead
> (which is used for recovery when WAL buffers aren't used at all) have
> one thing in common, that is, WALRead(). Therefore, it seemed a
> natural choice for me to call XLogReadFromBuffers. In other words,
> I'd
> say it's the responsibility of page_read callback implementers to
> decide if they want to read from WAL buffers or not and hence I don't
> think we need a separate XLogReaderRoutine.

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

I suppose the question is: should reading from the WAL buffers an
intentional thing that the caller does explicitly by specific callers?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now. I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.

> If someone wants to read unflushed WAL, the typical way to implement
> it is to write a new page_read callback
> read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
> similar without WALRead() but just the new function
> XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

>
> I prefer to keep the partial hit handling as-is just
> in case:
>

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)? In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

> Yes, I proposed that idea in another thread -
> https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com
> .
> If that looks okay, I can add it to the next version of this patch
> set.

The code in the email above still shows a call to:

  /*
   * Requested WAL is available in WAL buffers, so recheck the
existence
   * under the WALBufMappingLock and read if the page still exists,
otherwise
   * return.
   */
  LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

  endptr1 = XLogCtl->xlblocks[idx];
  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr1)
      break;

  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_read_this_loop);
  ...
  pg_read_barrier();
  endptr2 = XLogCtl->xlblocks[idx];
  if (expectedEndPtr != endptr2)
      break;

  ntotal += bytes_read_this_loop;
  /* success; move on to next page */

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)
>

>
> FWIW, I found heapam.c using unlikely() extensively for safety
> checks.

OK, I won't object to the use of unlikely(), though I typically don't
use it without a fairly strong reason to think I should override what
the compiler thinks and/or what branch predictors can handle.

In this case, I think some of those errors are not really necessary
anyway, though:

  * XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.
  * Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).
  * An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

Regards,
    Jeff Davis







pgsql-hackers by date:

Previous
From: Jeff Davis
Date:
Subject: Re: [17] Special search_path names "!pg_temp" and "!pg_catalog"
Next
From: Jeff Davis
Date:
Subject: Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }