Re: Improve WALRead() to suck data directly from WAL buffers when possible - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Improve WALRead() to suck data directly from WAL buffers when possible |
Date | |
Msg-id | 4132fe48f831ed6f73a9eb191af5fe475384969c.camel@j-davis.com Whole thread Raw |
In response to | Re: Improve WALRead() to suck data directly from WAL buffers when possible (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: Improve WALRead() to suck data directly from WAL buffers when possible
|
List | pgsql-hackers |
On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote: > > Almost all the available backend > page_read callbacks read_local_xlog_page_no_wait, > read_local_xlog_page, logical_read_xlog_page except XLogPageRead > (which is used for recovery when WAL buffers aren't used at all) have > one thing in common, that is, WALRead(). Therefore, it seemed a > natural choice for me to call XLogReadFromBuffers. In other words, > I'd > say it's the responsibility of page_read callback implementers to > decide if they want to read from WAL buffers or not and hence I don't > think we need a separate XLogReaderRoutine. I think I see what you are saying: WALRead() is at a lower level than the XLogReaderRoutine callbacks, because it's used by the .page_read callbacks. That makes sense, but my first interpretation was that WALRead() is above the XLogReaderRoutine callbacks because it calls .segment_open and .segment_close. To me that sounds like a layering violation, but it exists today without your patch. I suppose the question is: should reading from the WAL buffers an intentional thing that the caller does explicitly by specific callers? Or is it an optimization that should be hidden from the caller? I tend toward the former, at least for now. I suspect that when we do some more interesting things, like replicating unflushed data, we will want reading from buffers to be a separate step, not combined with WALRead(). After things in this area settle a bit then we might want to refactor and combine them again. > If someone wants to read unflushed WAL, the typical way to implement > it is to write a new page_read callback > read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or > similar without WALRead() but just the new function > XLogReadFromBuffers to read from WAL buffers and return. Then why is it being called from WALRead() at all? > > I prefer to keep the partial hit handling as-is just > in case: > So a "partial hit" is essentially a narrow race condition where one page is read from buffers, and it's valid; and by the time it gets to the next page, it has already been evicted (along with the previously read page)? In other words, I think you are describing a case where eviction is happening faster than the memcpy()s in a loop, which is certainly possible due to scheduling or whatever, but doesn't seem like the common case. The case I'd expect for a partial read is when the startptr points to an evicted page, but some later pages in the requested range are still present in the buffers. I'm not really sure whether either of these cases matters, but if we implement one and not the other, there should be some explanation. > Yes, I proposed that idea in another thread - > https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com > . > If that looks okay, I can add it to the next version of this patch > set. The code in the email above still shows a call to: /* * Requested WAL is available in WAL buffers, so recheck the existence * under the WALBufMappingLock and read if the page still exists, otherwise * return. */ LWLockAcquire(WALBufMappingLock, LW_SHARED); and I don't think that's required. How about something like: endptr1 = XLogCtl->xlblocks[idx]; /* Requested WAL isn't available in WAL buffers. */ if (expectedEndPtr != endptr1) break; pg_read_barrier(); ... memcpy(buf, data, bytes_read_this_loop); ... pg_read_barrier(); endptr2 = XLogCtl->xlblocks[idx]; if (expectedEndPtr != endptr2) break; ntotal += bytes_read_this_loop; /* success; move on to next page */ I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64(). I suppose because xlblocks are not guaranteed to be 64-bit aligned? Should we just align it to 64 bits so we can use atomics? (I don't think it matters in this case, but atomics would avoid the need to think about it.) > > > FWIW, I found heapam.c using unlikely() extensively for safety > checks. OK, I won't object to the use of unlikely(), though I typically don't use it without a fairly strong reason to think I should override what the compiler thinks and/or what branch predictors can handle. In this case, I think some of those errors are not really necessary anyway, though: * XLogReadFromBuffers shouldn't take a timeline argument just to demand that it's always equal to the wal insertion timeline. * Why check that startptr is earlier than the flush pointer, but not startptr+count? Also, given that we intend to support reading unflushed data, it would be good to comment that the function still works past the flush pointer, and that it will be safe to remove later (right?). * An "Assert(!RecoveryInProgress())" would be more appropriate than an error. Perhaps we will remove even that check in the future to achieve cascaded replication of unflushed data. Regards, Jeff Davis
pgsql-hackers by date: