Thread: Attempt to consolidate reading of XLOG page
While working on the instance encryption I found it annoying to apply decyption of XLOG page to three different functions. Attached is a patch that tries to merge them all into one function, XLogRead(). The existing implementations differ in the way new segment is opened. So I added a pointer to callback function as a new argument. This callback handles the specific ways to determine segment file name and to open the file. I can split the patch into multiple diffs to make detailed review easier, but first I'd like to hear if anything is seriously wrong about this design. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Hello. At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net> > While working on the instance encryption I found it annoying to apply > decyption of XLOG page to three different functions. Attached is a patch that > tries to merge them all into one function, XLogRead(). The existing > implementations differ in the way new segment is opened. So I added a pointer > to callback function as a new argument. This callback handles the specific > ways to determine segment file name and to open the file. > > I can split the patch into multiple diffs to make detailed review easier, but > first I'd like to hear if anything is seriously wrong about this > design. Thanks. This patch changes XLogRead to allow using other than BasicOpenFile to open a segment, and use XLogReaderState.private to hold a new struct XLogReadPos for the segment reader. The new struct is heavily duplicated with XLogReaderState and I'm not sure the rason why the XLogReadPos is needed. Anyway, in the first place, such two distinct-but-highly-related callbacks makes things too complex. Heikki said that the log reader stuff is better not using callbacks and I agree to that. I did that once for my own but the code is no longer applicable. But it seems to be the time to do that. https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi That would seems like follows. That refactoring separates log reader and page reader. for(;;) { rc = XLogReadRecord(reader, startptr, errormsg); if (rc == XLREAD_SUCCESS) { /* great, got record */ } if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD) { elog(ERROR, "invalid record"); } if (rc == XLREAD_NEED_DATA) { /* * Read a page from disk, and place it into reader->readBuf */ XLogPageRead(reader->readPagePtr, /* page to read */ reader->reqLen /* # of bytes to read */ ); /* * Now that we have read the data that XLogReadRecord() * requested, call it again. */ continue; } } DecodingContextFindStartpoint(ctx) do { read_local_xlog_page(....); rc = XLogReadRecord (reader); while (rc == XLREAD_NEED_DATA); I'm going to do that again. Any other opinions, or thoughts? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote: > This patch changes XLogRead to allow using other than > BasicOpenFile to open a segment, and use XLogReaderState.private > to hold a new struct XLogReadPos for the segment reader. The new > struct is heavily duplicated with XLogReaderState and I'm not > sure the rason why the XLogReadPos is needed. > Any other opinions, or thoughts? The focus is on the stability of v12 for the next couple of months, so please make sure to register it to the next CF if you want feedback. Here are some basic thoughts after a very quick lookup. +/* + * Position in XLOG file while reading it. + */ +typedef struct XLogReadPos +{ + int segFile; /* segment file descriptor */ + XLogSegNo segNo; /* segment number */ + uint32 segOff; /* offset in the segment */ + TimeLineID tli; /* timeline ID of the currently open file */ + + char *dir; /* directory (only needed by frontends) */ +} XLogReadPos; Not sure if there is any point to split that with the XLOG reader status. +static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2); + +static void +fatal_error(const char *fmt,...) In this more confusion accumulates with something which has enough warts on HEAD when it comes to declare locally equivalents to the elog() for src/common/. +/* + * This is a front-end counterpart of XLogFileNameP. + */ +static char * +XLogFileNameFE(TimeLineID tli, XLogSegNo segno) +{ + char *result = palloc(MAXFNAMELEN); + + XLogFileName(result, tli, segno, WalSegSz); + return result; +} We could use a pointer to an allocated area. Or even better, just a static variable as this just gets used for error messages to store temporarily the segment name in a routine part of perhaps xlogreader.c. -- Michael
Attachment
On Fri, Apr 12, 2019 at 2:06 AM Antonin Houska <ah@cybertec.at> wrote:
While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.
I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.
I didn't check the code, but it is good to combine all the 3 page read functions
into one instead of spreading the logic.
Regards,
Haribabu Kommi
Fujitsu Australia
On 2019-Apr-11, Antonin Houska wrote: > While working on the instance encryption I found it annoying to apply > decyption of XLOG page to three different functions. Attached is a patch that > tries to merge them all into one function, XLogRead(). The existing > implementations differ in the way new segment is opened. So I added a pointer > to callback function as a new argument. This callback handles the specific > ways to determine segment file name and to open the file. > > I can split the patch into multiple diffs to make detailed review easier, but > first I'd like to hear if anything is seriously wrong about this > design. Thanks. I agree that xlog reading is pretty messy. I think ifdef'ing the way XLogRead reports errors is not great. Maybe we can pass a function pointer that is to be called in case of errors? Not sure about the walsize; maybe it can be a member in XLogReadPos, and given to XLogReadInitPos()? (Maybe rename XLogReadPos as XLogReadContext or something like that, indicating it's not just the read position.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello. > > At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net> > > While working on the instance encryption I found it annoying to apply > > decyption of XLOG page to three different functions. Attached is a patch that > > tries to merge them all into one function, XLogRead(). The existing > > implementations differ in the way new segment is opened. So I added a pointer > > to callback function as a new argument. This callback handles the specific > > ways to determine segment file name and to open the file. > > > > I can split the patch into multiple diffs to make detailed review easier, but > > first I'd like to hear if anything is seriously wrong about this > > design. Thanks. > > This patch changes XLogRead to allow using other than > BasicOpenFile to open a segment, Good point. The acceptable ways to open file on both frontend and backend side need to be documented. > and use XLogReaderState.private to hold a new struct XLogReadPos for the > segment reader. The new struct is heavily duplicated with XLogReaderState > and I'm not sure the rason why the XLogReadPos is needed. ok, I missed the fact that XLogReaderState already contains most of the info that I put into XLogReadPos. So XLogReadPos is not needed. > Anyway, in the first place, such two distinct-but-highly-related > callbacks makes things too complex. Heikki said that the log > reader stuff is better not using callbacks and I agree to that. I > did that once for my own but the code is no longer > applicable. But it seems to be the time to do that. > > https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi Thanks for the link. My understanding is that the drawback of the XLogReaderState.read_page callback is that it cannot easily switch between XLOG sources in order to handle failure because the caller of XLogReadRecord() usually controls those sources too. However the callback I pass to XLogRead() is different: if it fails, it simply raises ERROR. Since this indicates rather low-level problem, there's no reason for this callback to try to recover from the failure. > That would seems like follows. That refactoring separates log > reader and page reader. > > > for(;;) > { > rc = XLogReadRecord(reader, startptr, errormsg); > > if (rc == XLREAD_SUCCESS) > { > /* great, got record */ > } > if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD) > { > elog(ERROR, "invalid record"); > } > if (rc == XLREAD_NEED_DATA) > { > /* > * Read a page from disk, and place it into reader->readBuf > */ > XLogPageRead(reader->readPagePtr, /* page to read */ > reader->reqLen /* # of bytes to read */ ); > /* > * Now that we have read the data that XLogReadRecord() > * requested, call it again. > */ > continue; > } > } > > DecodingContextFindStartpoint(ctx) > do > { > read_local_xlog_page(....); > rc = XLogReadRecord (reader); > while (rc == XLREAD_NEED_DATA); > > I'm going to do that again. > > > Any other opinions, or thoughts? I don't see an overlap between what you do and what I do. It seems that even if you change the XLOG reader API, you don't care what read_local_xlog_page() does internally. What I try to fix is XLogRead(), and that is actually a subroutine of read_local_xlog_page(). -- Antonin Houska Web: https://www.cybertec-postgresql.com
Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote: > > This patch changes XLogRead to allow using other than > > BasicOpenFile to open a segment, and use XLogReaderState.private > > to hold a new struct XLogReadPos for the segment reader. The new > > struct is heavily duplicated with XLogReaderState and I'm not > > sure the rason why the XLogReadPos is needed. > > Any other opinions, or thoughts? > > The focus is on the stability of v12 for the next couple of months, so > please make sure to register it to the next CF if you want feedback. ok, will do. (A link to mailing list is needed for the CF entry, so I had to post something anyway :-) Since I don't introduce any kind of "cool new feature" here, I believe it did not disturb much.) > Here are some basic thoughts after a very quick lookup. > ... Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I agree that xlog reading is pretty messy. > > I think ifdef'ing the way XLogRead reports errors is not great. Maybe > we can pass a function pointer that is to be called in case of errors? I'll try a bit harder to evaluate the existing approaches to report the same error on both backend and frontend side. > Not sure about the walsize; maybe it can be a member in XLogReadPos, and > given to XLogReadInitPos()? (Maybe rename XLogReadPos as > XLogReadContext or something like that, indicating it's not just the > read position.) As pointed out by others, XLogReadPos is not necessary. So if XLogRead() receives XLogReaderState instead, it can get the segment size from there. Thanks. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > Not sure about the walsize; maybe it can be a member in XLogReadPos, and > > given to XLogReadInitPos()? (Maybe rename XLogReadPos as > > XLogReadContext or something like that, indicating it's not just the > > read position.) > > As pointed out by others, XLogReadPos is not necessary. So if XLogRead() > receives XLogReaderState instead, it can get the segment size from there. Eventually I found out that it's good to have a separate structure for the read position because walsender calls the XLogRead() function directly, not via the XLOG reader. Currently the structure name is XLogSegment (maybe someone can propose better name) and it's a member of XLogReaderState. No field of the new structure is duplicated now. The next version of the patch is attached. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote: > The next version of the patch is attached. I don't think any of this looks acceptable: +#ifndef FRONTEND +/* + * Backend should have wal_segment_size variable initialized, segsize is not + * used. + */ +#define XLogFileNameCommon(tli, num, segsize) XLogFileNameP((tli), (num)) +#define xlr_error(...) ereport(ERROR, (errcode_for_file_access(), errmsg(__VA_ARGS__))) +#else +static char xlr_error_msg[MAXFNAMELEN]; +#define XLogFileNameCommon(tli, num, segsize) (XLogFileName(xlr_error_msg, (tli), (num), (segsize)),\ + xlr_error_msg) +#include "fe_utils/logging.h" +/* + * Frontend application (currently only pg_waldump.c) cannot catch and further + * process errors, so they simply treat them as fatal. + */ +#define xlr_error(...) do {pg_log_fatal(__VA_ARGS__); exit(EXIT_FAILURE); } while(0) +#endif The backend part doesn't look OK because depending on the value of a global variable instead of getting the information via parameters seems like a step backward. The frontend part doesn't look OK because it locks every application that uses the xlogreader stuff into using pg_log_fatal when an error occurs, which may not be what everybody wants to do. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-May-06, Robert Haas wrote: > On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote: > > The next version of the patch is attached. > > I don't think any of this looks acceptable: I agree. I inteded to suggest upthread to pass an additional argument to XLogRead, which is a function that takes a message string and SQLSTATE; in backend, the function does errstart / errstate / errmsg / errfinish, and in frontend programs it does pg_log_fatal (and ignores sqlstate). The message must be sprintf'ed and translated by XLogRead. (xlogreader.c could itself provide a default error reporting callback, at least for frontend, to avoid repeating the code). That way, if a different frontend program wants to do something different, it's fairly easy to pass a different function pointer. BTW, having frontend's XLogFileNameCommon use a totally unrelated variable for its printing is naughty. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 6, 2019 at 2:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-May-06, Robert Haas wrote: > > On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote: > > > The next version of the patch is attached. > > > > I don't think any of this looks acceptable: > > I agree. I inteded to suggest upthread to pass an additional argument > to XLogRead, which is a function that takes a message string and > SQLSTATE; in backend, the function does errstart / errstate / errmsg / > errfinish, and in frontend programs it does pg_log_fatal (and ignores > sqlstate). The message must be sprintf'ed and translated by XLogRead. > (xlogreader.c could itself provide a default error reporting callback, > at least for frontend, to avoid repeating the code). That way, if a > different frontend program wants to do something different, it's fairly > easy to pass a different function pointer. It seems to me that it's better to unwind the stack i.e. have the function return the error information to the caller and let the caller do as it likes. The other thread to which Horiguchi-san referred earlier in this thread seems to me to have basically concluded that the XLogPageReadCB callback to XLogReaderAllocate is a pain to use because it doesn't unwind the stack, and work is under way over there to get rid of that callback for just that reason. Adding a new callback for error-reporting would just be creating a new instance of the same issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > It seems to me that it's better to unwind the stack i.e. have the > function return the error information to the caller and let the caller > do as it likes. Thanks for a hint. The next version tries to do that. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > > It seems to me that it's better to unwind the stack i.e. have the > > function return the error information to the caller and let the caller > > do as it likes. > > Thanks for a hint. The next version tries to do that. Hi Antonin, Could you please send a fresh rebase for the new Commitfest? Thanks, -- Thomas Munro https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> wrote: > On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote: > > Robert Haas <robertmhaas@gmail.com> wrote: > > > It seems to me that it's better to unwind the stack i.e. have the > > > function return the error information to the caller and let the caller > > > do as it likes. > > > > Thanks for a hint. The next version tries to do that. > > Hi Antonin, > > Could you please send a fresh rebase for the new Commitfest? Rebased. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
Hi Antonin, could you please rebase again? Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Pushed 0001. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Hi Antonin, could you please rebase again? Attached. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
I was confused by the struct name XLogSegment -- the struct is used to represent a WAL segment while it's kept open, rather than just a WAL segment in abstract. Also, now that we've renamed everything to use the term WAL, it seems wrong to use the name XLog for new structs. I propose the name WALOpenSegment for the struct, which solves both problems. (Its initializer function would get the name WALOpenSegmentInit.) Now, the patch introduces a callback for XLogRead, the type of which is called XLogOpenSegment. If we rename it from XLog to WAL, both names end up the same. I propose to rename the function type to WALSegmentOpen, which in a "noun-verb" view of the world, represents the action of opening a WAL segment. I attach a patch for all this renaming, on top of your series. I wonder if each of those WALSegmentOpen callbacks should reset [at least some members of] the struct; they're already in charge of setting ->file, and apparently we're leaving the responsibility of setting the rest of the members to XLogRead. That seems weird. Maybe we should say that the CB should only open the segment and not touch the struct at all and XLogRead is in charge of everything. Perhaps the other way around -- the CB should set everything correctly ... I'm not sure which is best. But having half here and half there seems a recipe for confusion and bugs. Another thing I didn't like much is that everything seems to assume that the only error possible from XLogRead is a read error. Maybe that's okay, because it seems to be the current reality, but it seemed odd. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I was confused by the struct name XLogSegment -- the struct is used to > represent a WAL segment while it's kept open, rather than just a WAL > segment in abstract. Also, now that we've renamed everything to use the > term WAL, it seems wrong to use the name XLog for new structs. I > propose the name WALOpenSegment for the struct, which solves both > problems. (Its initializer function would get the name > WALOpenSegmentInit.) > > Now, the patch introduces a callback for XLogRead, the type of which is > called XLogOpenSegment. If we rename it from XLog to WAL, both names > end up the same. I propose to rename the function type to > WALSegmentOpen, which in a "noun-verb" view of the world, represents the > action of opening a WAL segment. > > I attach a patch for all this renaming, on top of your series. ok, thanks. In addition I renamed WalSndOpenSegment() to WalSndSegmentOpen() and read_local_xlog_page_open_segment() to read_local_xlog_page_segment_open() > I wonder if each of those WALSegmentOpen callbacks should reset [at > least some members of] the struct; they're already in charge of setting > ->file, and apparently we're leaving the responsibility of setting the > rest of the members to XLogRead. That seems weird. Maybe we should say > that the CB should only open the segment and not touch the struct at all > and XLogRead is in charge of everything. Perhaps the other way around > -- the CB should set everything correctly ... I'm not sure which is > best. But having half here and half there seems a recipe for confusion > and bugs. ok, I've changed the CB signature. Now it receives poiners to the two variables that it can change while the "seg" argument is documented as read-only. To indicate that the CB should determine timeline itself, I introduced a new constant InvalidTimeLineID, see the 0004 part. > Another thing I didn't like much is that everything seems to assume that > the only error possible from XLogRead is a read error. Maybe that's > okay, because it seems to be the current reality, but it seemed odd. In this case I only moved the ereport() code from XLogRead() away (so that both backend and frontend can call the function). Given that the code to open WAL segment is now in the callbacks, the only thing that XLogRead() can ereport is that read() failed. BTW, I introduced one more structure, XLogReadError, in this patch version. I think it's better than adding error-specific fields to the WALOpenSegment structure. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
I spent a couple of hours on this patchset today. I merged 0001 and 0002, and decided the result was still messier than I would have liked, so I played with it a bit more -- see attached. I think this is committable, but I'm afraid it'll cause quite a few conflicts with the rest of your series. I had two gripes, which I feel solved with my changes: 1. I didn't like that "dir" and "wal segment size" were part of the "currently open segment" supporting struct. It seemed that those two were slightly higher-level, since they apply to every segment that's going to be opened, not just the current one. My first thought was to put those as members of XLogReaderState, but that doesn't work because the physical walsender.c code does not use xlogreader at all, even though it is reading WAL. Anyway my solution was to create yet another struct, which for everything that uses xlogreader is just part of that state struct; and for walsender, it's just a separate one alongside sendSeg. All in all, this seems pretty clean. 2. Having the wal dir be #ifdef FRONTEND seemed out of place. I know the backend code does not use that, but eliding it is more "noisy" than just setting it to NULL. Also, the "Finalize the segment pointer" thingy seemed out of place. So my code passes the dir as an argument to XLogReaderAllocate, and if it's null then we just don't allocate it. Everybody else can use it to guide things. This results in cleaner code, because we don't have to handle it externally, which was causing quite some pain to pg_waldump. Note that ws_dir member is a char array in the struct, not just a pointer. This saves trouble trying to allocate it (I mainly did it this way because we don't have pstrdup_extended(MCXT_ALLOC_NO_OOM) ... yes, this could be made with palloc+snprintf, but eh, that doesn't seem worth the trouble.) Separately from those two API-wise points, there was one bug which meant that with your 0002+0003 the recovery tests did not pass -- code placement bug. I suppose the bug disappears with later patches in your series, which probably is why you didn't notice. This is the fix for that: - XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr, - state->seg.tli = pageTLI; + state->seg.ws_tli = pageTLI; + XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr, XLOG_BLCKSZ); ... Also, yes, I renamed all the struct members. If you don't have any strong dislikes for these changes, I'll push this part and let you rebase the remains on top. Regarding the other patches: 1. I think trying to do palloc(XLogReadError) is a bad idea ... for example, if the read fails because of system pressure, we might return "out of memory" during that palloc instead of the real read error. This particular problem you could forestall by changing to ErrorContext, but I have the impression that it might be better to have the error struct by stack-allocated in the caller stack. This forces you to limit the message string to a maximum size (say 128 bytes or maybe even 1000 bytes like MAX_ERRORMSG_LEN) but I don't have a problem with that. 2. Not a fan of the InvalidTimeLineID stuff offhand. Maybe it's okay ... not convinced yet either way. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Sep-23, Alvaro Herrera wrote: > I spent a couple of hours on this patchset today. I merged 0001 and > 0002, and decided the result was still messier than I would have liked, > so I played with it a bit more -- see attached. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I spent a couple of hours on this patchset today. I merged 0001 and > 0002, and decided the result was still messier than I would have liked, > so I played with it a bit more -- see attached. I think this is > committable, but I'm afraid it'll cause quite a few conflicts with the > rest of your series. > > I had two gripes, which I feel solved with my changes: > > 1. I didn't like that "dir" and "wal segment size" were part of the > "currently open segment" supporting struct. It seemed that those two > were slightly higher-level, since they apply to every segment that's > going to be opened, not just the current one. ok > My first thought was to put those as members of XLogReaderState, but > that doesn't work because the physical walsender.c code does not use > xlogreader at all, even though it is reading WAL. `I don't remember clearly but I think that this was the reason I tried to move "wal_segment_size" away from XLogReaderState. > Separately from those two API-wise points, there was one bug which meant > that with your 0002+0003 the recovery tests did not pass -- code > placement bug. I suppose the bug disappears with later patches in your > series, which probably is why you didn't notice. This is the fix for that: > > - XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr, > - state->seg.tli = pageTLI; > + state->seg.ws_tli = pageTLI; > + XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr, > XLOG_BLCKSZ); > Yes, it seems so - the following parts ensure that XLogRead() adjusts the timeline itself. I only checked that the each part of the series keeps the source tree compilable. Thanks for fixing. > ... Also, yes, I renamed all the struct members. > > > If you don't have any strong dislikes for these changes, I'll push this > part and let you rebase the remains on top. No objections here. > 2. Not a fan of the InvalidTimeLineID stuff offhand. Maybe it's okay ... > not convinced yet either way. Well, it seems that the individual callbacks only use this constant in Assert() statements. I'll consider if we really need it. The argument value should not determine whether the callback derives the TLI or not. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2019-Sep-24, Antonin Houska wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > If you don't have any strong dislikes for these changes, I'll push this > > part and let you rebase the remains on top. > > No objections here. oK, pushed. Please rebase the other parts. I made one small adjustment: in read_local_xlog_page() there was one *readTLI output parameter that was being changed to a local variable plus later assigment to the output struct member; I changed the code to continue to assign directly to the output variable instead. There was an error case in which the TLI was not assigned to; I suppose this doesn't really change things (we don't examine the TLI in that case, do we?), but it seemed dangerous to leave like that. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Sep-24, Antonin Houska wrote: > > > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > If you don't have any strong dislikes for these changes, I'll push this > > > part and let you rebase the remains on top. > > > > No objections here. > > oK, pushed. Please rebase the other parts. Thanks! > I made one small adjustment: in read_local_xlog_page() there was one > *readTLI output parameter that was being changed to a local variable > plus later assigment to the output struct member; I changed the code to > continue to assign directly to the output variable instead. There was > an error case in which the TLI was not assigned to; I suppose this > doesn't really change things (we don't examine the TLI in that case, do > we?), but it seemed dangerous to leave like that. I used the local variable to make some expressions simpler, but missed the fact that this way I can leave the ws_tli field unassigned if the function returns prematurely. Now that I look closer, I see that it can be a problem - in the case of ERROR, XLogReadRecord() does reset the state, but it does not reset the TLI: err: /* * Invalidate the read state. We might read from a different source after * failure. */ XLogReaderInvalReadState(state); Thus the TLI appears to be important even on ERROR, and what you've done is correct. Thanks for fixing that. One comment on the remaining part of the series: Before this refactoring, the walsender.c:XLogRead() function contained these lines /* * After reading into the buffer, check that what we read was valid. We do * this after reading, because even though the segment was present when we * opened it, it might get recycled or removed while we read it. The * read() succeeds in that case, but the data we tried to read might * already have been overwritten with new WAL records. */ XLByteToSeg(startptr, segno, segcxt->ws_segsize); CheckXLogRemoved(segno, ThisTimeLineID); but they don't fit into the new, generic implementation, so I copied these lines to the two places right after the call of the new XLogRead(). However I was not sure if ThisTimeLineID was ever correct here. It seems the original walsender.c:XLogRead() implementation did not update ThisTimeLineID (and therefore neither the new callback WalSndSegmentOpen() does), so both logical_read_xlog_page() and XLogSendPhysical() could read the data from another (historic) timeline. I think we should check the segment we really read data from: CheckXLogRemoved(segno, sendSeg->ws_tli); The rebased code is attached. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On 2019-Sep-26, Antonin Houska wrote: > One comment on the remaining part of the series: > > Before this refactoring, the walsender.c:XLogRead() function contained these > lines > > /* > * After reading into the buffer, check that what we read was valid. We do > * this after reading, because even though the segment was present when we > * opened it, it might get recycled or removed while we read it. The > * read() succeeds in that case, but the data we tried to read might > * already have been overwritten with new WAL records. > */ > XLByteToSeg(startptr, segno, segcxt->ws_segsize); > CheckXLogRemoved(segno, ThisTimeLineID); > > but they don't fit into the new, generic implementation, so I copied these > lines to the two places right after the call of the new XLogRead(). However I > was not sure if ThisTimeLineID was ever correct here. It seems the original > walsender.c:XLogRead() implementation did not update ThisTimeLineID (and > therefore neither the new callback WalSndSegmentOpen() does), so both > logical_read_xlog_page() and XLogSendPhysical() could read the data from > another (historic) timeline. I think we should check the segment we really > read data from: > > CheckXLogRemoved(segno, sendSeg->ws_tli); Hmm, okay. I hope we can get rid of ThisTimeLineID one day. You placed the errinfo in XLogRead's stack rather than its callers' ... I don't think that works, because as soon as XLogRead returns that memory is no longer guaranteed to exist. You need to allocate the struct in the callers stacks and pass its address to XLogRead. XLogRead can return NULL if everything's okay or the pointer to the errinfo struct. I've been wondering if it's really necessary to pass 'seg' to the openSegment() callback. Only walsender wants that, and it seems ... weird. Maybe that's not something for this patch series to fix, but it would be good to find a more decent way to do the TLI switch at some point. > + /* > + * If the function is called by the XLOG reader, the reader will > + * eventually set both "ws_segno" and "ws_off", however the XLOG > + * reader is not necessarily involved. Furthermore, we need to set > + * the current values for this function to work. > + */ > + seg->ws_segno = nextSegNo; > + seg->ws_off = 0; Why do we leave this responsibility to ReadPageInternal? Wouldn't it make more sense to leave XLogRead be always responsible for setting these correctly, and remove those lines from ReadPageInternal? (BTW "is called by the XLOG reader" is a bit strange in code that appears in xlogreader.c). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Sep-26, Antonin Houska wrote: > You placed the errinfo in XLogRead's stack rather than its callers' ... > I don't think that works, because as soon as XLogRead returns that > memory is no longer guaranteed to exist. I was aware of this problem, therefore I defined the field as static: +XLogReadError * +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p, + WALOpenSegment *seg, WALSegmentContext *segcxt, + WALSegmentOpen openSegment) +{ + char *p; + XLogRecPtr recptr; + Size nbytes; + static XLogReadError errinfo; > You need to allocate the struct in the callers stacks and pass its address > to XLogRead. XLogRead can return NULL if everything's okay or the pointer > to the errinfo struct. I didn't choose this approach because that would add one more argument to the function. > I've been wondering if it's really necessary to pass 'seg' to the > openSegment() callback. Only walsender wants that, and it seems ... > weird. Maybe that's not something for this patch series to fix, but it > would be good to find a more decent way to do the TLI switch at some > point. Good point. Since walsender.c already has the "sendSeg" global variable, maybe we can let WalSndSegmentOpen() use this one, and remove the "seg" argument from the callback. > > + /* > > + * If the function is called by the XLOG reader, the reader will > > + * eventually set both "ws_segno" and "ws_off", however the XLOG > > + * reader is not necessarily involved. Furthermore, we need to set > > + * the current values for this function to work. > > + */ > > + seg->ws_segno = nextSegNo; > > + seg->ws_off = 0; > > Why do we leave this responsibility to ReadPageInternal? Wouldn't it > make more sense to leave XLogRead be always responsible for setting > these correctly, and remove those lines from ReadPageInternal? I think there's no rule that ReadPageInternal() must use XLogRead(). If we do what you suggest, we need make this responsibility documented. I'll consider that. > (BTW "is called by the XLOG reader" is a bit strange in code that appears in > xlogreader.c). ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if we'll eventually need this phrase in the comment at all. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2019-Sep-27, Antonin Houska wrote: > > You placed the errinfo in XLogRead's stack rather than its callers' ... > > I don't think that works, because as soon as XLogRead returns that > > memory is no longer guaranteed to exist. > > I was aware of this problem, therefore I defined the field as static: > > +XLogReadError * > +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p, > + WALOpenSegment *seg, WALSegmentContext *segcxt, > + WALSegmentOpen openSegment) > +{ > + char *p; > + XLogRecPtr recptr; > + Size nbytes; > + static XLogReadError errinfo; I see. > > You need to allocate the struct in the callers stacks and pass its address > > to XLogRead. XLogRead can return NULL if everything's okay or the pointer > > to the errinfo struct. > > I didn't choose this approach because that would add one more argument to the > function. Yeah, the signature does seem a bit unwieldy. But I wonder if that's too terrible a problem, considering that this code is incurring a bunch of syscalls in the best case anyway. BTW that tli_p business to the openSegment callback is horribly inconsistent. Some callers accept a NULL tli_p, others will outright crash, even though the API docs say that the callback must determine the timeline. This is made more complicated by us having the TLI in "seg" also. Unless I misread, the problem is again that the walsender code is doing nasty stuff with globals (endSegNo). As a very minor stylistic point, we prefer to have out params at the end of the signature. > > Why do we leave this responsibility to ReadPageInternal? Wouldn't it > > make more sense to leave XLogRead be always responsible for setting > > these correctly, and remove those lines from ReadPageInternal? > > I think there's no rule that ReadPageInternal() must use XLogRead(). If we do > what you suggest, we need make this responsibility documented. I'll consider > that. Hmm. Thanks. > > (BTW "is called by the XLOG reader" is a bit strange in code that appears in > > xlogreader.c). > > ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if > we'll eventually need this phrase in the comment at all. I think that would be slightly clearer. But if we can force this code into actually making sense, that would be much better. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Sep-27, Antonin Houska wrote: >>> You placed the errinfo in XLogRead's stack rather than its callers' ... >>> I don't think that works, because as soon as XLogRead returns that >>> memory is no longer guaranteed to exist. >> I was aware of this problem, therefore I defined the field as static: >> >> +XLogReadError * >> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p, >> + WALOpenSegment *seg, WALSegmentContext *segcxt, >> + WALSegmentOpen openSegment) >> +{ >> + char *p; >> + XLogRecPtr recptr; >> + Size nbytes; >> + static XLogReadError errinfo; > I see. That seems like an absolutely terrible "fix". We don't really want XLogRead to be defined in a way that forces it to be non-reentrant do we? regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > BTW that tli_p business to the openSegment callback is horribly > inconsistent. Some callers accept a NULL tli_p, others will outright > crash, even though the API docs say that the callback must determine the > timeline. This is made more complicated by us having the TLI in "seg" > also. Unless I misread, the problem is again that the walsender code is > doing nasty stuff with globals (endSegNo). As a very minor stylistic > point, we prefer to have out params at the end of the signature. XLogRead() tests for NULL so it should not crash but I don't insist on doing it this way. XLogRead() actually does not have to care whether the "open segment callback" determines the TLI or not, so it (XLogRead) can always receive a valid pointer to seg.ws_tli. However that in turn implies that XLogRead() does not need the "tli" argument at all. > > > Why do we leave this responsibility to ReadPageInternal? Wouldn't it > > > make more sense to leave XLogRead be always responsible for setting > > > these correctly, and remove those lines from ReadPageInternal? > > > > I think there's no rule that ReadPageInternal() must use XLogRead(). If we do > > what you suggest, we need make this responsibility documented. I'll consider > > that. I think now we should not add any responsibility to XLogPageReadCB or its subroutines because some extensions might already have their implementation of XLogPageReadCB w/o XLogRead, and this change would break them. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > On 2019-Sep-27, Antonin Houska wrote: > >>> You placed the errinfo in XLogRead's stack rather than its callers' ... > >>> I don't think that works, because as soon as XLogRead returns that > >>> memory is no longer guaranteed to exist. > > >> I was aware of this problem, therefore I defined the field as static: > >> > >> +XLogReadError * > >> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p, > >> + WALOpenSegment *seg, WALSegmentContext *segcxt, > >> + WALSegmentOpen openSegment) > >> +{ > >> + char *p; > >> + XLogRecPtr recptr; > >> + Size nbytes; > >> + static XLogReadError errinfo; > > > I see. > > That seems like an absolutely terrible "fix". We don't really want > XLogRead to be defined in a way that forces it to be non-reentrant do we? Good point. I forgot that the XLOG reader can be used by frontends, so thread safety is important here. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Antonin Houska <ah@cybertec.at> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > BTW that tli_p business to the openSegment callback is horribly > > inconsistent. Some callers accept a NULL tli_p, others will outright > > crash, even though the API docs say that the callback must determine the > > timeline. This is made more complicated by us having the TLI in "seg" > > also. Unless I misread, the problem is again that the walsender code is > > doing nasty stuff with globals (endSegNo). As a very minor stylistic > > point, we prefer to have out params at the end of the signature. > > XLogRead() tests for NULL so it should not crash but I don't insist on doing > it this way. XLogRead() actually does not have to care whether the "open > segment callback" determines the TLI or not, so it (XLogRead) can always > receive a valid pointer to seg.ws_tli. This is actually wrong - seg.ws_tli is not always the correct value to pass. If seg.ws_tli refers to the segment from which data was read last time, then XLogRead() still needs a separate argument to specify from which TLI the current call should read. If these two differ, new file needs to be opened. The problem of walsender.c is that its implementation of XLogRead() does not care about the TLI of the previous read. If the behavior of the new, generic implementation should be exactly the same, we need to tell XLogRead() that in some cases it also should not compare the current TLI to the previous one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID earlier. Another approach is to add a boolean argument "check_tli", but that still forces caller to pass some (random) value of the tli. The concept of InvalidTimeLineID seems to me less disturbing than this. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Hello, At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos> > Antonin Houska <ah@cybertec.at> wrote: > > > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > BTW that tli_p business to the openSegment callback is horribly > > > inconsistent. Some callers accept a NULL tli_p, others will outright > > > crash, even though the API docs say that the callback must determine the > > > timeline. This is made more complicated by us having the TLI in "seg" > > > also. Unless I misread, the problem is again that the walsender code is > > > doing nasty stuff with globals (endSegNo). As a very minor stylistic > > > point, we prefer to have out params at the end of the signature. > > > > XLogRead() tests for NULL so it should not crash but I don't insist on doing > > it this way. XLogRead() actually does not have to care whether the "open > > segment callback" determines the TLI or not, so it (XLogRead) can always > > receive a valid pointer to seg.ws_tli. > > This is actually wrong - seg.ws_tli is not always the correct value to > pass. If seg.ws_tli refers to the segment from which data was read last time, > then XLogRead() still needs a separate argument to specify from which TLI the > current call should read. If these two differ, new file needs to be opened. openSegment represents the file *currently* opened. XLogRead needs the TLI *to be* opened. If they are different, as far as wal logical wal sender and pg_waldump is concerned, XLogRead switches to the new TLI and the new TLI is set to openSegment.ws_tli. So, it seems to me that the parameter doesn't need to be inout? It is enough that it is an "in" parameter. > The problem of walsender.c is that its implementation of XLogRead() does not > care about the TLI of the previous read. If the behavior of the new, generic > implementation should be exactly the same, we need to tell XLogRead() that in > some cases it also should not compare the current TLI to the previous > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID > earlier. Physical wal sender doesn't switch TLI. So I don't think the behavior doesn't harm (or doesn't fire). openSegment holds the TLI set at the first call. (Even if future wal sender switches TLI, the behavior should be needed.) > Another approach is to add a boolean argument "check_tli", but that still > forces caller to pass some (random) value of the tli. The concept of > InvalidTimeLineID seems to me less disturbing than this. So I think InvalidTimeLineID is not needed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos> > > Antonin Houska <ah@cybertec.at> wrote: > > > > > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > > BTW that tli_p business to the openSegment callback is horribly > > > > inconsistent. Some callers accept a NULL tli_p, others will outright > > > > crash, even though the API docs say that the callback must determine the > > > > timeline. This is made more complicated by us having the TLI in "seg" > > > > also. Unless I misread, the problem is again that the walsender code is > > > > doing nasty stuff with globals (endSegNo). As a very minor stylistic > > > > point, we prefer to have out params at the end of the signature. > > > > > > XLogRead() tests for NULL so it should not crash but I don't insist on doing > > > it this way. XLogRead() actually does not have to care whether the "open > > > segment callback" determines the TLI or not, so it (XLogRead) can always > > > receive a valid pointer to seg.ws_tli. > > > > This is actually wrong - seg.ws_tli is not always the correct value to > > pass. If seg.ws_tli refers to the segment from which data was read last time, > > then XLogRead() still needs a separate argument to specify from which TLI the > > current call should read. If these two differ, new file needs to be opened. > > openSegment represents the file *currently* opened. I suppose you mean the "seg" argument. > XLogRead needs the TLI *to be* opened. If they are different, as far as wal > logical wal sender and pg_waldump is concerned, XLogRead switches to the new > TLI and the new TLI is set to openSegment.ws_tli. Yes, it works in these cases. > So, it seems to me that the parameter doesn't need to be inout? It is enough > that it is an "in" parameter. I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch version. The reason I used pointer was the special meaning of the NULL value: if NULL is passed, then the timeline should be ignored (because of the other cases, see below). > > The problem of walsender.c is that its implementation of XLogRead() does not > > care about the TLI of the previous read. If the behavior of the new, generic > > implementation should be exactly the same, we need to tell XLogRead() that in > > some cases it also should not compare the current TLI to the previous > > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID > > earlier. > > Physical wal sender doesn't switch TLI. So I don't think the > behavior doesn't harm (or doesn't fire). openSegment holds the > TLI set at the first call. (Even if future wal sender switches > TLI, the behavior should be needed.) Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead() introduced by the patch does have one. What should be passed for TLI to the new implementation if it's called from walsender.c? If the check for a segment change looks like this (here "tli" is the argument representing the desired TLI) if (seg->ws_file < 0 || !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) || tli != seg->ws_tli) { XLogSegNo nextSegNo; /* Switch to another logfile segment */ if (seg->ws_file >= 0) close(seg->ws_file); then any valid TLI can result in accidental closing of the current segment file. Since this is only refactoring patch, we should not allow such a change of behavior even if it seems that the same segment will be reopened immediately. -- Antonin Houska Web: https://www.cybertec-postgresql.com
At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > XLogRead() tests for NULL so it should not crash but I don't insist on doing > > > > it this way. XLogRead() actually does not have to care whether the "open > > > > segment callback" determines the TLI or not, so it (XLogRead) can always > > > > receive a valid pointer to seg.ws_tli. > > > > > > This is actually wrong - seg.ws_tli is not always the correct value to > > > pass. If seg.ws_tli refers to the segment from which data was read last time, > > > then XLogRead() still needs a separate argument to specify from which TLI the > > > current call should read. If these two differ, new file needs to be opened. > > > > openSegment represents the file *currently* opened. > > I suppose you mean the "seg" argument. > > > XLogRead needs the TLI *to be* opened. If they are different, as far as wal > > logical wal sender and pg_waldump is concerned, XLogRead switches to the new > > TLI and the new TLI is set to openSegment.ws_tli. > > Yes, it works in these cases. > > > So, it seems to me that the parameter doesn't need to be inout? It is enough > > that it is an "in" parameter. > > I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch > version. The reason I used pointer was the special meaning of the NULL value: > if NULL is passed, then the timeline should be ignored (because of the other > cases, see below). Understood. > > > The problem of walsender.c is that its implementation of XLogRead() does not > > > care about the TLI of the previous read. If the behavior of the new, generic > > > implementation should be exactly the same, we need to tell XLogRead() that in > > > some cases it also should not compare the current TLI to the previous > > > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID > > > earlier. > > > > Physical wal sender doesn't switch TLI. So I don't think the > > behavior doesn't harm (or doesn't fire). openSegment holds the > > TLI set at the first call. (Even if future wal sender switches > > TLI, the behavior should be needed.) > > Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead() > introduced by the patch does have one. What should be passed for TLI to the > new implementation if it's called from walsender.c? If the check for a segment > change looks like this (here "tli" is the argument representing the desired > TLI) TLI is mandatory to generate a wal file name so it must be passed to the function anyways. In the current code it is sendTimeLine for the walsender.c:XLogRead(). logical_read_xlog_page sets the variable very time immediately before calling XLogRead(). CreateReplicationSlot and StartReplication set the variable to desired TLI immediately before calling and once it is set by StartReplication, it is not changed by XLogSendPhysical and wal sender ends at the end of the current timeline. In the XLogRead, the value is copied to sendSeg->ws_tli when the file for the new timeline is read. > if (seg->ws_file < 0 || > !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) || > tli != seg->ws_tli) > { > XLogSegNo nextSegNo; > > /* Switch to another logfile segment */ > if (seg->ws_file >= 0) > close(seg->ws_file); > > then any valid TLI can result in accidental closing of the current segment > file. Since this is only refactoring patch, we should not allow such a change > of behavior even if it seems that the same segment will be reopened > immediately. Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true but seg->ws_file < 0 is also always true at the time. In other words, the "tli != seg->ws_tli" is not even evaluated. If wal sender had an open file (ws_file >= 0) and the new TLI is different from ws_tli, it would be the sign of a serious bug. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos> > > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > The problem of walsender.c is that its implementation of XLogRead() does not > > > > care about the TLI of the previous read. If the behavior of the new, generic > > > > implementation should be exactly the same, we need to tell XLogRead() that in > > > > some cases it also should not compare the current TLI to the previous > > > > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID > > > > earlier. > > > > > > Physical wal sender doesn't switch TLI. So I don't think the > > > behavior doesn't harm (or doesn't fire). openSegment holds the > > > TLI set at the first call. (Even if future wal sender switches > > > TLI, the behavior should be needed.) > > > > Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead() > > introduced by the patch does have one. What should be passed for TLI to the > > new implementation if it's called from walsender.c? I f the check for a segment > > change looks like this (here "tli" is the argument representing the desired > > TLI) > > TLI is mandatory to generate a wal file name so it must be passed > to the function anyways. In the current code it is sendTimeLine > for the walsender.c:XLogRead(). logical_read_xlog_page sets the > variable very time immediately before calling > XLogRead(). CreateReplicationSlot and StartReplication set the > variable to desired TLI immediately before calling and once it is > set by StartReplication, it is not changed by XLogSendPhysical > and wal sender ends at the end of the current timeline. In the > XLogRead, the value is copied to sendSeg->ws_tli when the file > for the new timeline is read. Are you saying that we should pass sendTimeLine to XLogRead()? I think it's not always correct because sendSeg->ws_tli is sometimes assigned sendTimeLineNextTLI, so the test "tli != seg->ws_tli" in > > if (seg->ws_file < 0 || > > !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) || > > tli != seg->ws_tli) > > { > > XLogSegNo nextSegNo; could pass occasionally. > Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true > but seg->ws_file < 0 is also always true at the time. In other > words, the "tli != seg->ws_tli" is not even evaluated. > > If wal sender had an open file (ws_file >= 0) and the new TLI is > different from ws_tli, it would be the sign of a serious bug. So we can probably pass ws_tli as the "new TLI" when calling the new XLogRead() from walsender.c. Is that what you try to say? I need to think about it more but it sounds like a good idea. -- Antonin Houska Web: https://www.cybertec-postgresql.com
This is the next version. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote: > This is the next version. So... These are the two last bits to look at, reducing a bit the code size: 5 files changed, 396 insertions(+), 419 deletions(-) And what this patch set does is to refactor the routines we use now in xlogreader.c to read a page by having new callbacks to open a segment, as that's basically the only difference between the context of a WAL sender, pg_waldump and recovery. Here are some comments reading through the code. + * Note that XLogRead(), if used, should have updated the "seg" too for + * its own reasons, however we cannot rely on ->read_page() to call + * XLogRead(). Why? Your patch removes all the three optional lseek() calls which can happen in a segment. Am I missing something but isn't that plain wrong? You could reuse the error context for that as well if an error happens as what's needed is basically the segment name and the LSN offset. All the callers of XLogReadProcessError() are in src/backend/, so it seems to me that there is no point to keep that in xlogreader.c but it should be instead in xlogutils.c, no? It seems to me that this is more like XLogGenerateError, or just XLogError(). We have been using xlog as an acronym in many places of the code, so switching now to wal just for the past matter of the pg_xlog -> pg_wal switch does not seem worth bothering. +read_local_xlog_page_segment_open(XLogSegNo nextSegNo, + WALSegmentContext *segcxt, Could you think about a more simple name here? It is a callback to open a new segment, so it seems to me that we could call it just open_segment_callback(). There is also no point in using a pointer to the TLI, no? + * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf', + * starting at location 'startptr'. 'seg' is the last segment used, + * 'openSegment' is a callback to opens the next segment if needed and + * 'segcxt' is additional segment info that does not fit into 'seg'. A typo and the last part of the last sentence could be better worded. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote: > > This is the next version. > > So... These are the two last bits to look at, reducing a bit the code > size: > 5 files changed, 396 insertions(+), 419 deletions(-) > > And what this patch set does is to refactor the routines we use now in > xlogreader.c to read a page by having new callbacks to open a segment, > as that's basically the only difference between the context of a WAL > sender, pg_waldump and recovery. > > Here are some comments reading through the code. > > + * Note that XLogRead(), if used, should have updated the "seg" too for > + * its own reasons, however we cannot rely on ->read_page() to call > + * XLogRead(). > Why? I've updated the comment: + /* + * Update read state information. + * + * If XLogRead() is was called by ->read_page, it should have updated the + * ->seg fields accordingly (since we never request more than a single + * page, neither ws_segno nor ws_off should have advanced beyond + * targetSegNo and targetPageOff respectively). However it's not mandatory + * for ->read_page to call XLogRead(). + */ Besides what I say here, I'm not sure if we should impose additional requirement on the existing callbacks (possibly those in extensions) to update the XLogReaderState.seg structure. > Your patch removes all the three optional lseek() calls which can > happen in a segment. Am I missing something but isn't that plain > wrong? You could reuse the error context for that as well if an error > happens as what's needed is basically the segment name and the LSN > offset. Explicit call of lseek() is not used because XLogRead() uses pg_pread() now. Nevertheless I found out that in the the last version of the patch I set ws_off to 0 for a newly opened segment. This was wrong, fixed now. > All the callers of XLogReadProcessError() are in src/backend/, so it > seems to me that there is no point to keep that in xlogreader.c but it > should be instead in xlogutils.c, no? It seems to me that this is > more like XLogGenerateError, or just XLogError(). We have been using > xlog as an acronym in many places of the code, so switching now to wal > just for the past matter of the pg_xlog -> pg_wal switch does not seem > worth bothering. ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the "Read" word should be there because many other error can happen during XLOG processing. > +read_local_xlog_page_segment_open(XLogSegNo nextSegNo, > + WALSegmentContext *segcxt, > Could you think about a more simple name here? It is a callback to > open a new segment, so it seems to me that we could call it just > open_segment_callback(). ok, the function is not exported to other modules, so there's no need to care about uniqueness of the name. I chose wal_segment_open(), according to the callback type WALSegmentOpen. > There is also no point in using a pointer to the TLI, no? This particular callback makes no decision about the TLI, so it only uses tli_p as an input argument. > + * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf', > + * starting at location 'startptr'. 'seg' is the last segment used, > + * 'openSegment' is a callback to opens the next segment if needed and > + * 'segcxt' is additional segment info that does not fit into 'seg'. > A typo and the last part of the last sentence could be better worded. ok, adjusted a bit. Thanks for review. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote: >> On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote: > + /* > + * Update read state information. > + * > + * If XLogRead() is was called by ->read_page, it should have updated the > + * ->seg fields accordingly (since we never request more than a single > + * page, neither ws_segno nor ws_off should have advanced beyond > + * targetSegNo and targetPageOff respectively). However it's not mandatory > + * for ->read_page to call XLogRead(). > + */ > > Besides what I say here, I'm not sure if we should impose additional > requirement on the existing callbacks (possibly those in extensions) to update > the XLogReaderState.seg structure. "is was called" does not make sense in this sentence. Actually, I would tend to just remove it completely. >> Your patch removes all the three optional lseek() calls which can >> happen in a segment. Am I missing something but isn't that plain >> wrong? You could reuse the error context for that as well if an error >> happens as what's needed is basically the segment name and the LSN >> offset. > > Explicit call of lseek() is not used because XLogRead() uses pg_pread() > now. Nevertheless I found out that in the the last version of the patch I set > ws_off to 0 for a newly opened segment. This was wrong, fixed now. Missed that part, thanks. This was actually not obvious after an initial lookup of the patch. Wouldn't it make sense to split that part in a separate patch that we could review and get committed first then? It would have the advantage to make the rest easier to review and follow. And using pread is actually better for performance compared to read+lseek. Now there is also the argument that we don't always seek into an opened WAL segment, and that a plain read() is actually better than pread() in some cases. > ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the > "Read" word should be there because many other error can happen during XLOG > processing. No issue with this name. > ok, the function is not exported to other modules, so there's no need to care > about uniqueness of the name. I chose wal_segment_open(), according to the > callback type WALSegmentOpen. Name is fine by me. >> There is also no point in using a pointer to the TLI, no? > > This particular callback makes no decision about the TLI, so it only uses > tli_p as an input argument. Missed that walsender.c can enforce the tli to a new value, objection withdrawn. + * BasicOpenFile() is the preferred way to open the segment file in backend + * code, whereas open(2) should be used in frontend. I would remove that sentence. +#ifndef FRONTEND +/* + * Backend-specific convenience code to handle read errors encountered by + * XLogRead(). + */ +void +XLogReadRaiseError(XLogReadError *errinfo) No need for the FRONTEND ifndef's here as xlogutils.c is backend-only. +#ifndef FRONTEND +void XLogReadRaiseError(XLogReadError *errinfo); +#endif Same as above, and missing an extern declaration. + fatal_error("could not read from log file %s, offset %u, length %zu: %s", + fname, seg->ws_off, (Size) errinfo.reqbytes, + strerror(errinfo.read_errno)); Let's use this occasion to make those error messages more generic to reduce the pain of translators as the file name lets us know that we have to deal with a WAL segment. Here are some suggestions, taking into account the offset: - If errno is set: "could not read file \"%s\" at offset %u: %m" - For partial read: "could not read file \"%s\" at offset %u: read %d of %zu" -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> wrote: > On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote: > >> On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote: > >> Your patch removes all the three optional lseek() calls which can > >> happen in a segment. Am I missing something but isn't that plain > >> wrong? You could reuse the error context for that as well if an error > >> happens as what's needed is basically the segment name and the LSN > >> offset. > > > > Explicit call of lseek() is not used because XLogRead() uses pg_pread() > > now. Nevertheless I found out that in the the last version of the patch I set > > ws_off to 0 for a newly opened segment. This was wrong, fixed now. > > Missed that part, thanks. This was actually not obvious after an > initial lookup of the patch. Wouldn't it make sense to split that > part in a separate patch that we could review and get committed first > then? It would have the advantage to make the rest easier to review > and follow. And using pread is actually better for performance > compared to read+lseek. Now there is also the argument that we don't > always seek into an opened WAL segment, and that a plain read() is > actually better than pread() in some cases. ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly read sequentially (i.e. without frequent seeks) is the reason pread() has't been adopted so far. The new version reflects your other suggestions too, except the one about not renaming "XLOG" -> "WAL" (actually you mentioned that earlier in the thread). I recall that when working on the preliminary patch (709d003fbd), Alvaro suggested "WAL" for some structures because these are new. The rule seemed to be that "XLOG..." should be left for the existing symbols, while the new ones should be "WAL...": https://www.postgresql.org/message-id/20190917221521.GA15733%40alvherre.pgsql So I decided to rename the new symbols and to remove the related comment. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On 2019-Nov-12, Antonin Houska wrote: > ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly > read sequentially (i.e. without frequent seeks) is the reason pread() has't > been adopted so far. I don't quite understand why you backed off from switching to pread. It seemed a good change to me. Here's a few edits on top of your latest. The new routine WALRead() is not at all the same as the previous XLogRead, so I don't see why we would keep the name. Hence renamed. I see no reason for the openSegment callback to return the FD in an out param instead of straight return value. Changed that way. Having seek/open be a boolean "xlr_seek" seems a bit weird. Changed to an "operation" enum. (Maybe if we go back to pg_pread we can get rid of this.) Accordingly, change WALReadRaiseError and WALDumpReadPage. Change xlr_seg to be a struct rather than pointer to struct. It seems a bit dangerous to me to return a pointer that we don't know is going to be valid at raise-error time. Struct assignment works fine for the purpose. Renamed XLogDumpReadPage to WALDumpReadPage, because what the heck is XLogDump anyway? That doesn't exist. I would only like to switch this back to pg_pread() (from seek/read) and I'd be happy to commit this. What is logical_read_local_xlog_page all about? Seems useless. Let's get rid of it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
BTW ... contrib/test_decoding fails with this patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 15, 2019 at 06:41:02PM -0300, Alvaro Herrera wrote: > I don't quite understand why you backed off from switching to pread. It > seemed a good change to me. > > [...] > > Having seek/open be a boolean "xlr_seek" seems a bit weird. Changed to > an "operation" enum. (Maybe if we go back to pg_pread we can get rid of > this.) Accordingly, change WALReadRaiseError and WALDumpReadPage. This has been quickly mentioned on the thread which has introduced pread(): https://www.postgresql.org/message-id/c2f56d0a-cadd-3df1-ae48-b84dc8128c37@redhat.com Now, read() > pread() > read()+lseek(), and we don't actually need to seek into the file for all the cases where we read a WAL page. And on a platform which uses the fallback implementation, this increases the number of lseek() calls. I can see as you say that using it directly in the refactoring can simplify the code. -- Michael
Attachment
On Mon, Nov 18, 2019 at 09:29:03PM +0900, Michael Paquier wrote: > Now, read() > pread() > read()+lseek(), and we don't actually need to > seek into the file for all the cases where we read a WAL page. And on > a platform which uses the fallback implementation, this increases the > number of lseek() calls. I can see as you say that using it directly > in the refactoring can simplify the code. Putting this point aside, here is the error coming from contrib/test_decoding/, and this is independent of Alvaro's changes: +ERROR: invalid magic number 0000 in log segment 000000010000000000000001, offset 6905856 I don't think that this is just xlp_magic messed up, the full page read is full of zeros. But that's just a guess. Looking at the code, I am spotting one inconsistency in the way seg->ws_off is compiled after doing the read on the new version compared to the three others. read() would move the offset of the file, but the code is forgetting to increment it by a amount of readbytes. Isn't that incorrect? A second thing is that wal_segment_open() definition is incorrect in xlogutils.c, generating a warning. The opened fd is the returned result, and not an argument of the routine. I am switching the patch as waiting on author. Antonin, could you look at those problems? -- Michael
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Nov-12, Antonin Houska wrote: > > > ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly > > read sequentially (i.e. without frequent seeks) is the reason pread() has't > > been adopted so far. > > I don't quite understand why you backed off from switching to pread. It > seemed a good change to me. I agreed with Michael that it makes comparison of the old and new code more difficult, and I also thought that his arguments about performance might be worthwhile because WAL reading is mostly sequential and does not require many seeks. However things appear to be more complex, see below. > Here's a few edits on top of your latest. > ... I agree with your renamings. > Change xlr_seg to be a struct rather than pointer to struct. It seems a > bit dangerous to me to return a pointer that we don't know is going to > be valid at raise-error time. Struct assignment works fine for the > purpose. ok > I would only like to switch this back to pg_pread() (from seek/read) and > I'd be happy to commit this. I realized that, starting from commit 709d003fbd98b975a4fbcb4c5750fa6efaf9ad87 we use the WALOpenSegment.ws_off field incorrectly in walsender.c:XLogRead(). In that commit we used this field to replace XLogReaderState.readOff: @@ -156,10 +165,9 @@ struct XLogReaderState char *readBuf; uint32 readLen; - /* last read segment, segment offset, TLI for data currently in readBuf */ - XLogSegNo readSegNo; - uint32 readOff; - TimeLineID readPageTLI; + /* last read XLOG position for data currently in readBuf */ + WALSegmentContext segcxt; + WALOpenSegment seg; /* * beginning of prior page read, and its TLI. Doesn't necessarily Thus we cannot use it in XLogRead() to track the current position in the segment file. Although walsender.c:XLogRead() misses this point, it's not broken because walsender.c does not use XLogReaderState at all. So if explicit lseek() should be used, another field should be added to WALOpenSegment. I failed to do so when removing the pg_pread() call from the patch, and that was the reason for the problem reported here: https://www.postgresql.org/message-id/20191117042221.GA16537%40alvherre.pgsql https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz Thus the use of pg_pread() makes the code quite a bit simpler, so I re-introduced it. If you decide that an explicit lseek() should be used yet, just let me know. > What is logical_read_local_xlog_page all about? Seems useless. Let's > get rid of it. It seems so. Should I post a patch for that? -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Wed, Nov 20, 2019 at 03:50:29PM +0100, Antonin Houska wrote: > Thus the use of pg_pread() makes the code quite a bit simpler, so I > re-introduced it. If you decide that an explicit lseek() should be used yet, > just let me know. Skimming through the code, it looks like in a good state. The failures of test_deconding are fixed, and all the changes from Alvaro have been added. + fatal_error("could not read in file %s, offset %u, length %zu: %s", + fname, seg->ws_off, (Size) errinfo.wre_req, + strerror(errinfo.wre_errno)); You should be able to use %m here instead of strerror(). It seems to me that it is always important to not do changes completely blindly either so as this does not become an issue for recovery later on. FWIW, I ran a small set of tests with a WAL segment sizes of 1MB and 1GB (fsync = off, max_wal_size/min_wal_size set very high, 1 billion rows in single-column table followed by a series of updates): - Created a primary and a standby which archive_mode set. - Stopped the standby. - Produced close to 12GB worth of WAL. - Restarted the standby with restore_command and compared the time it takes for recovery to complete all the segments with HEAD and your refactoring: 1GB + HEAD: 7min52s 1GB + patch: 8min10s 1MB + HEAD: 10min17s 1MB + patch: 12min1s And with WAL segments at 1MB, I was seeing quite a slowdown with the patch... Then I have done an extra test with pg_waldump with the segments generated previously with the output redirected to /dev/null. Going through 512 segments takes 15.730s with HEAD (average of 3 runs) and 15.851s with the patch. -- Michael
Attachment
On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote: > And with WAL segments at 1MB, I was seeing quite a slowdown with the > patch... Then I have done an extra test with pg_waldump with the > segments generated previously with the output redirected to /dev/null. > Going through 512 segments takes 15.730s with HEAD (average of 3 runs) > and 15.851s with the patch. Here are more tests with pg_waldump and 1MB/1GB segment sizes with records generated from pgbench, (7 runs, eliminated the two highest and two lowest, these are the remaining 3 runs as real time): 1) 1MB segment size, 512 segments: time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null - HEAD: 0m4.512s, 0m4.446s, 0m4.501s - Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s - Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s 2) 1GB segment size, 3 segments: time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null - HEAD: 0m11.802s, 0m11.834s, 0m11.846s - Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s - Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s So there is a tendency for a small slowdown here. Still it is not that much, so I withdraw my concerns. Another thing: +void WALReadRaiseError(WALReadError *errinfo); This is missing an "extern" declaration. Alvaro, you are marked as a committer of this CF entry. Are you planning to look at it again? Sorry for the delay from my side. -- Michael
Attachment
On 2019-Nov-22, Michael Paquier wrote: > Alvaro, you are marked as a committer of this CF entry. Are you > planning to look at it again? Sorry for the delay from my side. Yes :-) hopefully next week. Thanks for reviewing. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 22, 2019 at 12:49:33AM -0300, Alvaro Herrera wrote: > Yes :-) hopefully next week. Thanks for reviewing. Thanks, I am switching the entry as ready for committer then. Please note that the latest patch series have a conflict at the top of walsender.c easy enough to resolve, and that the function declaration in xlogutils.h misses an "extern". I personally find unnecessary the last sentence in the new comment block of xlogreader.h to describe the new callback to open a segment about BasicOpenFile() and open() because one could also use a transient file opened in the backend, but I'll be fine with anything you think is most fit. That's a minor point. Thanks Antonin for doing the refactoring effort. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote: > > And with WAL segments at 1MB, I was seeing quite a slowdown with the > > patch... Then I have done an extra test with pg_waldump with the > > segments generated previously with the output redirected to /dev/null. > > Going through 512 segments takes 15.730s with HEAD (average of 3 runs) > > and 15.851s with the patch. > > Here are more tests with pg_waldump and 1MB/1GB segment sizes with > records generated from pgbench, (7 runs, eliminated the two highest > and two lowest, these are the remaining 3 runs as real time): > 1) 1MB segment size, 512 segments: > time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null > - HEAD: 0m4.512s, 0m4.446s, 0m4.501s > - Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s > - Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s > 2) 1GB segment size, 3 segments: > time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null > - HEAD: 0m11.802s, 0m11.834s, 0m11.846s > - Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s > - Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s > So there is a tendency for a small slowdown here. Still it is not > that much, so I withdraw my concerns. Thanks for the testing! I thought that in [1] you try discourage me from using pg_pread(), but now it seems to be the opposite. Ideally I'd like to see no overhead added by my patch at all, but the code simplicity should matter too. As a clue, we can perhaps consider the fact that commit c24dcd0c removed explicit lseek() also from XLogWrite(), but I'm not sure how much we can compare XLOG writing and reading (I'd expect writing to be a bit less sequential than reading because XLogWrite() may need to write the last page more than once.) Let's wait for Alvaro's judgement. > Another thing: > +void WALReadRaiseError(WALReadError *errinfo); > This is missing an "extern" declaration. I'll fix this as well as the other problem reported in [1] as soon as I know whether pg_pread() should be used or not. [1] https://www.postgresql.org/message-id/20191121080550.GG153437%40paquier.xyz -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2019-Nov-22, Antonin Houska wrote: > I thought that in [1] you try discourage me from using pg_pread(), but now it > seems to be the opposite. Ideally I'd like to see no overhead added by my > patch at all, but the code simplicity should matter too. FWIW I think the new code is buggy because it doesn't seem to be setting ws_off, so I suppose the optimization in ReadPageInternal to skip reading the page when it's already the page we have is not hit, except for the first page in the segment. I didn't verify this, just my impression while reading the code. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote: > FWIW I think the new code is buggy because it doesn't seem to be setting > ws_off, so I suppose the optimization in ReadPageInternal to skip > reading the page when it's already the page we have is not hit, except > for the first page in the segment. I didn't verify this, just my > impression while reading the code. FWIW, this matches with my impression here, third paragraph: https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz -- Michael
Attachment
On 2019-Nov-22, Michael Paquier wrote: > On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote: > > FWIW I think the new code is buggy because it doesn't seem to be setting > > ws_off, so I suppose the optimization in ReadPageInternal to skip > > reading the page when it's already the page we have is not hit, except > > for the first page in the segment. I didn't verify this, just my > > impression while reading the code. > > FWIW, this matches with my impression here, third paragraph: > https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz Ah, right. I was wondering if we shouldn't do away with the concept of "offset" as such, since the offset there is always forcibly set to the start of a page. Why don't we count page numbers instead? It seems like the interface is confusingly generic (measure in bytes) yet not offer any extra functionality that could not be obtained with a simpler struct repr (measure in pages). But then that's not something that we need to change in this patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Nov-22, Michael Paquier wrote: > > > On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote: > > > FWIW I think the new code is buggy because it doesn't seem to be setting > > > ws_off, so I suppose the optimization in ReadPageInternal to skip > > > reading the page when it's already the page we have is not hit, except > > > for the first page in the segment. I didn't verify this, just my > > > impression while reading the code. > > > > FWIW, this matches with my impression here, third paragraph: > > https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz > > Ah, right. As I pointed out in https://www.postgresql.org/message-id/88183.1574261429%40antos seg.ws_off only replaced readOff in XLogReaderState. So we should only update ws_off where readOff was updated before commit 709d003. This does happen in ReadPageInternal (see HEAD) and I see no reason for the final patch to update ws_off anywhere else. > I was wondering if we shouldn't do away with the concept of "offset" as > such, since the offset there is always forcibly set to the start of a > page. Why don't we count page numbers instead? It seems like the > interface is confusingly generic (measure in bytes) yet not offer any > extra functionality that could not be obtained with a simpler struct > repr (measure in pages). Yes, I agree that page numbers would be sufficient. > But then that's not something that we need to change in this patch. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2019-Nov-22, Antonin Houska wrote: > As I pointed out in > > https://www.postgresql.org/message-id/88183.1574261429%40antos > > seg.ws_off only replaced readOff in XLogReaderState. So we should only update > ws_off where readOff was updated before commit 709d003. This does happen in > ReadPageInternal (see HEAD) and I see no reason for the final patch to update > ws_off anywhere else. Oh you're right. I see no reason to leave ws_off. We can move that to XLogReaderState; I did that here. We also need the offset in WALReadError, though, so I added it there too. Conceptually it seems clearer to me this way. What do you think of the attached? BTW I'm not clear what errors can pread()/pg_pread() report that do not set errno. I think lines 1083/1084 of WALRead are spurious now. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Nov 22, 2019 at 07:56:32PM -0300, Alvaro Herrera wrote: > I see no reason to leave ws_off. We can move that to XLogReaderState; I > did that here. We also need the offset in WALReadError, though, so I > added it there too. Conceptually it seems clearer to me this way. Yeah, that seems cleaner. > What do you think of the attached? Looks rather fine to me. > BTW I'm not clear what errors can pread()/pg_pread() report that do not > set errno. I think lines 1083/1084 of WALRead are spurious now. Because we have no guarantee that errno will be cleared if you do a partial read where errno is not set, so you may finish by reporting the state of a previous failed read instead of the partially-failed one depending on how WALReadError is treated? In short, I don't see any actual reason why it would be good to remove the reset of errno either before the calls to pread and pwrite(). -- Michael
Attachment
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Nov-22, Antonin Houska wrote: > > > As I pointed out in > > > > https://www.postgresql.org/message-id/88183.1574261429%40antos > > > > seg.ws_off only replaced readOff in XLogReaderState. So we should only update > > ws_off where readOff was updated before commit 709d003. This does happen in > > ReadPageInternal (see HEAD) and I see no reason for the final patch to update > > ws_off anywhere else. > > Oh you're right. > > I see no reason to leave ws_off. We can move that to XLogReaderState; I > did that here. We also need the offset in WALReadError, though, so I > added it there too. Conceptually it seems clearer to me this way. > > What do you think of the attached? It looks good to me. Attached is just a fix of a minor problem in error reporting that Michael pointed out earlier. > BTW I'm not clear what errors can pread()/pg_pread() report that do not > set errno. I think lines 1083/1084 of WALRead are spurious now. All I can say is that the existing calls of pg_pread() do not clear errno, so you may be right. I'd appreciate more background about the "partial read" that Michael mentions here: https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On 2019-Nov-25, Antonin Houska wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I see no reason to leave ws_off. We can move that to XLogReaderState; I > > did that here. We also need the offset in WALReadError, though, so I > > added it there too. Conceptually it seems clearer to me this way. > > > > What do you think of the attached? > > It looks good to me. Attached is just a fix of a minor problem in error > reporting that Michael pointed out earlier. Excellent, I pushed it with this change included and some other cosmetic changes. Now there's only XLogPageRead() ... > > BTW I'm not clear what errors can pread()/pg_pread() report that do not > > set errno. I think lines 1083/1084 of WALRead are spurious now. > > All I can say is that the existing calls of pg_pread() do not clear errno, so > you may be right. Right ... in this interface, we only report an error if pg_pread() returns negative, which is documented to always set errno. > I'd appreciate more background about the "partial read" that > Michael mentions here: > > https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz In the current implementation, if pg_pread() does a partial read, we just loop one more time. I considered changing the "if (readbytes <= 0)" with "if (readbytes < segbytes)", but that seemed pointless. However, writing this now makes me think that we should add a CHECK_FOR_INTERRUPTS in this loop. (I also wonder if we shouldn't limit the number of times we retry if pg_pread returns zero (i.e. no error, but no bytes read either). I don't know if this is a real-world consideration.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > On 2019-Nov-25, Antonin Houska wrote: > > > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I see no reason to leave ws_off. We can move that to XLogReaderState; I > > > did that here. We also need the offset in WALReadError, though, so I > > > added it there too. Conceptually it seems clearer to me this way. > > > > > > What do you think of the attached? > > > > It looks good to me. Attached is just a fix of a minor problem in error > > reporting that Michael pointed out earlier. > > Excellent, I pushed it with this change included and some other cosmetic > changes. Thanks! > Now there's only XLogPageRead() ... Hm, this seems rather specific, not sure it's worth trying to use WALRead() here. Anyway, I notice that it uses pg_read() too. > > I'd appreciate more background about the "partial read" that > > Michael mentions here: > > > > https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz > > In the current implementation, if pg_pread() does a partial read, we > just loop one more time. > > I considered changing the "if (readbytes <= 0)" with "if (readbytes < > segbytes)", but that seemed pointless. In the pread() documentation I see "Upon reading end-of-file, zero is returned." but that does not tell whether zero can be returned without reaching EOF. However XLogPageRead() handles zero as an error, so WALRead() is consistent with that. > However, writing this now makes me think that we should add a > CHECK_FOR_INTERRUPTS in this loop. (I also wonder if we shouldn't limit > the number of times we retry if pg_pread returns zero (i.e. no error, > but no bytes read either). I don't know if this is a real-world > consideration.) If statement above is correct, then we shouldn't need this. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On 2019-Nov-20, Antonin Houska wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > What is logical_read_local_xlog_page all about? Seems useless. Let's > > get rid of it. > > It seems so. Should I post a patch for that? No need .. it was simple enough. Just pushed it. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services