Thread: Attempt to consolidate reading of XLOG page

Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

11 April 2019, 16:05:42

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

reuse_xlogread.patch

Re: Attempt to consolidate reading of XLOG page

From

Kyotaro HORIGUCHI

Date:

12 April 2019, 03:27:11

Hello.

At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net>
> While working on the instance encryption I found it annoying to apply
> decyption of XLOG page to three different functions. Attached is a patch that
> tries to merge them all into one function, XLogRead(). The existing
> implementations differ in the way new segment is opened. So I added a pointer
> to callback function as a new argument. This callback handles the specific
> ways to determine segment file name and to open the file.
> 
> I can split the patch into multiple diffs to make detailed review easier, but
> first I'd like to hear if anything is seriously wrong about this
> design. Thanks.

This patch changes XLogRead to allow using other than
BasicOpenFile to open a segment, and use XLogReaderState.private
to hold a new struct XLogReadPos for the segment reader. The new
struct is heavily duplicated with XLogReaderState and I'm not
sure the rason why the XLogReadPos is needed.

Anyway, in the first place, such two distinct-but-highly-related
callbacks makes things too complex. Heikki said that the log
reader stuff is better not using callbacks and I agree to that. I
did that once for my own but the code is no longer
applicable. But it seems to be the time to do that.

https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi

That would seems like follows. That refactoring separates log
reader and page reader.


for(;;)
{
    rc = XLogReadRecord(reader, startptr, errormsg);

    if (rc == XLREAD_SUCCESS)
    {
        /* great, got record */
    }
    if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
    {
        elog(ERROR, "invalid record");
    }
    if (rc == XLREAD_NEED_DATA)
    {
        /*
         * Read a page from disk, and place it into reader->readBuf
         */
        XLogPageRead(reader->readPagePtr, /* page to read */
                     reader->reqLen       /* # of bytes to read */ );
        /*
         * Now that we have read the data that XLogReadRecord()
         * requested, call it again.
         */
        continue;
    }
}

DecodingContextFindStartpoint(ctx)
  do
  {
     read_local_xlog_page(....);
     rc = XLogReadRecord (reader);
  while (rc == XLREAD_NEED_DATA);

I'm going to do that again.


Any other opinions, or thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

12 April 2019, 04:27:46

On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:
> This patch changes XLogRead to allow using other than
> BasicOpenFile to open a segment, and use XLogReaderState.private
> to hold a new struct XLogReadPos for the segment reader. The new
> struct is heavily duplicated with XLogReaderState and I'm not
> sure the rason why the XLogReadPos is needed.
> Any other opinions, or thoughts?

The focus is on the stability of v12 for the next couple of months, so
please make sure to register it to the next CF if you want feedback.

Here are some basic thoughts after a very quick lookup.

+/*
+ * Position in XLOG file while reading it.
+ */
+typedef struct XLogReadPos
+{
+   int segFile;                /* segment file descriptor */
+   XLogSegNo   segNo;          /* segment number */
+   uint32 segOff;              /* offset in the segment */
+   TimeLineID tli;     /* timeline ID of the currently open file */
+
+   char    *dir;               /* directory (only needed by
frontends) */
+} XLogReadPos;
Not sure if there is any point to split that with the XLOG reader
status.

+static void fatal_error(const char *fmt,...) pg_attribute_printf(1, 2);
+
+static void
+fatal_error(const char *fmt,...)
In this more confusion accumulates with something which has enough
warts on HEAD when it comes to declare locally equivalents to the
elog() for src/common/.

+/*
+ * This is a front-end counterpart of XLogFileNameP.
+ */
+static char *
+XLogFileNameFE(TimeLineID tli, XLogSegNo segno)
+{
+   char       *result = palloc(MAXFNAMELEN);
+
+   XLogFileName(result, tli, segno, WalSegSz);
+   return result;
+}
We could use a pointer to an allocated area.  Or even better, just a
static variable as this just gets used for error messages to store
temporarily the segment name in a routine part of perhaps
xlogreader.c.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Haribabu Kommi

Date:

12 April 2019, 08:48:33

On Fri, Apr 12, 2019 at 2:06 AM Antonin Houska <ah@cybertec.at> wrote:

While working on the instance encryption I found it annoying to apply
decyption of XLOG page to three different functions. Attached is a patch that
tries to merge them all into one function, XLogRead(). The existing
implementations differ in the way new segment is opened. So I added a pointer
to callback function as a new argument. This callback handles the specific
ways to determine segment file name and to open the file.

I can split the patch into multiple diffs to make detailed review easier, but
first I'd like to hear if anything is seriously wrong about this
design. Thanks.

I didn't check the code, but it is good to combine all the 3 page read functions

into one instead of spreading the logic.

Regards,

Haribabu Kommi

Fujitsu Australia

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

12 April 2019, 16:46:56

On 2019-Apr-11, Antonin Houska wrote:

> While working on the instance encryption I found it annoying to apply
> decyption of XLOG page to three different functions. Attached is a patch that
> tries to merge them all into one function, XLogRead(). The existing
> implementations differ in the way new segment is opened. So I added a pointer
> to callback function as a new argument. This callback handles the specific
> ways to determine segment file name and to open the file.
> 
> I can split the patch into multiple diffs to make detailed review easier, but
> first I'd like to hear if anything is seriously wrong about this
> design. Thanks.

I agree that xlog reading is pretty messy.

I think ifdef'ing the way XLogRead reports errors is not great.  Maybe
we can pass a function pointer that is to be called in case of errors?
Not sure about the walsize; maybe it can be a member in XLogReadPos, and
given to XLogReadInitPos()?  (Maybe rename XLogReadPos as
XLogReadContext or something like that, indicating it's not just the
read position.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

15 April 2019, 08:22:05

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Hello.
>
> At Thu, 11 Apr 2019 18:05:42 +0200, Antonin Houska <ah@cybertec.at> wrote in <14984.1554998742@spoje.net>
> > While working on the instance encryption I found it annoying to apply
> > decyption of XLOG page to three different functions. Attached is a patch that
> > tries to merge them all into one function, XLogRead(). The existing
> > implementations differ in the way new segment is opened. So I added a pointer
> > to callback function as a new argument. This callback handles the specific
> > ways to determine segment file name and to open the file.
> >
> > I can split the patch into multiple diffs to make detailed review easier, but
> > first I'd like to hear if anything is seriously wrong about this
> > design. Thanks.
>
> This patch changes XLogRead to allow using other than
> BasicOpenFile to open a segment,

Good point. The acceptable ways to open file on both frontend and backend side
need to be documented.

> and use XLogReaderState.private to hold a new struct XLogReadPos for the
> segment reader. The new struct is heavily duplicated with XLogReaderState
> and I'm not sure the rason why the XLogReadPos is needed.

ok, I missed the fact that XLogReaderState already contains most of the info
that I put into XLogReadPos. So XLogReadPos is not needed.

> Anyway, in the first place, such two distinct-but-highly-related
> callbacks makes things too complex. Heikki said that the log
> reader stuff is better not using callbacks and I agree to that. I
> did that once for my own but the code is no longer
> applicable. But it seems to be the time to do that.
>
> https://www.postgresql.org/message-id/47215279-228d-f30d-35d1-16af695e53f3@iki.fi

Thanks for the link. My understanding is that the drawback of the
XLogReaderState.read_page callback is that it cannot easily switch between
XLOG sources in order to handle failure because the caller of XLogReadRecord()
usually controls those sources too.

However the callback I pass to XLogRead() is different: if it fails, it simply
raises ERROR. Since this indicates rather low-level problem, there's no reason
for this callback to try to recover from the failure.

> That would seems like follows. That refactoring separates log
> reader and page reader.
>
>
> for(;;)
> {
>     rc = XLogReadRecord(reader, startptr, errormsg);
>
>     if (rc == XLREAD_SUCCESS)
>     {
>         /* great, got record */
>     }
>     if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
>     {
>         elog(ERROR, "invalid record");
>     }
>     if (rc == XLREAD_NEED_DATA)
>     {
>         /*
>          * Read a page from disk, and place it into reader->readBuf
>          */
>         XLogPageRead(reader->readPagePtr, /* page to read */
>                      reader->reqLen       /* # of bytes to read */ );
>         /*
>          * Now that we have read the data that XLogReadRecord()
>          * requested, call it again.
>          */
>         continue;
>     }
> }
>
> DecodingContextFindStartpoint(ctx)
>   do
>   {
>      read_local_xlog_page(....);
>      rc = XLogReadRecord (reader);
>   while (rc == XLREAD_NEED_DATA);
>
> I'm going to do that again.
>
>
> Any other opinions, or thoughts?

I don't see an overlap between what you do and what I do. It seems that even
if you change the XLOG reader API, you don't care what read_local_xlog_page()
does internally. What I try to fix is XLogRead(), and that is actually a
subroutine of read_local_xlog_page().

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

15 April 2019, 08:48:47

Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Apr 12, 2019 at 12:27:11PM +0900, Kyotaro HORIGUCHI wrote:
> > This patch changes XLogRead to allow using other than
> > BasicOpenFile to open a segment, and use XLogReaderState.private
> > to hold a new struct XLogReadPos for the segment reader. The new
> > struct is heavily duplicated with XLogReaderState and I'm not
> > sure the rason why the XLogReadPos is needed.
> > Any other opinions, or thoughts?
> 
> The focus is on the stability of v12 for the next couple of months, so
> please make sure to register it to the next CF if you want feedback.

ok, will do. (A link to mailing list is needed for the CF entry, so I had to
post something anyway :-) Since I don't introduce any kind of "cool new
feature" here, I believe it did not disturb much.)

> Here are some basic thoughts after a very quick lookup.
> ...

Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

15 April 2019, 09:27:36

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I agree that xlog reading is pretty messy.
> 
> I think ifdef'ing the way XLogRead reports errors is not great.  Maybe
> we can pass a function pointer that is to be called in case of errors?

I'll try a bit harder to evaluate the existing approaches to report the same
error on both backend and frontend side.

> Not sure about the walsize; maybe it can be a member in XLogReadPos, and
> given to XLogReadInitPos()?  (Maybe rename XLogReadPos as
> XLogReadContext or something like that, indicating it's not just the
> read position.)

As pointed out by others, XLogReadPos is not necessary. So if XLogRead()
receives XLogReaderState instead, it can get the segment size from there.

Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

02 May 2019, 16:17:36

Antonin Houska <ah@cybertec.at> wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > Not sure about the walsize; maybe it can be a member in XLogReadPos, and
> > given to XLogReadInitPos()?  (Maybe rename XLogReadPos as
> > XLogReadContext or something like that, indicating it's not just the
> > read position.)
> 
> As pointed out by others, XLogReadPos is not necessary. So if XLogRead()
> receives XLogReaderState instead, it can get the segment size from there.

Eventually I found out that it's good to have a separate structure for the
read position because walsender calls the XLogRead() function directly, not
via the XLOG reader. Currently the structure name is XLogSegment (maybe
someone can propose better name) and it's a member of XLogReaderState. No
field of the new structure is duplicated now.

The next version of the patch is attached.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Robert Haas

Date:

06 May 2019, 18:03:13

On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:
> The next version of the patch is attached.

I don't think any of this looks acceptable:

+#ifndef FRONTEND
+/*
+ * Backend should have wal_segment_size variable initialized, segsize is not
+ * used.
+ */
+#define XLogFileNameCommon(tli, num, segsize) XLogFileNameP((tli), (num))
+#define xlr_error(...) ereport(ERROR, (errcode_for_file_access(),
errmsg(__VA_ARGS__)))
+#else
+static char xlr_error_msg[MAXFNAMELEN];
+#define XLogFileNameCommon(tli, num, segsize)
(XLogFileName(xlr_error_msg, (tli), (num), (segsize)),\
+    xlr_error_msg)
+#include "fe_utils/logging.h"
+/*
+ * Frontend application (currently only pg_waldump.c) cannot catch and further
+ * process errors, so they simply treat them as fatal.
+ */
+#define xlr_error(...) do {pg_log_fatal(__VA_ARGS__);
exit(EXIT_FAILURE); } while(0)
+#endif

The backend part doesn't look OK because depending on the value of a
global variable instead of getting the information via parameters
seems like a step backward.  The frontend part doesn't look OK because
it locks every application that uses the xlogreader stuff into using
pg_log_fatal when an error occurs, which may not be what everybody
wants to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

06 May 2019, 18:21:34

On 2019-May-06, Robert Haas wrote:

> On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:
> > The next version of the patch is attached.
> 
> I don't think any of this looks acceptable:

I agree.  I inteded to suggest upthread to pass an additional argument
to XLogRead, which is a function that takes a message string and
SQLSTATE; in backend, the function does errstart / errstate / errmsg /
errfinish, and in frontend programs it does pg_log_fatal (and ignores
sqlstate).  The message must be sprintf'ed and translated by XLogRead.
(xlogreader.c could itself provide a default error reporting callback,
at least for frontend, to avoid repeating the code).  That way, if a
different frontend program wants to do something different, it's fairly
easy to pass a different function pointer.

BTW, having frontend's XLogFileNameCommon use a totally unrelated
variable for its printing is naughty.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Robert Haas

Date:

06 May 2019, 19:58:22

On Mon, May 6, 2019 at 2:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-May-06, Robert Haas wrote:
> > On Thu, May 2, 2019 at 12:18 PM Antonin Houska <ah@cybertec.at> wrote:
> > > The next version of the patch is attached.
> >
> > I don't think any of this looks acceptable:
>
> I agree.  I inteded to suggest upthread to pass an additional argument
> to XLogRead, which is a function that takes a message string and
> SQLSTATE; in backend, the function does errstart / errstate / errmsg /
> errfinish, and in frontend programs it does pg_log_fatal (and ignores
> sqlstate).  The message must be sprintf'ed and translated by XLogRead.
> (xlogreader.c could itself provide a default error reporting callback,
> at least for frontend, to avoid repeating the code).  That way, if a
> different frontend program wants to do something different, it's fairly
> easy to pass a different function pointer.

It seems to me that it's better to unwind the stack i.e. have the
function return the error information to the caller and let the caller
do as it likes.  The other thread to which Horiguchi-san referred
earlier in this thread seems to me to have basically concluded that
the XLogPageReadCB callback to XLogReaderAllocate is a pain to use
because it doesn't unwind the stack, and work is under way over there
to get rid of that callback for just that reason.  Adding a new
callback for error-reporting would just be creating a new instance of
the same issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

21 May 2019, 09:11:37

Robert Haas <robertmhaas@gmail.com> wrote:
> It seems to me that it's better to unwind the stack i.e. have the
> function return the error information to the caller and let the caller
> do as it likes.

Thanks for a hint. The next version tries to do that.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Thomas Munro

Date:

01 July 2019, 09:54:21

On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > It seems to me that it's better to unwind the stack i.e. have the
> > function return the error information to the caller and let the caller
> > do as it likes.
>
> Thanks for a hint. The next version tries to do that.

Hi Antonin,

Could you please send a fresh rebase for the new Commitfest?

Thanks,


--
Thomas Munro
https://enterprisedb.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

09 July 2019, 10:15:37

Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, May 21, 2019 at 9:12 PM Antonin Houska <ah@cybertec.at> wrote:
> > Robert Haas <robertmhaas@gmail.com> wrote:
> > > It seems to me that it's better to unwind the stack i.e. have the
> > > function return the error information to the caller and let the caller
> > > do as it likes.
> >
> > Thanks for a hint. The next version tries to do that.
> 
> Hi Antonin,
> 
> Could you please send a fresh rebase for the new Commitfest?

Rebased.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

03 September 2019, 21:15:32

Hi Antonin, could you please rebase again?

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

03 September 2019, 21:42:33

Pushed 0001.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

09 September 2019, 10:20:30

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Hi Antonin, could you please rebase again?

Attached.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

17 September 2019, 22:15:21

I was confused by the struct name XLogSegment -- the struct is used to
represent a WAL segment while it's kept open, rather than just a WAL
segment in abstract.  Also, now that we've renamed everything to use the
term WAL, it seems wrong to use the name XLog for new structs.  I
propose the name WALOpenSegment for the struct, which solves both
problems.  (Its initializer function would get the name
WALOpenSegmentInit.)

Now, the patch introduces a callback for XLogRead, the type of which is
called XLogOpenSegment.  If we rename it from XLog to WAL, both names
end up the same.  I propose to rename the function type to
WALSegmentOpen, which in a "noun-verb" view of the world, represents the
action of opening a WAL segment.

I attach a patch for all this renaming, on top of your series.

I wonder if each of those WALSegmentOpen callbacks should reset [at
least some members of] the struct; they're already in charge of setting
->file, and apparently we're leaving the responsibility of setting the
rest of the members to XLogRead.  That seems weird.  Maybe we should say
that the CB should only open the segment and not touch the struct at all
and XLogRead is in charge of everything.  Perhaps the other way around
-- the CB should set everything correctly ... I'm not sure which is
best.  But having half here and half there seems a recipe for confusion
and bugs.


Another thing I didn't like much is that everything seems to assume that
the only error possible from XLogRead is a read error.  Maybe that's
okay, because it seems to be the current reality, but it seemed odd.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

xlogopen-rename.patch

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

23 September 2019, 10:44:31

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I was confused by the struct name XLogSegment -- the struct is used to
> represent a WAL segment while it's kept open, rather than just a WAL
> segment in abstract.  Also, now that we've renamed everything to use the
> term WAL, it seems wrong to use the name XLog for new structs.  I
> propose the name WALOpenSegment for the struct, which solves both
> problems.  (Its initializer function would get the name
> WALOpenSegmentInit.)
> 
> Now, the patch introduces a callback for XLogRead, the type of which is
> called XLogOpenSegment.  If we rename it from XLog to WAL, both names
> end up the same.  I propose to rename the function type to
> WALSegmentOpen, which in a "noun-verb" view of the world, represents the
> action of opening a WAL segment.
> 
> I attach a patch for all this renaming, on top of your series.

ok, thanks.

In addition I renamed WalSndOpenSegment() to WalSndSegmentOpen() and
read_local_xlog_page_open_segment() to read_local_xlog_page_segment_open()

> I wonder if each of those WALSegmentOpen callbacks should reset [at
> least some members of] the struct; they're already in charge of setting
> ->file, and apparently we're leaving the responsibility of setting the
> rest of the members to XLogRead.  That seems weird.  Maybe we should say
> that the CB should only open the segment and not touch the struct at all
> and XLogRead is in charge of everything.  Perhaps the other way around
> -- the CB should set everything correctly ... I'm not sure which is
> best.  But having half here and half there seems a recipe for confusion
> and bugs.

ok, I've changed the CB signature. Now it receives poiners to the two
variables that it can change while the "seg" argument is documented as
read-only. To indicate that the CB should determine timeline itself, I
introduced a new constant InvalidTimeLineID, see the 0004 part.

> Another thing I didn't like much is that everything seems to assume that
> the only error possible from XLogRead is a read error.  Maybe that's
> okay, because it seems to be the current reality, but it seemed odd.

In this case I only moved the ereport() code from XLogRead() away (so that
both backend and frontend can call the function). Given that the code to open
WAL segment is now in the callbacks, the only thing that XLogRead() can
ereport is that read() failed. BTW, I introduced one more structure,
XLogReadError, in this patch version. I think it's better than adding
error-specific fields to the WALOpenSegment structure.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

23 September 2019, 22:00:10

I spent a couple of hours on this patchset today.  I merged 0001 and
0002, and decided the result was still messier than I would have liked,
so I played with it a bit more -- see attached.  I think this is
committable, but I'm afraid it'll cause quite a few conflicts with the
rest of your series.

I had two gripes, which I feel solved with my changes:

1. I didn't like that "dir" and "wal segment size" were part of the
"currently open segment" supporting struct.  It seemed that those two
were slightly higher-level, since they apply to every segment that's
going to be opened, not just the current one.

My first thought was to put those as members of XLogReaderState, but
that doesn't work because the physical walsender.c code does not use
xlogreader at all, even though it is reading WAL.  Anyway my solution
was to create yet another struct, which for everything that uses
xlogreader is just part of that state struct; and for walsender, it's
just a separate one alongside sendSeg.  All in all, this seems pretty
clean.

2. Having the wal dir be #ifdef FRONTEND seemed out of place.  I know
the backend code does not use that, but eliding it is more "noisy" than
just setting it to NULL.  Also, the "Finalize the segment pointer"
thingy seemed out of place.  So my code passes the dir as an argument to
XLogReaderAllocate, and if it's null then we just don't allocate it.
Everybody else can use it to guide things.  This results in cleaner
code, because we don't have to handle it externally, which was causing
quite some pain to pg_waldump.  

Note that ws_dir member is a char array in the struct, not just a
pointer.  This saves trouble trying to allocate it (I mainly did it this
way because we don't have pstrdup_extended(MCXT_ALLOC_NO_OOM) ... yes,
this could be made with palloc+snprintf, but eh, that doesn't seem worth
the trouble.)


Separately from those two API-wise points, there was one bug which meant
that with your 0002+0003 the recovery tests did not pass -- code
placement bug.  I suppose the bug disappears with later patches in your
series, which probably is why you didn't notice.  This is the fix for that:

-   XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
-   state->seg.tli = pageTLI;
+   state->seg.ws_tli = pageTLI;
+   XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
             XLOG_BLCKSZ);


... Also, yes, I renamed all the struct members.


If you don't have any strong dislikes for these changes, I'll push this
part and let you rebase the remains on top.


Regarding the other patches:

1. I think trying to do palloc(XLogReadError) is a bad idea ... for
example, if the read fails because of system pressure, we might return
"out of memory" during that palloc instead of the real read error.  This
particular problem you could forestall by changing to ErrorContext, but
I have the impression that it might be better to have the error struct
by stack-allocated in the caller stack.  This forces you to limit the
message string to a maximum size (say 128 bytes or maybe even 1000 bytes
like MAX_ERRORMSG_LEN) but I don't have a problem with that.

2. Not a fan of the InvalidTimeLineID stuff offhand.  Maybe it's okay ...
not convinced yet either way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera from 2ndQuadrant

Date:

23 September 2019, 22:00:38

On 2019-Sep-23, Alvaro Herrera wrote:

> I spent a couple of hours on this patchset today.  I merged 0001 and
> 0002, and decided the result was still messier than I would have liked,
> so I played with it a bit more -- see attached.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

0002-0003.patch

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

24 September 2019, 15:33:24

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I spent a couple of hours on this patchset today.  I merged 0001 and
> 0002, and decided the result was still messier than I would have liked,
> so I played with it a bit more -- see attached.  I think this is
> committable, but I'm afraid it'll cause quite a few conflicts with the
> rest of your series.
>
> I had two gripes, which I feel solved with my changes:
>
> 1. I didn't like that "dir" and "wal segment size" were part of the
> "currently open segment" supporting struct.  It seemed that those two
> were slightly higher-level, since they apply to every segment that's
> going to be opened, not just the current one.

ok

> My first thought was to put those as members of XLogReaderState, but
> that doesn't work because the physical walsender.c code does not use
> xlogreader at all, even though it is reading WAL.

`I don't remember clearly but I think that this was the reason I tried to move
"wal_segment_size" away from XLogReaderState.


> Separately from those two API-wise points, there was one bug which meant
> that with your 0002+0003 the recovery tests did not pass -- code
> placement bug.  I suppose the bug disappears with later patches in your
> series, which probably is why you didn't notice.  This is the fix for that:
>
> -   XLogRead(cur_page, state->seg.size, state->seg.tli, targetPagePtr,
> -   state->seg.tli = pageTLI;
> +   state->seg.ws_tli = pageTLI;
> +   XLogRead(cur_page, state->segcxt.ws_segsize, state->seg.ws_tli, targetPagePtr,
>              XLOG_BLCKSZ);
>

Yes, it seems so - the following parts ensure that XLogRead() adjusts the
timeline itself. I only checked that the each part of the series keeps the
source tree compilable. Thanks for fixing.

> ... Also, yes, I renamed all the struct members.
>
>
> If you don't have any strong dislikes for these changes, I'll push this
> part and let you rebase the remains on top.

No objections here.

> 2. Not a fan of the InvalidTimeLineID stuff offhand.  Maybe it's okay ...
> not convinced yet either way.

Well, it seems that the individual callbacks only use this constant in
Assert() statements. I'll consider if we really need it. The argument value
should not determine whether the callback derives the TLI or not.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

24 September 2019, 19:50:51

On 2019-Sep-24, Antonin Houska wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > If you don't have any strong dislikes for these changes, I'll push this
> > part and let you rebase the remains on top.
> 
> No objections here.

oK, pushed.  Please rebase the other parts.

I made one small adjustment: in read_local_xlog_page() there was one
*readTLI output parameter that was being changed to a local variable
plus later assigment to the output struct member; I changed the code to
continue to assign directly to the output variable instead.  There was
an error case in which the TLI was not assigned to; I suppose this
doesn't really change things (we don't examine the TLI in that case, do
we?), but it seemed dangerous to leave like that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

26 September 2019, 12:08:33

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Sep-24, Antonin Houska wrote:
> 
> > Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > > If you don't have any strong dislikes for these changes, I'll push this
> > > part and let you rebase the remains on top.
> > 
> > No objections here.
> 
> oK, pushed.  Please rebase the other parts.

Thanks!

> I made one small adjustment: in read_local_xlog_page() there was one
> *readTLI output parameter that was being changed to a local variable
> plus later assigment to the output struct member; I changed the code to
> continue to assign directly to the output variable instead.  There was
> an error case in which the TLI was not assigned to; I suppose this
> doesn't really change things (we don't examine the TLI in that case, do
> we?), but it seemed dangerous to leave like that.

I used the local variable to make some expressions simpler, but missed the
fact that this way I can leave the ws_tli field unassigned if the function
returns prematurely. Now that I look closer, I see that it can be a problem -
in the case of ERROR, XLogReadRecord() does reset the state, but it does not
reset the TLI:

err:
    /*
     * Invalidate the read state. We might read from a different source after
     * failure.
     */
    XLogReaderInvalReadState(state);

Thus the TLI appears to be important even on ERROR, and what you've done is
correct. Thanks for fixing that.

One comment on the remaining part of the series:

Before this refactoring, the walsender.c:XLogRead() function contained these
lines

       /*
        * After reading into the buffer, check that what we read was valid. We do
        * this after reading, because even though the segment was present when we
        * opened it, it might get recycled or removed while we read it. The
        * read() succeeds in that case, but the data we tried to read might
        * already have been overwritten with new WAL records.
        */
       XLByteToSeg(startptr, segno, segcxt->ws_segsize);
       CheckXLogRemoved(segno, ThisTimeLineID);

but they don't fit into the new, generic implementation, so I copied these
lines to the two places right after the call of the new XLogRead(). However I
was not sure if ThisTimeLineID was ever correct here. It seems the original
walsender.c:XLogRead() implementation did not update ThisTimeLineID (and
therefore neither the new callback WalSndSegmentOpen() does), so both
logical_read_xlog_page() and XLogSendPhysical() could read the data from
another (historic) timeline. I think we should check the segment we really
read data from:

    CheckXLogRemoved(segno, sendSeg->ws_tli);

The rebased code is attached.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

27 September 2019, 13:22:10

On 2019-Sep-26, Antonin Houska wrote:

> One comment on the remaining part of the series:
> 
> Before this refactoring, the walsender.c:XLogRead() function contained these
> lines
> 
>        /*
>         * After reading into the buffer, check that what we read was valid. We do
>         * this after reading, because even though the segment was present when we
>         * opened it, it might get recycled or removed while we read it. The
>         * read() succeeds in that case, but the data we tried to read might
>         * already have been overwritten with new WAL records.
>         */
>        XLByteToSeg(startptr, segno, segcxt->ws_segsize);
>        CheckXLogRemoved(segno, ThisTimeLineID);
> 
> but they don't fit into the new, generic implementation, so I copied these
> lines to the two places right after the call of the new XLogRead(). However I
> was not sure if ThisTimeLineID was ever correct here. It seems the original
> walsender.c:XLogRead() implementation did not update ThisTimeLineID (and
> therefore neither the new callback WalSndSegmentOpen() does), so both
> logical_read_xlog_page() and XLogSendPhysical() could read the data from
> another (historic) timeline. I think we should check the segment we really
> read data from:
> 
>     CheckXLogRemoved(segno, sendSeg->ws_tli);

Hmm, okay.  I hope we can get rid of ThisTimeLineID one day.

You placed the errinfo in XLogRead's stack rather than its callers' ...
I don't think that works, because as soon as XLogRead returns that
memory is no longer guaranteed to exist.  You need to allocate the
struct in the callers stacks and pass its address to XLogRead.  XLogRead
can return NULL if everything's okay or the pointer to the errinfo
struct.

I've been wondering if it's really necessary to pass 'seg' to the
openSegment() callback.  Only walsender wants that, and it seems ...
weird.  Maybe that's not something for this patch series to fix, but it
would be good to find a more decent way to do the TLI switch at some
point.

> +            /*
> +             * If the function is called by the XLOG reader, the reader will
> +             * eventually set both "ws_segno" and "ws_off", however the XLOG
> +             * reader is not necessarily involved. Furthermore, we need to set
> +             * the current values for this function to work.
> +             */
> +            seg->ws_segno = nextSegNo;
> +            seg->ws_off = 0;

Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
make more sense to leave XLogRead be always responsible for setting
these correctly, and remove those lines from ReadPageInternal?  (BTW "is
called by the XLOG reader" is a bit strange in code that appears in
xlogreader.c).

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

27 September 2019, 17:54:25

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Sep-26, Antonin Houska wrote:

> You placed the errinfo in XLogRead's stack rather than its callers' ...
> I don't think that works, because as soon as XLogRead returns that
> memory is no longer guaranteed to exist.

I was aware of this problem, therefore I defined the field as static:

+XLogReadError *
+XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
+                WALOpenSegment *seg, WALSegmentContext *segcxt,
+                WALSegmentOpen openSegment)
+{
+       char       *p;
+       XLogRecPtr      recptr;
+       Size            nbytes;
+       static XLogReadError errinfo;

> You need to allocate the struct in the callers stacks and pass its address
> to XLogRead.  XLogRead can return NULL if everything's okay or the pointer
> to the errinfo struct.

I didn't choose this approach because that would add one more argument to the
function.

> I've been wondering if it's really necessary to pass 'seg' to the
> openSegment() callback.  Only walsender wants that, and it seems ...
> weird.  Maybe that's not something for this patch series to fix, but it
> would be good to find a more decent way to do the TLI switch at some
> point.

Good point. Since walsender.c already has the "sendSeg" global variable, maybe
we can let WalSndSegmentOpen() use this one, and remove the "seg" argument
from the callback.

> > +            /*
> > +             * If the function is called by the XLOG reader, the reader will
> > +             * eventually set both "ws_segno" and "ws_off", however the XLOG
> > +             * reader is not necessarily involved. Furthermore, we need to set
> > +             * the current values for this function to work.
> > +             */
> > +            seg->ws_segno = nextSegNo;
> > +            seg->ws_off = 0;
> 
> Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> make more sense to leave XLogRead be always responsible for setting
> these correctly, and remove those lines from ReadPageInternal?

I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
what you suggest, we need make this responsibility documented. I'll consider
that.

> (BTW "is called by the XLOG reader" is a bit strange in code that appears in
> xlogreader.c).

ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
we'll eventually need this phrase in the comment at all.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

27 September 2019, 19:17:36

On 2019-Sep-27, Antonin Houska wrote:

> > You placed the errinfo in XLogRead's stack rather than its callers' ...
> > I don't think that works, because as soon as XLogRead returns that
> > memory is no longer guaranteed to exist.
> 
> I was aware of this problem, therefore I defined the field as static:
> 
> +XLogReadError *
> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
> +                WALOpenSegment *seg, WALSegmentContext *segcxt,
> +                WALSegmentOpen openSegment)
> +{
> +       char       *p;
> +       XLogRecPtr      recptr;
> +       Size            nbytes;
> +       static XLogReadError errinfo;

I see.

> > You need to allocate the struct in the callers stacks and pass its address
> > to XLogRead.  XLogRead can return NULL if everything's okay or the pointer
> > to the errinfo struct.
> 
> I didn't choose this approach because that would add one more argument to the
> function.

Yeah, the signature does seem a bit unwieldy.  But I wonder if that's
too terrible a problem, considering that this code is incurring a bunch
of syscalls in the best case anyway.

BTW that tli_p business to the openSegment callback is horribly
inconsistent.  Some callers accept a NULL tli_p, others will outright
crash, even though the API docs say that the callback must determine the
timeline.  This is made more complicated by us having the TLI in "seg"
also.  Unless I misread, the problem is again that the walsender code is
doing nasty stuff with globals (endSegNo).  As a very minor stylistic
point, we prefer to have out params at the end of the signature.

> > Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> > make more sense to leave XLogRead be always responsible for setting
> > these correctly, and remove those lines from ReadPageInternal?
> 
> I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
> what you suggest, we need make this responsibility documented. I'll consider
> that.

Hmm.  Thanks.

> > (BTW "is called by the XLOG reader" is a bit strange in code that appears in
> > xlogreader.c).
> 
> ok, "called by XLogPageReadCB callback" would be more accurate. Not sure if
> we'll eventually need this phrase in the comment at all.

I think that would be slightly clearer.  But if we can force this code
into actually making sense, that would be much better.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Tom Lane

Date:

27 September 2019, 19:28:38

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2019-Sep-27, Antonin Houska wrote:
>>> You placed the errinfo in XLogRead's stack rather than its callers' ...
>>> I don't think that works, because as soon as XLogRead returns that
>>> memory is no longer guaranteed to exist.

>> I was aware of this problem, therefore I defined the field as static:
>> 
>> +XLogReadError *
>> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
>> +                WALOpenSegment *seg, WALSegmentContext *segcxt,
>> +                WALSegmentOpen openSegment)
>> +{
>> +       char       *p;
>> +       XLogRecPtr      recptr;
>> +       Size            nbytes;
>> +       static XLogReadError errinfo;

> I see.

That seems like an absolutely terrible "fix".  We don't really want
XLogRead to be defined in a way that forces it to be non-reentrant do we?

            regards, tom lane

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

28 September 2019, 08:14:25

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> BTW that tli_p business to the openSegment callback is horribly
> inconsistent.  Some callers accept a NULL tli_p, others will outright
> crash, even though the API docs say that the callback must determine the
> timeline.  This is made more complicated by us having the TLI in "seg"
> also.  Unless I misread, the problem is again that the walsender code is
> doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> point, we prefer to have out params at the end of the signature.

XLogRead() tests for NULL so it should not crash but I don't insist on doing
it this way. XLogRead() actually does not have to care whether the "open
segment callback" determines the TLI or not, so it (XLogRead) can always
receive a valid pointer to seg.ws_tli. However that in turn implies that
XLogRead() does not need the "tli" argument at all.

> > > Why do we leave this responsibility to ReadPageInternal?  Wouldn't it
> > > make more sense to leave XLogRead be always responsible for setting
> > > these correctly, and remove those lines from ReadPageInternal?
> >
> > I think there's no rule that ReadPageInternal() must use XLogRead(). If we do
> > what you suggest, we need make this responsibility documented. I'll consider
> > that.

I think now we should not add any responsibility to XLogPageReadCB or its
subroutines because some extensions might already have their implementation of
XLogPageReadCB w/o XLogRead, and this change would break them.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

28 September 2019, 08:15:56

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Sep-27, Antonin Houska wrote:
> >>> You placed the errinfo in XLogRead's stack rather than its callers' ...
> >>> I don't think that works, because as soon as XLogRead returns that
> >>> memory is no longer guaranteed to exist.
> 
> >> I was aware of this problem, therefore I defined the field as static:
> >> 
> >> +XLogReadError *
> >> +XLogRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID *tli_p,
> >> +                WALOpenSegment *seg, WALSegmentContext *segcxt,
> >> +                WALSegmentOpen openSegment)
> >> +{
> >> +       char       *p;
> >> +       XLogRecPtr      recptr;
> >> +       Size            nbytes;
> >> +       static XLogReadError errinfo;
> 
> > I see.
> 
> That seems like an absolutely terrible "fix".  We don't really want
> XLogRead to be defined in a way that forces it to be non-reentrant do we?

Good point. I forgot that the XLOG reader can be used by frontends, so thread
safety is important here.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

28 September 2019, 13:00:35

Antonin Houska <ah@cybertec.at> wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > BTW that tli_p business to the openSegment callback is horribly
> > inconsistent.  Some callers accept a NULL tli_p, others will outright
> > crash, even though the API docs say that the callback must determine the
> > timeline.  This is made more complicated by us having the TLI in "seg"
> > also.  Unless I misread, the problem is again that the walsender code is
> > doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> > point, we prefer to have out params at the end of the signature.
> 
> XLogRead() tests for NULL so it should not crash but I don't insist on doing
> it this way. XLogRead() actually does not have to care whether the "open
> segment callback" determines the TLI or not, so it (XLogRead) can always
> receive a valid pointer to seg.ws_tli.

This is actually wrong - seg.ws_tli is not always the correct value to
pass. If seg.ws_tli refers to the segment from which data was read last time,
then XLogRead() still needs a separate argument to specify from which TLI the
current call should read. If these two differ, new file needs to be opened.

The problem of walsender.c is that its implementation of XLogRead() does not
care about the TLI of the previous read. If the behavior of the new, generic
implementation should be exactly the same, we need to tell XLogRead() that in
some cases it also should not compare the current TLI to the previous
one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
earlier.

Another approach is to add a boolean argument "check_tli", but that still
forces caller to pass some (random) value of the tli. The concept of
InvalidTimeLineID seems to me less disturbing than this.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Kyotaro Horiguchi

Date:

01 October 2019, 03:22:27

Hello,

At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos>
> Antonin Houska <ah@cybertec.at> wrote:
> 
> > Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > 
> > > BTW that tli_p business to the openSegment callback is horribly
> > > inconsistent.  Some callers accept a NULL tli_p, others will outright
> > > crash, even though the API docs say that the callback must determine the
> > > timeline.  This is made more complicated by us having the TLI in "seg"
> > > also.  Unless I misread, the problem is again that the walsender code is
> > > doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> > > point, we prefer to have out params at the end of the signature.
> > 
> > XLogRead() tests for NULL so it should not crash but I don't insist on doing
> > it this way. XLogRead() actually does not have to care whether the "open
> > segment callback" determines the TLI or not, so it (XLogRead) can always
> > receive a valid pointer to seg.ws_tli.
> 
> This is actually wrong - seg.ws_tli is not always the correct value to
> pass. If seg.ws_tli refers to the segment from which data was read last time,
> then XLogRead() still needs a separate argument to specify from which TLI the
> current call should read. If these two differ, new file needs to be opened.

openSegment represents the file *currently* opened. XLogRead
needs the TLI *to be* opened. If they are different, as far as
wal logical wal sender and pg_waldump is concerned, XLogRead
switches to the new TLI and the new TLI is set to
openSegment.ws_tli.  So, it seems to me that the parameter
doesn't need to be inout? It is enough that it is an "in"
parameter.

> The problem of walsender.c is that its implementation of XLogRead() does not
> care about the TLI of the previous read. If the behavior of the new, generic
> implementation should be exactly the same, we need to tell XLogRead() that in
> some cases it also should not compare the current TLI to the previous
> one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
> earlier.

Physical wal sender doesn't switch TLI. So I don't think the
behavior doesn't harm (or doesn't fire). openSegment holds the
TLI set at the first call. (Even if future wal sender switches
TLI, the behavior should be needed.)

> Another approach is to add a boolean argument "check_tli", but that still
> forces caller to pass some (random) value of the tli. The concept of
> InvalidTimeLineID seems to me less disturbing than this.

So I think InvalidTimeLineID is not needed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

01 October 2019, 06:28:03

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Sat, 28 Sep 2019 15:00:35 +0200, Antonin Houska <ah@cybertec.at> wrote in <9236.1569675635@antos>
> > Antonin Houska <ah@cybertec.at> wrote:
> >
> > > Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >
> > > > BTW that tli_p business to the openSegment callback is horribly
> > > > inconsistent.  Some callers accept a NULL tli_p, others will outright
> > > > crash, even though the API docs say that the callback must determine the
> > > > timeline.  This is made more complicated by us having the TLI in "seg"
> > > > also.  Unless I misread, the problem is again that the walsender code is
> > > > doing nasty stuff with globals (endSegNo).  As a very minor stylistic
> > > > point, we prefer to have out params at the end of the signature.
> > >
> > > XLogRead() tests for NULL so it should not crash but I don't insist on doing
> > > it this way. XLogRead() actually does not have to care whether the "open
> > > segment callback" determines the TLI or not, so it (XLogRead) can always
> > > receive a valid pointer to seg.ws_tli.
> >
> > This is actually wrong - seg.ws_tli is not always the correct value to
> > pass. If seg.ws_tli refers to the segment from which data was read last time,
> > then XLogRead() still needs a separate argument to specify from which TLI the
> > current call should read. If these two differ, new file needs to be opened.
>
> openSegment represents the file *currently* opened.

I suppose you mean the "seg" argument.

> XLogRead needs the TLI *to be* opened. If they are different, as far as wal
> logical wal sender and pg_waldump is concerned, XLogRead switches to the new
> TLI and the new TLI is set to openSegment.ws_tli.

Yes, it works in these cases.

> So, it seems to me that the parameter doesn't need to be inout? It is enough
> that it is an "in" parameter.

I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch
version. The reason I used pointer was the special meaning of the NULL value:
if NULL is passed, then the timeline should be ignored (because of the other
cases, see below).

> > The problem of walsender.c is that its implementation of XLogRead() does not
> > care about the TLI of the previous read. If the behavior of the new, generic
> > implementation should be exactly the same, we need to tell XLogRead() that in
> > some cases it also should not compare the current TLI to the previous
> > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
> > earlier.
>
> Physical wal sender doesn't switch TLI. So I don't think the
> behavior doesn't harm (or doesn't fire). openSegment holds the
> TLI set at the first call. (Even if future wal sender switches
> TLI, the behavior should be needed.)

Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
introduced by the patch does have one. What should be passed for TLI to the
new implementation if it's called from walsender.c? If the check for a segment
change looks like this (here "tli" is the argument representing the desired
TLI)

    if (seg->ws_file < 0 ||
        !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
        tli != seg->ws_tli)
    {
        XLogSegNo    nextSegNo;

        /* Switch to another logfile segment */
        if (seg->ws_file >= 0)
            close(seg->ws_file);

then any valid TLI can result in accidental closing of the current segment
file. Since this is only refactoring patch, we should not allow such a change
of behavior even if it seems that the same segment will be reopened
immediately.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Kyotaro Horiguchi

Date:

01 October 2019, 08:48:22

At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos>
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > > > XLogRead() tests for NULL so it should not crash but I don't insist on doing
> > > > it this way. XLogRead() actually does not have to care whether the "open
> > > > segment callback" determines the TLI or not, so it (XLogRead) can always
> > > > receive a valid pointer to seg.ws_tli.
> > > 
> > > This is actually wrong - seg.ws_tli is not always the correct value to
> > > pass. If seg.ws_tli refers to the segment from which data was read last time,
> > > then XLogRead() still needs a separate argument to specify from which TLI the
> > > current call should read. If these two differ, new file needs to be opened.
> > 
> > openSegment represents the file *currently* opened.
> 
> I suppose you mean the "seg" argument.
> 
> > XLogRead needs the TLI *to be* opened. If they are different, as far as wal
> > logical wal sender and pg_waldump is concerned, XLogRead switches to the new
> > TLI and the new TLI is set to openSegment.ws_tli.
> 
> Yes, it works in these cases.
> 
> > So, it seems to me that the parameter doesn't need to be inout? It is enough
> > that it is an "in" parameter.
> 
> I did consider "TimeLineID *tli_p" to be "in" parameter in the last patch
> version. The reason I used pointer was the special meaning of the NULL value:
> if NULL is passed, then the timeline should be ignored (because of the other
> cases, see below).

Understood.

> > > The problem of walsender.c is that its implementation of XLogRead() does not
> > > care about the TLI of the previous read. If the behavior of the new, generic
> > > implementation should be exactly the same, we need to tell XLogRead() that in
> > > some cases it also should not compare the current TLI to the previous
> > > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
> > > earlier.
> > 
> > Physical wal sender doesn't switch TLI. So I don't think the
> > behavior doesn't harm (or doesn't fire). openSegment holds the
> > TLI set at the first call. (Even if future wal sender switches
> > TLI, the behavior should be needed.)
> 
> Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
> introduced by the patch does have one. What should be passed for TLI to the
> new implementation if it's called from walsender.c? If the check for a segment
> change looks like this (here "tli" is the argument representing the desired
> TLI)

TLI is mandatory to generate a wal file name so it must be passed
to the function anyways. In the current code it is sendTimeLine
for the walsender.c:XLogRead(). logical_read_xlog_page sets the
variable very time immediately before calling
XLogRead(). CreateReplicationSlot and StartReplication set the
variable to desired TLI immediately before calling and once it is
set by StartReplication, it is not changed by XLogSendPhysical
and wal sender ends at the end of the current timeline. In the
XLogRead, the value is copied to sendSeg->ws_tli when the file
for the new timeline is read.

>     if (seg->ws_file < 0 ||
>         !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
>         tli != seg->ws_tli)
>     {
>         XLogSegNo    nextSegNo;
> 
>         /* Switch to another logfile segment */
>         if (seg->ws_file >= 0)
>             close(seg->ws_file);
> 
> then any valid TLI can result in accidental closing of the current segment
> file. Since this is only refactoring patch, we should not allow such a change
> of behavior even if it seems that the same segment will be reopened
> immediately.

Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true
but seg->ws_file < 0 is also always true at the time. In other
words, the "tli != seg->ws_tli" is not even evaluated.

If wal sender had an open file (ws_file >= 0) and the new TLI is
different from ws_tli, it would be the sign of a serious bug.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

02 October 2019, 07:16:10

Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> At Tue, 01 Oct 2019 08:28:03 +0200, Antonin Houska <ah@cybertec.at> wrote in <2188.1569911283@antos>
> > Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

> > > > The problem of walsender.c is that its implementation of XLogRead() does not
> > > > care about the TLI of the previous read. If the behavior of the new, generic
> > > > implementation should be exactly the same, we need to tell XLogRead() that in
> > > > some cases it also should not compare the current TLI to the previous
> > > > one. That's why I tried to use the NULL pointer, or the InvalidTimeLineID
> > > > earlier.
> > >
> > > Physical wal sender doesn't switch TLI. So I don't think the
> > > behavior doesn't harm (or doesn't fire). openSegment holds the
> > > TLI set at the first call. (Even if future wal sender switches
> > > TLI, the behavior should be needed.)
> >
> > Note that walsender.c:XLogRead() has no TLI argument, however the XLogRead()
> > introduced by the patch does have one. What should be passed for TLI to the
> > new implementation if it's called from walsender.c? I f the check for a segment
> > change looks like this (here "tli" is the argument representing the desired
> > TLI)
>
> TLI is mandatory to generate a wal file name so it must be passed
> to the function anyways. In the current code it is sendTimeLine
> for the walsender.c:XLogRead(). logical_read_xlog_page sets the
> variable very time immediately before calling
> XLogRead(). CreateReplicationSlot and StartReplication set the
> variable to desired TLI immediately before calling and once it is
> set by StartReplication, it is not changed by XLogSendPhysical
> and wal sender ends at the end of the current timeline. In the
> XLogRead, the value is copied to sendSeg->ws_tli when the file
> for the new timeline is read.

Are you saying that we should pass sendTimeLine to XLogRead()? I think it's
not always correct because sendSeg->ws_tli is sometimes assigned
sendTimeLineNextTLI, so the test "tli != seg->ws_tli" in

> >     if (seg->ws_file < 0 ||
> >         !XLByteInSeg(recptr, seg->ws_segno, segcxt->ws_segsize) ||
> >         tli != seg->ws_tli)
> >     {
> >         XLogSegNo    nextSegNo;

could pass occasionally.

> Mmm. ws_file must be -1 in the case? tli != seg->ws_tli is true
> but seg->ws_file < 0 is also always true at the time. In other
> words, the "tli != seg->ws_tli" is not even evaluated.
>
> If wal sender had an open file (ws_file >= 0) and the new TLI is
> different from ws_tli, it would be the sign of a serious bug.

So we can probably pass ws_tli as the "new TLI" when calling the new
XLogRead() from walsender.c. Is that what you try to say? I need to think
about it more but it sounds like a good idea.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

04 October 2019, 10:11:11

This is the next version.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

07 November 2019, 04:48:41

On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
> This is the next version.

So...  These are the two last bits to look at, reducing a bit the code
size:
5 files changed, 396 insertions(+), 419 deletions(-)

And what this patch set does is to refactor the routines we use now in
xlogreader.c to read a page by having new callbacks to open a segment,
as that's basically the only difference between the context of a WAL
sender, pg_waldump and recovery.

Here are some comments reading through the code.

+    * Note that XLogRead(), if used, should have updated the "seg" too for
+    * its own reasons, however we cannot rely on ->read_page() to call
+    * XLogRead().
Why?

Your patch removes all the three optional lseek() calls which can
happen in a segment.  Am I missing something but isn't that plain
wrong?  You could reuse the error context for that as well if an error
happens as what's needed is basically the segment name and the LSN
offset.

All the callers of XLogReadProcessError() are in src/backend/, so it
seems to me that there is no point to keep that in xlogreader.c but it
should be instead in xlogutils.c, no?  It seems to me that this is
more like XLogGenerateError, or just XLogError().  We have been using
xlog as an acronym in many places of the code, so switching now to wal
just for the past matter of the pg_xlog -> pg_wal switch does not seem
worth bothering.

+read_local_xlog_page_segment_open(XLogSegNo nextSegNo,
+                                 WALSegmentContext *segcxt,
Could you think about a more simple name here?  It is a callback to
open a new segment, so it seems to me that we could call it just
open_segment_callback().  There is also no point in using a pointer to
the TLI, no?

+ * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
+ * starting at location 'startptr'. 'seg' is the last segment used,
+ * 'openSegment' is a callback to opens the next segment if needed and
+ * 'segcxt' is additional segment info that does not fit into 'seg'.
A typo and the last part of the last sentence could be better worded.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

11 November 2019, 15:25:56

Michael Paquier <michael@paquier.xyz> wrote:

> On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
> > This is the next version.
> 
> So...  These are the two last bits to look at, reducing a bit the code
> size:
> 5 files changed, 396 insertions(+), 419 deletions(-)
> 
> And what this patch set does is to refactor the routines we use now in
> xlogreader.c to read a page by having new callbacks to open a segment,
> as that's basically the only difference between the context of a WAL
> sender, pg_waldump and recovery.
> 
> Here are some comments reading through the code.
> 
> +    * Note that XLogRead(), if used, should have updated the "seg" too for
> +    * its own reasons, however we cannot rely on ->read_page() to call
> +    * XLogRead().
> Why?

I've updated the comment:

+       /*
+        * Update read state information.
+        *
+        * If XLogRead() is was called by ->read_page, it should have updated the
+        * ->seg fields accordingly (since we never request more than a single
+        * page, neither ws_segno nor ws_off should have advanced beyond
+        * targetSegNo and targetPageOff respectively). However it's not mandatory
+        * for ->read_page to call XLogRead().
+        */

Besides what I say here, I'm not sure if we should impose additional
requirement on the existing callbacks (possibly those in extensions) to update
the XLogReaderState.seg structure.

> Your patch removes all the three optional lseek() calls which can
> happen in a segment.  Am I missing something but isn't that plain
> wrong?  You could reuse the error context for that as well if an error
> happens as what's needed is basically the segment name and the LSN
> offset.

Explicit call of lseek() is not used because XLogRead() uses pg_pread()
now. Nevertheless I found out that in the the last version of the patch I set
ws_off to 0 for a newly opened segment. This was wrong, fixed now.

> All the callers of XLogReadProcessError() are in src/backend/, so it
> seems to me that there is no point to keep that in xlogreader.c but it
> should be instead in xlogutils.c, no?  It seems to me that this is
> more like XLogGenerateError, or just XLogError().  We have been using
> xlog as an acronym in many places of the code, so switching now to wal
> just for the past matter of the pg_xlog -> pg_wal switch does not seem
> worth bothering.

ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the
"Read" word should be there because many other error can happen during XLOG
processing.

> +read_local_xlog_page_segment_open(XLogSegNo nextSegNo,
> +                                 WALSegmentContext *segcxt,
> Could you think about a more simple name here?  It is a callback to
> open a new segment, so it seems to me that we could call it just
> open_segment_callback().

ok, the function is not exported to other modules, so there's no need to care
about uniqueness of the name. I chose wal_segment_open(), according to the
callback type WALSegmentOpen.

> There is also no point in using a pointer to the TLI, no?

This particular callback makes no decision about the TLI, so it only uses
tli_p as an input argument.

> + * Read 'count' bytes from WAL fetched from timeline 'tli' into 'buf',
> + * starting at location 'startptr'. 'seg' is the last segment used,
> + * 'openSegment' is a callback to opens the next segment if needed and
> + * 'segcxt' is additional segment info that does not fit into 'seg'.
> A typo and the last part of the last sentence could be better worded.

ok, adjusted a bit.

Thanks for review.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

12 November 2019, 04:31:34

On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote:
>> On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
> +       /*
> +        * Update read state information.
> +        *
> +        * If XLogRead() is was called by ->read_page, it should have updated the
> +        * ->seg fields accordingly (since we never request more than a single
> +        * page, neither ws_segno nor ws_off should have advanced beyond
> +        * targetSegNo and targetPageOff respectively). However it's not mandatory
> +        * for ->read_page to call XLogRead().
> +        */
>
> Besides what I say here, I'm not sure if we should impose additional
> requirement on the existing callbacks (possibly those in extensions) to update
> the XLogReaderState.seg structure.

"is was called" does not make sense in this sentence.  Actually, I
would tend to just remove it completely.

>> Your patch removes all the three optional lseek() calls which can
>> happen in a segment.  Am I missing something but isn't that plain
>> wrong?  You could reuse the error context for that as well if an error
>> happens as what's needed is basically the segment name and the LSN
>> offset.
>
> Explicit call of lseek() is not used because XLogRead() uses pg_pread()
> now. Nevertheless I found out that in the the last version of the patch I set
> ws_off to 0 for a newly opened segment. This was wrong, fixed now.

Missed that part, thanks.  This was actually not obvious after an
initial lookup of the patch.  Wouldn't it make sense to split that
part in a separate patch that we could review and get committed first
then?  It would have the advantage to make the rest easier to review
and follow.  And using pread is actually better for performance
compared to read+lseek.  Now there is also the argument that we don't
always seek into an opened WAL segment, and that a plain read() is
actually better than pread() in some cases.

> ok, moved to xlogutils.c and renamed to XLogReadRaiseError(). I think the
> "Read" word should be there because many other error can happen during XLOG
> processing.

No issue with this name.

> ok, the function is not exported to other modules, so there's no need to care
> about uniqueness of the name. I chose wal_segment_open(), according to the
> callback type WALSegmentOpen.

Name is fine by me.

>> There is also no point in using a pointer to the TLI, no?
>
> This particular callback makes no decision about the TLI, so it only uses
> tli_p as an input argument.

Missed that walsender.c can enforce the tli to a new value, objection
withdrawn.

+ * BasicOpenFile() is the preferred way to open the segment file in backend
+ * code, whereas open(2) should be used in frontend.
I would remove that sentence.

+#ifndef FRONTEND
+/*
+ * Backend-specific convenience code to handle read errors encountered by
+ * XLogRead().
+ */
+void
+XLogReadRaiseError(XLogReadError *errinfo)
No need for the FRONTEND ifndef's here as xlogutils.c is backend-only.

+#ifndef FRONTEND
+void       XLogReadRaiseError(XLogReadError *errinfo);
+#endif
Same as above, and missing an extern declaration.

+           fatal_error("could not read from log file %s, offset %u, length %zu: %s",
+                       fname, seg->ws_off, (Size) errinfo.reqbytes,
+                       strerror(errinfo.read_errno));
Let's use this occasion to make those error messages more generic to
reduce the pain of translators as the file name lets us know that we
have to deal with a WAL segment.  Here are some suggestions, taking
into account the offset:
- If errno is set: "could not read file \"%s\" at offset %u: %m"
- For partial read: "could not read file \"%s\" at offset %u: read %d of %zu"
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

12 November 2019, 11:06:33

Michael Paquier <michael@paquier.xyz> wrote:

> On Mon, Nov 11, 2019 at 04:25:56PM +0100, Antonin Houska wrote:
> >> On Fri, Oct 04, 2019 at 12:11:11PM +0200, Antonin Houska wrote:
> >> Your patch removes all the three optional lseek() calls which can
> >> happen in a segment.  Am I missing something but isn't that plain
> >> wrong?  You could reuse the error context for that as well if an error
> >> happens as what's needed is basically the segment name and the LSN
> >> offset.
> > 
> > Explicit call of lseek() is not used because XLogRead() uses pg_pread()
> > now. Nevertheless I found out that in the the last version of the patch I set
> > ws_off to 0 for a newly opened segment. This was wrong, fixed now.
> 
> Missed that part, thanks.  This was actually not obvious after an
> initial lookup of the patch.  Wouldn't it make sense to split that
> part in a separate patch that we could review and get committed first
> then?  It would have the advantage to make the rest easier to review
> and follow.  And using pread is actually better for performance
> compared to read+lseek.  Now there is also the argument that we don't
> always seek into an opened WAL segment, and that a plain read() is
> actually better than pread() in some cases.

ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
read sequentially (i.e. without frequent seeks) is the reason pread() has't
been adopted so far.

The new version reflects your other suggestions too, except the one about not
renaming "XLOG" -> "WAL" (actually you mentioned that earlier in the
thread). I recall that when working on the preliminary patch (709d003fbd),
Alvaro suggested "WAL" for some structures because these are new. The rule
seemed to be that "XLOG..." should be left for the existing symbols, while the
new ones should be "WAL...":

https://www.postgresql.org/message-id/20190917221521.GA15733%40alvherre.pgsql

So I decided to rename the new symbols and to remove the related comment.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

15 November 2019, 21:41:02

On 2019-Nov-12, Antonin Houska wrote:

> ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
> read sequentially (i.e. without frequent seeks) is the reason pread() has't
> been adopted so far.

I don't quite understand why you backed off from switching to pread.  It
seemed a good change to me.

Here's a few edits on top of your latest.

The new routine WALRead() is not at all the same as the previous
XLogRead, so I don't see why we would keep the name.  Hence renamed.

I see no reason for the openSegment callback to return the FD in an out
param instead of straight return value.  Changed that way.

Having seek/open be a boolean "xlr_seek" seems a bit weird.  Changed to
an "operation" enum.  (Maybe if we go back to pg_pread we can get rid of
this.)  Accordingly, change WALReadRaiseError and WALDumpReadPage.

Change xlr_seg to be a struct rather than pointer to struct.  It seems a
bit dangerous to me to return a pointer that we don't know is going to
be valid at raise-error time.  Struct assignment works fine for the
purpose.

Renamed XLogDumpReadPage to WALDumpReadPage, because what the heck is
XLogDump anyway?  That doesn't exist.

I would only like to switch this back to pg_pread() (from seek/read) and
I'd be happy to commit this.

What is logical_read_local_xlog_page all about?  Seems useless.  Let's
get rid of it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

15 November 2019, 21:41:45

Attachment.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

xlogread-extra.patch

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

17 November 2019, 04:22:21

BTW ... contrib/test_decoding fails with this patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

18 November 2019, 12:29:03

On Fri, Nov 15, 2019 at 06:41:02PM -0300, Alvaro Herrera wrote:
> I don't quite understand why you backed off from switching to pread.  It
> seemed a good change to me.
>
> [...]
>
> Having seek/open be a boolean "xlr_seek" seems a bit weird.  Changed to
> an "operation" enum.  (Maybe if we go back to pg_pread we can get rid of
> this.)  Accordingly, change WALReadRaiseError and WALDumpReadPage.

This has been quickly mentioned on the thread which has introduced
pread():
https://www.postgresql.org/message-id/c2f56d0a-cadd-3df1-ae48-b84dc8128c37@redhat.com

Now, read() > pread() > read()+lseek(), and we don't actually need to
seek into the file for all the cases where we read a WAL page.  And on
a platform which uses the fallback implementation, this increases the
number of lseek() calls.  I can see as you say that using it directly
in the refactoring can simplify the code.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

20 November 2019, 08:38:02

On Mon, Nov 18, 2019 at 09:29:03PM +0900, Michael Paquier wrote:
> Now, read() > pread() > read()+lseek(), and we don't actually need to
> seek into the file for all the cases where we read a WAL page.  And on
> a platform which uses the fallback implementation, this increases the
> number of lseek() calls.  I can see as you say that using it directly
> in the refactoring can simplify the code.

Putting this point aside, here is the error coming from
contrib/test_decoding/, and this is independent of Alvaro's changes:
+ERROR:  invalid magic number 0000 in log segment
000000010000000000000001, offset 6905856

I don't think that this is just xlp_magic messed up, the full page
read is full of zeros.  But that's just a guess.

Looking at the code, I am spotting one inconsistency in the way
seg->ws_off is compiled after doing the read on the new version
compared to the three others.  read() would move the offset of the
file, but the code is forgetting to increment it by a amount of
readbytes.  Isn't that incorrect?

A second thing is that wal_segment_open() definition is incorrect in
xlogutils.c, generating a warning.  The opened fd is the returned
result, and not an argument of the routine.

I am switching the patch as waiting on author.  Antonin, could you
look at those problems?
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

20 November 2019, 14:50:29

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Nov-12, Antonin Houska wrote:
> 
> > ok, the next version uses explicit lseek(). Maybe the fact that XLOG is mostly
> > read sequentially (i.e. without frequent seeks) is the reason pread() has't
> > been adopted so far.
> 
> I don't quite understand why you backed off from switching to pread.  It
> seemed a good change to me.

I agreed with Michael that it makes comparison of the old and new code more
difficult, and I also thought that his arguments about performance might be
worthwhile because WAL reading is mostly sequential and does not require many
seeks. However things appear to be more complex, see below.

> Here's a few edits on top of your latest.

> ...

I agree with your renamings.

> Change xlr_seg to be a struct rather than pointer to struct.  It seems a
> bit dangerous to me to return a pointer that we don't know is going to
> be valid at raise-error time.  Struct assignment works fine for the
> purpose.

ok

> I would only like to switch this back to pg_pread() (from seek/read) and
> I'd be happy to commit this.

I realized that, starting from commit 709d003fbd98b975a4fbcb4c5750fa6efaf9ad87
we use the WALOpenSegment.ws_off field incorrectly in
walsender.c:XLogRead(). In that commit we used this field to replace
XLogReaderState.readOff:

@@ -156,10 +165,9 @@ struct XLogReaderState
        char       *readBuf;
        uint32          readLen;

-       /* last read segment, segment offset, TLI for data currently in readBuf */
-       XLogSegNo       readSegNo;
-       uint32          readOff;
-       TimeLineID      readPageTLI;
+       /* last read XLOG position for data currently in readBuf */
+       WALSegmentContext segcxt;
+       WALOpenSegment seg;

        /*
         * beginning of prior page read, and its TLI.  Doesn't necessarily

Thus we cannot use it in XLogRead() to track the current position in the
segment file. Although walsender.c:XLogRead() misses this point, it's not
broken because walsender.c does not use XLogReaderState at all.

So if explicit lseek() should be used, another field should be added to
WALOpenSegment. I failed to do so when removing the pg_pread() call from the
patch, and that was the reason for the problem reported here:

https://www.postgresql.org/message-id/20191117042221.GA16537%40alvherre.pgsql
https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz

Thus the use of pg_pread() makes the code quite a bit simpler, so I
re-introduced it. If you decide that an explicit lseek() should be used yet,
just let me know.

> What is logical_read_local_xlog_page all about?  Seems useless.  Let's
> get rid of it.

It seems so. Should I post a patch for that?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

21 November 2019, 08:05:50

On Wed, Nov 20, 2019 at 03:50:29PM +0100, Antonin Houska wrote:
> Thus the use of pg_pread() makes the code quite a bit simpler, so I
> re-introduced it. If you decide that an explicit lseek() should be used yet,
> just let me know.

Skimming through the code, it looks like in a good state.  The
failures of test_deconding are fixed, and all the changes from Alvaro
have been added.

+           fatal_error("could not read in file %s, offset %u, length %zu: %s",
+                       fname, seg->ws_off, (Size) errinfo.wre_req,
+                       strerror(errinfo.wre_errno));
You should be able to use %m here instead of strerror().

It seems to me that it is always important to not do changes
completely blindly either so as this does not become an issue for
recovery later on.  FWIW, I ran a small set of tests with a WAL
segment sizes of 1MB and 1GB (fsync = off, max_wal_size/min_wal_size
set very high, 1 billion rows in single-column table followed by a
series of updates):
- Created a primary and a standby which archive_mode set.
- Stopped the standby.
- Produced close to 12GB worth of WAL.
- Restarted the standby with restore_command and compared the time it
takes for recovery to complete all the segments with HEAD and your
refactoring:
1GB + HEAD: 7min52s
1GB + patch: 8min10s
1MB + HEAD: 10min17s
1MB + patch: 12min1s

And with WAL segments at 1MB, I was seeing quite a slowdown with the
patch...  Then I have done an extra test with pg_waldump with the
segments generated previously with the output redirected to /dev/null.
Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
and 15.851s with the patch.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

22 November 2019, 00:49:03

On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:
> And with WAL segments at 1MB, I was seeing quite a slowdown with the
> patch...  Then I have done an extra test with pg_waldump with the
> segments generated previously with the output redirected to /dev/null.
> Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
> and 15.851s with the patch.

Here are more tests with pg_waldump and 1MB/1GB segment sizes with
records generated from pgbench, (7 runs, eliminated the two highest
and two lowest, these are the remaining 3 runs as real time):
1) 1MB segment size, 512 segments:
time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null
- HEAD: 0m4.512s, 0m4.446s, 0m4.501s
- Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
- Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
2) 1GB segment size, 3 segments:
time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null
- HEAD: 0m11.802s, 0m11.834s, 0m11.846s
- Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
- Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
So there is a tendency for a small slowdown here.  Still it is not
that much, so I withdraw my concerns.

Another thing:
+void       WALReadRaiseError(WALReadError *errinfo);
This is missing an "extern" declaration.

Alvaro, you are marked as a committer of this CF entry.  Are you
planning to look at it again?  Sorry for the delay from my side.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

22 November 2019, 03:49:33

On 2019-Nov-22, Michael Paquier wrote:

> Alvaro, you are marked as a committer of this CF entry.  Are you
> planning to look at it again?  Sorry for the delay from my side.

Yes :-)  hopefully next week.  Thanks for reviewing.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

22 November 2019, 07:11:23

On Fri, Nov 22, 2019 at 12:49:33AM -0300, Alvaro Herrera wrote:
> Yes :-)  hopefully next week.  Thanks for reviewing.

Thanks, I am switching the entry as ready for committer then.  Please
note that the latest patch series have a conflict at the top of
walsender.c easy enough to resolve, and that the function declaration
in xlogutils.h misses an "extern".  I personally find unnecessary the
last sentence in the new comment block of xlogreader.h to describe the
new callback to open a segment about BasicOpenFile() and open()
because one could also use a transient file opened in the backend, but
I'll be fine with anything you think is most fit.  That's a minor
point.

Thanks Antonin for doing the refactoring effort.
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

22 November 2019, 07:28:32

Michael Paquier <michael@paquier.xyz> wrote:

> On Thu, Nov 21, 2019 at 05:05:50PM +0900, Michael Paquier wrote:
> > And with WAL segments at 1MB, I was seeing quite a slowdown with the
> > patch...  Then I have done an extra test with pg_waldump with the
> > segments generated previously with the output redirected to /dev/null.
> > Going through 512 segments takes 15.730s with HEAD (average of 3 runs)
> > and 15.851s with the patch.
>
> Here are more tests with pg_waldump and 1MB/1GB segment sizes with
> records generated from pgbench, (7 runs, eliminated the two highest
> and two lowest, these are the remaining 3 runs as real time):
> 1) 1MB segment size, 512 segments:
> time pg_waldump 000000010000000100000C00 000000010000000100000F00 > /dev/null
> - HEAD: 0m4.512s, 0m4.446s, 0m4.501s
> - Patch + system's pg_read: 0m4.495s, 0m4.502s, 0m4.486s
> - Patch + fallback pg_read: 0m4.505s, 0m4.527s, 0m4.495s
> 2) 1GB segment size, 3 segments:
> time pg_waldump 000000010000000200000001 000000010000000200000003 > /dev/null
> - HEAD: 0m11.802s, 0m11.834s, 0m11.846s
> - Patch + system's pg_read: 0m11.939s, 0m11.991s, 0m11.966s
> - Patch + fallback pg_read: 0m12.054s, 0m12.066s, 0m12.159s
> So there is a tendency for a small slowdown here.  Still it is not
> that much, so I withdraw my concerns.

Thanks for the testing!

I thought that in [1] you try discourage me from using pg_pread(), but now it
seems to be the opposite. Ideally I'd like to see no overhead added by my
patch at all, but the code simplicity should matter too.

As a clue, we can perhaps consider the fact that commit c24dcd0c removed
explicit lseek() also from XLogWrite(), but I'm not sure how much we can
compare XLOG writing and reading (I'd expect writing to be a bit less
sequential than reading because XLogWrite() may need to write the last page
more than once.)

Let's wait for Alvaro's judgement.

> Another thing:
> +void       WALReadRaiseError(WALReadError *errinfo);
> This is missing an "extern" declaration.

I'll fix this as well as the other problem reported in [1] as soon as I know
whether pg_pread() should be used or not.

[1] https://www.postgresql.org/message-id/20191121080550.GG153437%40paquier.xyz

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

22 November 2019, 13:35:51

On 2019-Nov-22, Antonin Houska wrote:

> I thought that in [1] you try discourage me from using pg_pread(), but now it
> seems to be the opposite. Ideally I'd like to see no overhead added by my
> patch at all, but the code simplicity should matter too.

FWIW I think the new code is buggy because it doesn't seem to be setting
ws_off, so I suppose the optimization in ReadPageInternal to skip
reading the page when it's already the page we have is not hit, except
for the first page in the segment.  I didn't verify this, just my
impression while reading the code.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

22 November 2019, 13:40:07

On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:
> FWIW I think the new code is buggy because it doesn't seem to be setting
> ws_off, so I suppose the optimization in ReadPageInternal to skip
> reading the page when it's already the page we have is not hit, except
> for the first page in the segment.  I didn't verify this, just my
> impression while reading the code.

FWIW, this matches with my impression here, third paragraph:
https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

22 November 2019, 14:25:28

On 2019-Nov-22, Michael Paquier wrote:

> On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:
> > FWIW I think the new code is buggy because it doesn't seem to be setting
> > ws_off, so I suppose the optimization in ReadPageInternal to skip
> > reading the page when it's already the page we have is not hit, except
> > for the first page in the segment.  I didn't verify this, just my
> > impression while reading the code.
> 
> FWIW, this matches with my impression here, third paragraph:
> https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz

Ah, right.

I was wondering if we shouldn't do away with the concept of "offset" as
such, since the offset there is always forcibly set to the start of a
page.  Why don't we count page numbers instead?  It seems like the
interface is confusingly generic (measure in bytes) yet not offer any
extra functionality that could not be obtained with a simpler struct
repr (measure in pages).

But then that's not something that we need to change in this patch.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

22 November 2019, 15:55:20

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Nov-22, Michael Paquier wrote:
> 
> > On Fri, Nov 22, 2019 at 10:35:51AM -0300, Alvaro Herrera wrote:
> > > FWIW I think the new code is buggy because it doesn't seem to be setting
> > > ws_off, so I suppose the optimization in ReadPageInternal to skip
> > > reading the page when it's already the page we have is not hit, except
> > > for the first page in the segment.  I didn't verify this, just my
> > > impression while reading the code.
> > 
> > FWIW, this matches with my impression here, third paragraph:
> > https://www.postgresql.org/message-id/20191120083802.GB47145@paquier.xyz
> 
> Ah, right.

As I pointed out in

https://www.postgresql.org/message-id/88183.1574261429%40antos

seg.ws_off only replaced readOff in XLogReaderState. So we should only update
ws_off where readOff was updated before commit 709d003. This does happen in
ReadPageInternal (see HEAD) and I see no reason for the final patch to update
ws_off anywhere else.

> I was wondering if we shouldn't do away with the concept of "offset" as
> such, since the offset there is always forcibly set to the start of a
> page.  Why don't we count page numbers instead?  It seems like the
> interface is confusingly generic (measure in bytes) yet not offer any
> extra functionality that could not be obtained with a simpler struct
> repr (measure in pages).

Yes, I agree that page numbers would be sufficient.

> But then that's not something that we need to change in this patch.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

22 November 2019, 22:56:32

On 2019-Nov-22, Antonin Houska wrote:

> As I pointed out in
> 
> https://www.postgresql.org/message-id/88183.1574261429%40antos
> 
> seg.ws_off only replaced readOff in XLogReaderState. So we should only update
> ws_off where readOff was updated before commit 709d003. This does happen in
> ReadPageInternal (see HEAD) and I see no reason for the final patch to update
> ws_off anywhere else.

Oh you're right.

I see no reason to leave ws_off.  We can move that to XLogReaderState; I
did that here.  We also need the offset in WALReadError, though, so I
added it there too.  Conceptually it seems clearer to me this way.

What do you think of the attached?

BTW I'm not clear what errors can pread()/pg_pread() report that do not
set errno.  I think lines 1083/1084 of WALRead are spurious now.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

v12-0001-Use-only-xlogreader.c-WALRead.patch

Re: Attempt to consolidate reading of XLOG page

From

Michael Paquier

Date:

25 November 2019, 03:30:48

On Fri, Nov 22, 2019 at 07:56:32PM -0300, Alvaro Herrera wrote:
> I see no reason to leave ws_off.  We can move that to XLogReaderState; I
> did that here.  We also need the offset in WALReadError, though, so I
> added it there too.  Conceptually it seems clearer to me this way.

Yeah, that seems cleaner.

> What do you think of the attached?

Looks rather fine to me.

> BTW I'm not clear what errors can pread()/pg_pread() report that do not
> set errno.  I think lines 1083/1084 of WALRead are spurious now.

Because we have no guarantee that errno will be cleared if you do a
partial read where errno is not set, so you may finish by reporting
the state of a previous failed read instead of the partially-failed
one depending on how WALReadError is treated?  In short, I don't see
any actual reason why it would be good to remove the reset of errno
either before the calls to pread and pwrite().
--
Michael

Attachment

signature.asc

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

25 November 2019, 09:02:00

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Nov-22, Antonin Houska wrote:
> 
> > As I pointed out in
> > 
> > https://www.postgresql.org/message-id/88183.1574261429%40antos
> > 
> > seg.ws_off only replaced readOff in XLogReaderState. So we should only update
> > ws_off where readOff was updated before commit 709d003. This does happen in
> > ReadPageInternal (see HEAD) and I see no reason for the final patch to update
> > ws_off anywhere else.
> 
> Oh you're right.
> 
> I see no reason to leave ws_off.  We can move that to XLogReaderState; I
> did that here.  We also need the offset in WALReadError, though, so I
> added it there too.  Conceptually it seems clearer to me this way.
> 
> What do you think of the attached?

It looks good to me. Attached is just a fix of a minor problem in error
reporting that Michael pointed out earlier.

> BTW I'm not clear what errors can pread()/pg_pread() report that do not
> set errno.  I think lines 1083/1084 of WALRead are spurious now.

All I can say is that the existing calls of pg_pread() do not clear errno, so
you may be right. I'd appreciate more background about the "partial read" that
Michael mentions here:

https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Attachment

wal_read_errno.patch

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

25 November 2019, 18:15:34

On 2019-Nov-25, Antonin Houska wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > I see no reason to leave ws_off.  We can move that to XLogReaderState; I
> > did that here.  We also need the offset in WALReadError, though, so I
> > added it there too.  Conceptually it seems clearer to me this way.
> > 
> > What do you think of the attached?
> 
> It looks good to me. Attached is just a fix of a minor problem in error
> reporting that Michael pointed out earlier.

Excellent, I pushed it with this change included and some other cosmetic
changes.

Now there's only XLogPageRead() ...

> > BTW I'm not clear what errors can pread()/pg_pread() report that do not
> > set errno.  I think lines 1083/1084 of WALRead are spurious now.
> 
> All I can say is that the existing calls of pg_pread() do not clear errno, so
> you may be right.

Right ... in this interface, we only report an error if pg_pread()
returns negative, which is documented to always set errno.

> I'd appreciate more background about the "partial read" that
> Michael mentions here:
> 
> https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz

In the current implementation, if pg_pread() does a partial read, we
just loop one more time.

I considered changing the "if (readbytes <= 0)" with "if (readbytes <
segbytes)", but that seemed pointless.

However, writing this now makes me think that we should add a
CHECK_FOR_INTERRUPTS in this loop.  (I also wonder if we shouldn't limit
the number of times we retry if pg_pread returns zero (i.e. no error,
but no bytes read either).  I don't know if this is a real-world
consideration.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Attempt to consolidate reading of XLOG page

From

Antonin Houska

Date:

26 November 2019, 10:40:17

Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2019-Nov-25, Antonin Houska wrote:
> 
> > Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> 
> > > I see no reason to leave ws_off.  We can move that to XLogReaderState; I
> > > did that here.  We also need the offset in WALReadError, though, so I
> > > added it there too.  Conceptually it seems clearer to me this way.
> > > 
> > > What do you think of the attached?
> > 
> > It looks good to me. Attached is just a fix of a minor problem in error
> > reporting that Michael pointed out earlier.
> 
> Excellent, I pushed it with this change included and some other cosmetic
> changes.

Thanks!

> Now there's only XLogPageRead() ...

Hm, this seems rather specific, not sure it's worth trying to use WALRead()
here. Anyway, I notice that it uses pg_read() too.

> > I'd appreciate more background about the "partial read" that
> > Michael mentions here:
> > 
> > https://www.postgresql.org/message-id/20191125033048.GG37821%40paquier.xyz
> 
> In the current implementation, if pg_pread() does a partial read, we
> just loop one more time.
> 
> I considered changing the "if (readbytes <= 0)" with "if (readbytes <
> segbytes)", but that seemed pointless.

In the pread() documentation I see "Upon reading end-of-file, zero is
returned."  but that does not tell whether zero can be returned without
reaching EOF. However XLogPageRead() handles zero as an error, so WALRead() is
consistent with that.

> However, writing this now makes me think that we should add a
> CHECK_FOR_INTERRUPTS in this loop.  (I also wonder if we shouldn't limit
> the number of times we retry if pg_pread returns zero (i.e. no error,
> but no bytes read either).  I don't know if this is a real-world
> consideration.)

If statement above is correct, then we shouldn't need this.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

Re: Attempt to consolidate reading of XLOG page

From

Alvaro Herrera

Date:

17 March 2020, 21:22:41

On 2019-Nov-20, Antonin Houska wrote:

> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > What is logical_read_local_xlog_page all about?  Seems useless.  Let's
> > get rid of it.
> 
> It seems so. Should I post a patch for that?

No need .. it was simple enough.  Just pushed it.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services