Re: streaming header too small - Mailing list pgsql-hackers

From Magnus Hagander
Subject Re: streaming header too small
Date
Msg-id CABUevEzwipq1YCGsnBZKJCgJkJZDm8V2X=_UftNgYniQmmKjew@mail.gmail.com
Whole thread Raw
In response to Re: streaming header too small  (Selena Deckelmann <selena@chesnok.com>)
List pgsql-hackers
On Wed, Feb 20, 2013 at 4:53 PM, Selena Deckelmann <selena@chesnok.com> wrote:
>
>
>
> On Wed, Feb 20, 2013 at 6:23 AM, Magnus Hagander <magnus@hagander.net>
> wrote:
>>
>>
>> On Feb 20, 2013 11:29 AM, "Heikki Linnakangas" <hlinnakangas@vmware.com>
>> wrote:
>> >
>> > On 20.02.2013 02:11, Selena Deckelmann wrote:
>> >>
>> >> So, I just ran into a similar issue backing up a 9.2.1 server using
>> >> pg_basebackup version 9.2.3:
>> >>
>> >> pg_basebackup: starting background WAL receiver
>> >> pg_basebackup: streaming header too small: 25
>> >>
>> >>
>> >> I've had it happen two times in a row. I'm going to try again...
>> >>
>> >> But -- what would be helpful here? I can recompile pg_basebackup with
>> >> more
>> >> debugging...
>> >
>> >
>> > Hmm, 25 bytes would be the size of the WAL data packet, if it contains
>> > just the header and no actual WAL data. I think pg_basebackup should accept
>> > that - it's not unreasonable that the server might send such a packet
>> > sometimes.
>> >
>> > Looking at the walsender code, it's not supposed to ever send such a
>> > packet. But I suspect there's one corner-case where it might: if the current
>> > send location is at an xlogid boundary, so that we previously sent the last
>> > byte from the last WAL segment in the previous logical xlog file, and the
>> > WAL flush position points to byte 0 in the beginning of the new WAL file.
>> > Both of those positions are in fact the same thing, but we have two
>> > different ways to represent the same position. For example, if we've already
>> > sent up to WAL position (sentPtr in walsender.c):
>> >
>> > xlogid = 4
>> > xrecoff = XLogFileSize
>> >
>> > and GetFlushRecPtr() returns:
>> >
>> > xlogid = 5
>> > xrecoff = 0
>> >
>> > Those both point to the same position. But the check in XLogSend that
>> > decides if there is any work to do uses XLByteLE() to check if they are
>> > equal, and XLByteLE() treats the latter to be greater than the former. So,
>> > in that situation, XLogSend() would decide that it has work to do, but there
>> > actually isn't, so it would send 0 bytes of WAL data.
>> >
>> > I'm not sure how GetFlushRecPtr() could return such a position, though.
>> > But I'm also not convinced that it can't happen.
>> >
>> > It would be fairly easy to fix walsender to not send anything in that
>> > situation. It would also be easy to fix pg_basebackup to not treat it as an
>> > error. We probably should do both.
>> >
>> > In 9.3, the XLogRecPtr representation changed so that there is only one
>> > value for a boundary position like that, so this is a 9.2-only issue.
>
>
> Thanks for the debugging, Heikki!
>
>>
>> That does sound like a reasonable explanation and fix. Heck, probably
>> enough to just put the fix in pg_basebackup since it's gone in 9.3 anyway.
>>
>> But I'd really like to confirm this is the actual situation before
>> considering it fixed, since it's clearly very intermittent.
>>
>> Selena, was this reasonably reproducible for you? Would it be possible to
>> get a network trace of it to show of that's the kind of package coming
>> across, or by hacking up pg_basebackup to print the exact position it was at
>> when the problem occurred?
>
>
> This is happening with a very busy 700 GB system, so I'm going to rule out a
> network trace out for the moment. The error is occurring "sometime" in the
> middle of the backup. Last time it was at least 30-40 minutes into a 2 hr
> backup.

Yikes. Yeah, that doesn't seem like the best choice of method :)


> I'll see about hacking up pg_basebackup. I'm doing nightly backups so
> hopefully (?) it will happen again in the next few days.

Thanks!


-- Magnus HaganderMe: http://www.hagander.net/Work: http://www.redpill-linpro.com/



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: posix_fadvise missing in the walsender
Next
From: Heikki Linnakangas
Date:
Subject: Re: streaming header too small