Thread: [HACKERS] Subtle bug in "Simplify tape block format" commit

[HACKERS] Subtle bug in "Simplify tape block format" commit

From
Peter Geoghegan
Date:
It seems that commit 01ec2563 has a subtle bug, which stems from the
fact that logtape.c no longer follows the rule described above
ltsGetFreeBlock():

/*
 * Select a currently unused block for writing to.
 *
 * NB: should only be called when writer is ready to write immediately,
 * to ensure that first write pass is sequential.
 */
static long
ltsGetFreeBlock(LogicalTapeSet *lts)
{
    ...
}

Previously, all calls to ltsGetFreeBlock() were immediately followed
by a corresponding call to ltsWriteBlock(); we wrote out the
newly-allocated-in-first-pass block there and then. It's a good idea
for a corresponding ltsWriteBlock() call to happen immediately, and
it's *essential* for it to happen before any later block is written
during the first write pass (when tuples are initially dumped out to
form runs), since, as noted above ltsWriteBlock():

/*
 * Write a block-sized buffer to the specified block of the underlying file.
 *
 * NB: should not attempt to write beyond current end of file (ie, create
 * "holes" in file), since BufFile doesn't allow that.  The first write pass
 * must write blocks sequentially.
...

However, a "write beyond current end of file" now seems to actually be
attempted at times, resulting in an arcane error during sorting. This
is more or less because LogicalTapeWrite() doesn't heed the warning
from ltsGetFreeBlock(), as seen here:

    while (size > 0)
    {
        if (lt->pos >= TapeBlockPayloadSize)
        {
            ...

            /*
             * First allocate the next block, so that we can store it in the
             * 'next' pointer of this block.
             */
            nextBlockNumber = ltsGetFreeBlock(lts);

            /* set the next-pointer and dump the current block. */
            TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
            ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
            ...
        }
        ...

    }

Note that LogicalTapeWrite() now allocates each block (calls
ltsGetFreeBlock()) one block in advance of current block, immediately
before *current* block is written (not the next block/block just
allocated). So, the corresponding ltsWriteBlock() call *must* happen
at an unspecified time in the future, typically the next time control
ends up at exactly the same point, where the block that was "next"
becomes "current".

I'm not about to argue that we should go back to following this
ltsGetFreeBlock() rule, though; I can see why Heikki refactored
LogicalTapeWrite() to allocate blocks a block in advance. Still, we
need to be more careful in avoiding the underlying problem that the
ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
has logtape.c be sure to write out a tape's buffer every time
tuplesort ends a run.

I have a test case. Build Postgres with only 0002-* applied, which
makes MAX_PHYSICAL_FILESIZE far smaller than its current value
(current value is 2^30, while this reduces it to BLCKSZ). Then:

postgres=# set replacement_sort_tuples = 0; set work_mem = 64;
SET
SET
postgres=# with a as (select repeat('a', 7000) i from
generate_series(1, 7)) select * from a order by i;
ERROR:  XX000: could not write block 5 of temporary file: Success
LOCATION:  ltsWriteBlock, logtape.c:204

This "block 5" corresponds to an fd.c file/BufFile segment that does
not in fact exist when this error is raised. Since BufFiles multiplex
BLCKSZ-sized segment files in this build of Postgres, it's quite
likely, in general, that writes marking the end of a run will straddle
"segment boundaries", which is what it takes for buffile.c/logtape.c
to throw this error. (The BufFile contract does not allow "holes" in
files, but segment boundaries must be crossed as the critical point
(just before a tape switch) to actually see this error -- you'd have
to be very unlucky to have things line up in exactly the wrong way
with 1GiB BufFile segments, but it can still happen.)

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Heikki Linnakangas
Date:
On 01/07/2017 12:33 AM, Peter Geoghegan wrote:
> Previously, all calls to ltsGetFreeBlock() were immediately followed
> by a corresponding call to ltsWriteBlock(); we wrote out the
> newly-allocated-in-first-pass block there and then. It's a good idea
> for a corresponding ltsWriteBlock() call to happen immediately, and
> it's *essential* for it to happen before any later block is written
> during the first write pass (when tuples are initially dumped out to
> form runs), since, as noted above ltsWriteBlock():
>
> /*
>  * Write a block-sized buffer to the specified block of the underlying file.
>  *
>  * NB: should not attempt to write beyond current end of file (ie, create
>  * "holes" in file), since BufFile doesn't allow that.  The first write pass
>  * must write blocks sequentially.
> ...

I completely missed that rule :-(.

> However, a "write beyond current end of file" now seems to actually be
> attempted at times, resulting in an arcane error during sorting. This
> is more or less because LogicalTapeWrite() doesn't heed the warning
> from ltsGetFreeBlock(), as seen here:
>
>     while (size > 0)
>     {
>         if (lt->pos >= TapeBlockPayloadSize)
>         {
>             ...
>
>             /*
>              * First allocate the next block, so that we can store it in the
>              * 'next' pointer of this block.
>              */
>             nextBlockNumber = ltsGetFreeBlock(lts);
>
>             /* set the next-pointer and dump the current block. */
>             TapeBlockGetTrailer(lt->buffer)->next = nextBlockNumber;
>             ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
>             ...
>         }
>         ...
>
>     }
>
> Note that LogicalTapeWrite() now allocates each block (calls
> ltsGetFreeBlock()) one block in advance of current block, immediately
> before *current* block is written (not the next block/block just
> allocated). So, the corresponding ltsWriteBlock() call *must* happen
> at an unspecified time in the future, typically the next time control
> ends up at exactly the same point, where the block that was "next"
> becomes "current".
>
> I'm not about to argue that we should go back to following this
> ltsGetFreeBlock() rule, though; I can see why Heikki refactored
> LogicalTapeWrite() to allocate blocks a block in advance. Still, we
> need to be more careful in avoiding the underlying problem that the
> ltsGetFreeBlock() rule was intended to prevent. Attached patch 0001-*
> has logtape.c be sure to write out a tape's buffer every time
> tuplesort ends a run.

Hmm. The LogicalTapeEndWriting() function is very similar to the 
LogicalTapePause() function I had in the pause/resume patch. They both 
flush the last block to disk. The difference is that LogicalTapePause() 
also free'd the buffer, and read back the last block from the disk if 
you continued writing, while LogicalTapeEndWriting() keeps a copy of it 
in memory.

With the proposed fix (or with the pause/resume), you can only write to 
a single tape at a time. Not a problem at the moment, but something to 
consider. At least, would need more comments to make that more clear, 
and an Assert would be nice.

Alternatively, we could fix this with a small change in ltsWriteBlock(), 
see attached patch. When we're about to create a hole in the file, write 
all-zero blocks to avoid the creating hole, before the block itself. 
That's not quite as efficient as writing the actual block contents into 
the hole, which avoids having to write it later, but probably won't make 
any measurable difference in practice. I'm inclined to do that, because 
it relaxes the rules on what you're allowed to do, in what order, which 
makes this more robust in general. We coul *also* have something like 
LogicalTapeEndWriting() or LogicalTapePause(), for efficiency, but it 
doesn't seem that important.

> I have a test case.
 > ...

Thanks for the analysis!

- Heikki


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Peter Geoghegan
Date:
On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. The LogicalTapeEndWriting() function is very similar to the
> LogicalTapePause() function I had in the pause/resume patch. They both flush
> the last block to disk. The difference is that LogicalTapePause() also
> free'd the buffer, and read back the last block from the disk if you
> continued writing, while LogicalTapeEndWriting() keeps a copy of it in
> memory.

Right.

> With the proposed fix (or with the pause/resume), you can only write to a
> single tape at a time. Not a problem at the moment, but something to
> consider. At least, would need more comments to make that more clear, and an
> Assert would be nice.

Agreed on that.

> Alternatively, we could fix this with a small change in ltsWriteBlock(), see
> attached patch. When we're about to create a hole in the file, write
> all-zero blocks to avoid the creating hole, before the block itself. That's
> not quite as efficient as writing the actual block contents into the hole,
> which avoids having to write it later, but probably won't make any
> measurable difference in practice. I'm inclined to do that, because it
> relaxes the rules on what you're allowed to do, in what order, which makes
> this more robust in general.

Why write an all-zero block, rather than arbitrary garbage, which is
all that BufFile really requires? I think it's because sizeof(int) nul
bytes represent the end of a run for tuplesort's purposes. That makes
me doubtful that this is any more robust or general than what I
proposed. So, I don't have a problem with the performance implications
of doing this, which should be minor, but I'm concerned that it
appears to be more general than it actually is.

-- 
Peter Geoghegan



Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Heikki Linnakangas
Date:
On 01/30/2017 07:45 PM, Peter Geoghegan wrote:
> On Mon, Jan 30, 2017 at 4:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Alternatively, we could fix this with a small change in ltsWriteBlock(), see
>> attached patch. When we're about to create a hole in the file, write
>> all-zero blocks to avoid the creating hole, before the block itself. That's
>> not quite as efficient as writing the actual block contents into the hole,
>> which avoids having to write it later, but probably won't make any
>> measurable difference in practice. I'm inclined to do that, because it
>> relaxes the rules on what you're allowed to do, in what order, which makes
>> this more robust in general.
>
> Why write an all-zero block, rather than arbitrary garbage, which is
> all that BufFile really requires? I think it's because sizeof(int) nul
> bytes represent the end of a run for tuplesort's purposes. That makes
> me doubtful that this is any more robust or general than what I
> proposed. So, I don't have a problem with the performance implications
> of doing this, which should be minor, but I'm concerned that it
> appears to be more general than it actually is.

It won't make any difference for correctness what bit pattern you write 
to "fill the hole", but you have to write something, and zeros seems 
like a natural choice.

- Heikki




Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Peter Geoghegan
Date:
On Mon, Jan 30, 2017 at 9:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> It won't make any difference for correctness what bit pattern you write to
> "fill the hole", but you have to write something, and zeros seems like a
> natural choice.

Ah, okay. That's probably fine, then.

Let me take another look at this later today before proceeding. I want
to run it against a custom test suite I've developed.

-- 
Peter Geoghegan



Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Peter Geoghegan
Date:
On Mon, Jan 30, 2017 at 10:01 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Let me take another look at this later today before proceeding. I want
> to run it against a custom test suite I've developed.

I've done so. Some more thoughts:

* I don't think that this is really any less efficient than my patch.
It's just that the write of a block happens within ltsWriteBlock()
automatically, rather than being an explicit thing that comes at the
tail-end of a run. This is more general, but otherwise very similar.
The downside is that it will work when something that we might want to
break happens, but I do concede that that's worth it.

* If you're writing out any old bit pattern, I suggest using 0x7F
bytes rather than nul bytes (I do agree that it does seem worth making
this some particular bit pattern, so as to not have to bother with
Valgrind suppressions and so on). That pattern immediately gives you a
hint on where to look for more information when there is a problem. To
me, it suggests "this is something weird". We don't want this to
happen very often.

* I think that the name lts->nBlocksAllocated works better than
lts->nBlocksReserved. After all, you are still writing stuff out in a
way that respects BufFile limitations around holes, and still
reporting that to the user via trace_sort as the number of blocks
actually written.

* I think that you should put the new code into a new function, called
ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite()
stuff itself, with a suitable custom defensive elog(ERROR) message.
That way, the only new branch needed in ltsWriteBlock() is "if
(blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you
can make it clear that ltsAllocBlocksUntil() is a rarely needed thing,
which seems appropriate.

We really don't want ltsAllocBlocksUntil() logic to be called very
often, and certainly not to write more than 1 or 2 blocks at a time
(no more than 1 with current usage). After all, that would mean
writing to the same position twice or more, for no reason at all.
Maybe note in comments that it's only called at the end of a run in
practice, or something to that effect. Keeping writes sequential is
very important, to keep logtape block reclamation effective.

I was actually planning on introducing a distinction between blocks
allocated and blocks written anyway, for the benefit of logtape.c
parallel sort, where holes can be left behind post-unification (the
start of each workers space needs to be aligned to
MAX_PHYSICAL_FILESIZE, so multi-block ranges of holes will be common).
So, your approach makes sense to me, and I now slightly prefer your
patch to my original.

Thanks
-- 
Peter Geoghegan



Re: [HACKERS] Subtle bug in "Simplify tape block format" commit

From
Heikki Linnakangas
Date:
On 01/31/2017 01:51 AM, Peter Geoghegan wrote:
> * If you're writing out any old bit pattern, I suggest using 0x7F
> bytes rather than nul bytes (I do agree that it does seem worth making
> this some particular bit pattern, so as to not have to bother with
> Valgrind suppressions and so on). That pattern immediately gives you a
> hint on where to look for more information when there is a problem. To
> me, it suggests "this is something weird". We don't want this to
> happen very often.

Somehow writing zeros still feels more natural to me. That's what you'd 
get if you just seeked past the end of file, too, for example. I 
understand your point of using 0x7F to catch bugs, but an all-zeros page 
is a tell-tale too. So I went with all-zeros, anyway.

> * I think that you should put the new code into a new function, called
> ltsAllocBlocksUntil(), or similar -- this can do the BufFileWrite()
> stuff itself, with a suitable custom defensive elog(ERROR) message.
> That way, the only new branch needed in ltsWriteBlock() is "if
> (blocknum > lts->nBlocksWritten)" (maybe use unlikely() too?), and you
> can make it clear that ltsAllocBlocksUntil() is a rarely needed thing,
> which seems appropriate.

I started to refactor this with something like ltsAllocBlocksUntil(), 
but in the end, it just seemed like more almost identical code with 
ltsWriteBlock.

> We really don't want ltsAllocBlocksUntil() logic to be called very
> often, and certainly not to write more than 1 or 2 blocks at a time
> (no more than 1 with current usage). After all, that would mean
> writing to the same position twice or more, for no reason at all.
> Maybe note in comments that it's only called at the end of a run in
> practice, or something to that effect. Keeping writes sequential is
> very important, to keep logtape block reclamation effective.

Added a comment explaining that.

Pushed with those little changes. Thanks!

- Heikki