Thread: smgrzeroextend clarification

smgrzeroextend clarification

From
Peter Eisentraut
Date:
I was looking at the new smgrzeroextend() function in the smgr API.  The 
documentation isn't very extensive:

/*
  *  smgrzeroextend() -- Add new zeroed out blocks to a file.
  *
  *      Similar to smgrextend(), except the relation can be extended by
  *      multiple blocks at once and the added blocks will be filled with
  *      zeroes.
  */

The documentation of smgrextend() is:

/*
  *  smgrextend() -- Add a new block to a file.
  *
  *      The semantics are nearly the same as smgrwrite(): write at the
  *      specified position.  However, this is to be used for the case of
  *      extending a relation (i.e., blocknum is at or beyond the current
  *      EOF).  Note that we assume writing a block beyond current EOF
  *      causes intervening file space to become filled with zeroes.
  */

So if you want to understand what smgrzeroextend() does, you need to 
mentally combine the documentation of three different functions.  Could 
we write documentation for each function that stands on its own?  And 
document the function arguments, like what does blocknum and nblocks mean?

Moreover, the text "except the relation can be extended by multiple 
blocks at once and the added blocks will be filled with zeroes" doesn't 
make much sense as a differentiation, because smgrextend() does that as 
well.

AFAICT, the differences between smgrextend() and smgrzeroextend() are:

1. smgrextend() writes a payload block in addition to extending the 
file, smgrzeroextend() just extends the file without writing a payload.

2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to 
make sure the extended space is actually reserved on disk, smgrextend() 
does not.

#1 seems fine, but the naming of the APIs does not reflect that at all.

If we think that #2 is important, maybe smgrextend() should do that as 
well?  Or at least explain why it's not needed?



Re: smgrzeroextend clarification

From
Bharath Rupireddy
Date:
On Wed, May 10, 2023 at 3:20 PM Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> I was looking at the new smgrzeroextend() function in the smgr API.  The
> documentation isn't very extensive:
>
> /*
>   *  smgrzeroextend() -- Add new zeroed out blocks to a file.
>   *
>   *      Similar to smgrextend(), except the relation can be extended by
>   *      multiple blocks at once and the added blocks will be filled with
>   *      zeroes.
>   */
>
> The documentation of smgrextend() is:
>
> /*
>   *  smgrextend() -- Add a new block to a file.
>   *
>   *      The semantics are nearly the same as smgrwrite(): write at the
>   *      specified position.  However, this is to be used for the case of
>   *      extending a relation (i.e., blocknum is at or beyond the current
>   *      EOF).  Note that we assume writing a block beyond current EOF
>   *      causes intervening file space to become filled with zeroes.
>   */
>
> So if you want to understand what smgrzeroextend() does, you need to
> mentally combine the documentation of three different functions.  Could
> we write documentation for each function that stands on its own?  And
> document the function arguments, like what does blocknum and nblocks mean?

Why not?

> Moreover, the text "except the relation can be extended by multiple
> blocks at once and the added blocks will be filled with zeroes" doesn't
> make much sense as a differentiation, because smgrextend() does that as
> well.

Not exactly. smgrextend() doesn't write a zero-ed block on its own, it
writes the content that's passed to it via 'buffer'. It's just that
some of smgrextend() callers pass in a zero buffer. Whereas,
smgrzeroextend() writes zero-ed blocks on its own, something
smgrextend() called in a loop with zero-ed 'buffer'. Therefore, the
existing wording seems fine to me.

> AFAICT, the differences between smgrextend() and smgrzeroextend() are:
>
> 1. smgrextend() writes a payload block in addition to extending the
> file, smgrzeroextend() just extends the file without writing a payload.

I think how smgrzeroextend() extends the zeroed blocks is internal to
it, and mdzeroextend() happens to use fallocate (if available). I
think the existing wording around smgrzeroextend() seems fine to me.

> 2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to
> make sure the extended space is actually reserved on disk, smgrextend()
> does not.

It's not smgrzeroextend() per se, it is mdzeroextend() that uses
fallocate() if available.

Overall, +0.5 from me if we want to avoid comment traversals to
understand what these functions do and be more descriptive; we might
end up duplicating comments. But, I'm fine with ""except the relation
can be extended by multiple...." before smgrzeroextend().

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: smgrzeroextend clarification

From
Andres Freund
Date:
Hi,

On 2023-05-10 11:50:14 +0200, Peter Eisentraut wrote:
> I was looking at the new smgrzeroextend() function in the smgr API.  The
> documentation isn't very extensive:
> 
> /*
>  *  smgrzeroextend() -- Add new zeroed out blocks to a file.
>  *
>  *      Similar to smgrextend(), except the relation can be extended by
>  *      multiple blocks at once and the added blocks will be filled with
>  *      zeroes.
>  */
> 
> The documentation of smgrextend() is:
> 
> /*
>  *  smgrextend() -- Add a new block to a file.
>  *
>  *      The semantics are nearly the same as smgrwrite(): write at the
>  *      specified position.  However, this is to be used for the case of
>  *      extending a relation (i.e., blocknum is at or beyond the current
>  *      EOF).  Note that we assume writing a block beyond current EOF
>  *      causes intervening file space to become filled with zeroes.
>  */
> 
> So if you want to understand what smgrzeroextend() does, you need to
> mentally combine the documentation of three different functions.  Could we
> write documentation for each function that stands on its own?  And document
> the function arguments, like what does blocknum and nblocks mean?

I guess it couldn't hurt. But if we go down that route, we basically need to
rewrite all the function headers in smgr.c, I think.


> Moreover, the text "except the relation can be extended by multiple blocks
> at once and the added blocks will be filled with zeroes" doesn't make much
> sense as a differentiation, because smgrextend() does that as well.

Hm? smgrextend() writes a single block, and it's filled with the caller
provided buffer.


> AFAICT, the differences between smgrextend() and smgrzeroextend() are:
> 
> 1. smgrextend() writes a payload block in addition to extending the file,
> smgrzeroextend() just extends the file without writing a payload.
> 
> 2. smgrzeroextend() uses various techniques (posix_fallocate() etc.) to make
> sure the extended space is actually reserved on disk, smgrextend() does not.
> 
> #1 seems fine, but the naming of the APIs does not reflect that at all.
> 
> If we think that #2 is important, maybe smgrextend() should do that as well?
> Or at least explain why it's not needed?

smgrextend() does #2 - it just does it by writing data.

The FileFallocate() path in smgrzeroextend() tries to avoid writing data if
extending by sufficient blocks - not having dirty data in the kernel page
cache can substantially reduce the IO usage.

Whereas the FileZero() path just optimizes the number of syscalls (and cache
misses etc).

Greetings,

Andres Freund



Re: smgrzeroextend clarification

From
Peter Eisentraut
Date:
On 10.05.23 20:10, Andres Freund wrote:
>> Moreover, the text "except the relation can be extended by multiple blocks
>> at once and the added blocks will be filled with zeroes" doesn't make much
>> sense as a differentiation, because smgrextend() does that as well.
> 
> Hm? smgrextend() writes a single block, and it's filled with the caller
> provided buffer.

But there is nothing that says that the block written by smgrextend() 
has to be the one right after the last existing block.  You can give it 
any block number, and it will write there, and the blocks in between 
that are skipped over will effectively be filled with zeros.  This is 
because of the way the POSIX file system APIs work.

You can observe this by hacking it up like this:

  smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
  {
+   if (blocknum > smgrnblocks(reln, forknum) + 1)
+       elog(INFO, "XXX");
+
     smgrsw[reln->smgr_which].smgr_extend(reln, forknum, blocknum,
                                          buffer, skipFsync);

Then you will get various test "failures" for hash indexes.

If you hack it up even further and actively fill the skipped-over blocks 
with something other than zeros, you will get even more dramatic failures.

So apparently, this behavior is actively being used.

Maybe it was never meant that way and only works accidentally?  Maybe 
hash indexes are broken?




Re: smgrzeroextend clarification

From
Greg Stark
Date:
On Thu, 11 May 2023 at 05:37, Peter Eisentraut
<peter.eisentraut@enterprisedb.com> wrote:
>
> Maybe it was never meant that way and only works accidentally?  Maybe
> hash indexes are broken?

It's explicitly documented to be this way. And I think it has to work
this way for recovery to work.

I think the reason you and Bharath and Andres are talking past each
other is that they're thinking about how the implementation works and
you're talking about the API definition.

If you read the API definition and treat the functions as a black box
I think you're right -- those two definitions sound pretty much
equivalent to me. They both extend the file, possibly multiple blocks,
and zero fill. The only difference is that smgrextend() additionally
allows you to provide data.

-- 
greg



Re: smgrzeroextend clarification

From
Thomas Munro
Date:
On Sat, May 13, 2023 at 6:07 AM Greg Stark <stark@mit.edu> wrote:
> On Thu, 11 May 2023 at 05:37, Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> > Maybe it was never meant that way and only works accidentally?  Maybe
> > hash indexes are broken?
>
> It's explicitly documented to be this way. And I think it has to work
> this way for recovery to work.
>
> I think the reason you and Bharath and Andres are talking past each
> other is that they're thinking about how the implementation works and
> you're talking about the API definition.
>
> If you read the API definition and treat the functions as a black box
> I think you're right -- those two definitions sound pretty much
> equivalent to me. They both extend the file, possibly multiple blocks,
> and zero fill. The only difference is that smgrextend() additionally
> allows you to provide data.

Just a thought: should RelationCopyStorageUsingBuffer(), the new code
used by CREATE DATABASE with the default strategy WAL_LOG, use the
newer interface so that it creates fully allocated files instead of
sparse ones?



Re: smgrzeroextend clarification

From
Andres Freund
Date:
Hi,

On May 11, 2023 2:37:00 AM PDT, Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
>On 10.05.23 20:10, Andres Freund wrote:
>>> Moreover, the text "except the relation can be extended by multiple blocks
>>> at once and the added blocks will be filled with zeroes" doesn't make much
>>> sense as a differentiation, because smgrextend() does that as well.
>>
>> Hm? smgrextend() writes a single block, and it's filled with the caller
>> provided buffer.
>
>But there is nothing that says that the block written by smgrextend() has to be the one right after the last existing
block. You can give it any block number, and it will write there, and the blocks in between that are skipped over will
effectivelybe filled with zeros.  This is because of the way the POSIX file system APIs work. 

Sure, but that's pretty much independent of my changes. With the exception of, I believe, hash indexes we are quite
carefulto never leave holes in files. And not just for performance reasons - itd make it much more likely to encounter
ENOSPCwhile writing back blocks. Being unable to checkpoint (because they fail due to ENOSPC), is quite nasty.  


>Maybe it was never meant that way and only works accidentally?  Maybe hash indexes are broken?

It's known behavior I think - but also quite bad. I think it got a good bit worse after WAL support for hash indexes
wentin. I think during replay we sometimes end up actually allocating the blocks one by one. 

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: smgrzeroextend clarification

From
Andres Freund
Date:
Hi,

On May 12, 2023 11:36:23 AM PDT, Thomas Munro <thomas.munro@gmail.com> wrote:
>Just a thought: should RelationCopyStorageUsingBuffer(), the new code
>used by CREATE DATABASE with the default strategy WAL_LOG, use the
>newer interface so that it creates fully allocated files instead of
>sparse ones?

I played with that, but at least on Linux with ext4 and xfs it was hard to find cases where it really was beneficial.
That'sactually how I ended up finding the issues I'd fixed recently-ish. 

I think it might be different if we had an option to not use a strategy for the target database - right now we IIRC
writeback due to ring replacement. Entirely or largely in order, which I think removes most of the issues you could
have.

One issue is that it'd be worse on platforms / filesystems without fallocate support, because we would write the data
backtwice (once with zeros, once the real data). Perhaps we should add a separate parameter controlling the fallback
behaviour.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: smgrzeroextend clarification

From
Peter Eisentraut
Date:
On 10.05.23 20:10, Andres Freund wrote:
>> So if you want to understand what smgrzeroextend() does, you need to
>> mentally combine the documentation of three different functions.  Could we
>> write documentation for each function that stands on its own?  And document
>> the function arguments, like what does blocknum and nblocks mean?
> I guess it couldn't hurt. But if we go down that route, we basically need to
> rewrite all the function headers in smgr.c, I think.

I took a stab at this, going through the function comments in smgr.c and 
md.c and try to make some things easier to follow.

- Took at out the weird leading tabs in the older comments.

- Rephrased some comments so that smgr.c is more like an API 
documentation and md.c documents what that particular instance of that 
API does.

- Move the *extend and *zeroextend functions to a more sensible place 
among all the functions.  Especially since *write and *extend are very 
similar, it makes sense to have them close together.  This way, it's 
also easier to follow "this function is like that function" comments.

- Also moved mdwriteback(), which was in some completely odd place.

- Added comments for function arguments that were previously not documented.

- Reworded the comments for *extend and *zeroextend to make more sense 
relative to each other.

- I left this for smgrzeroextend():

FIXME: why both blocknum and nblocks

Like, what happens when blocknum is not the block right after the last 
existing block?  Do you get an implicit POSIX hole between the end of 
the file and the specified block and then an explicit hole for the next 
nblocks?  We should be clear here what the designer of this API intended.


The name smgrzeroextend is not great.  The "zero" isn't what 
distinguishes it from smgrextend.

Attachment

Re: smgrzeroextend clarification

From
Andres Freund
Date:
Hi,

On 2023-05-16 20:40:01 +0200, Peter Eisentraut wrote:
> On 10.05.23 20:10, Andres Freund wrote:
> > > So if you want to understand what smgrzeroextend() does, you need to
> > > mentally combine the documentation of three different functions.  Could we
> > > write documentation for each function that stands on its own?  And document
> > > the function arguments, like what does blocknum and nblocks mean?
> > I guess it couldn't hurt. But if we go down that route, we basically need to
> > rewrite all the function headers in smgr.c, I think.
> 
> I took a stab at this, going through the function comments in smgr.c and
> md.c and try to make some things easier to follow.
> 
> - Took at out the weird leading tabs in the older comments.

I hate those. I wonder if we could devise a regex that'd remove them
tree-wide, instead of ending up doing it very slowly one by one - I've
e.g. removed them from a bunch of pgstat files. Reflowing the comments after
is probably the most painful part.

FWIW, I'd do that in a separate commit, and then add that commit to
.git-blame-ignore-revs. Otherwise blaming gets unnecessarily painful. Also
makes it easier to see the actual content changes...


> - Rephrased some comments so that smgr.c is more like an API documentation
> and md.c documents what that particular instance of that API does.
> 
> - Move the *extend and *zeroextend functions to a more sensible place among
> all the functions.  Especially since *write and *extend are very similar, it
> makes sense to have them close together.  This way, it's also easier to
> follow "this function is like that function" comments.

For me the prior location made a bit more sense - we should always extend the
file before writing or reading the relevant blocks. But I don't really care.


> - Also moved mdwriteback(), which was in some completely odd place.
> 
> - Added comments for function arguments that were previously not documented.
> 
> - Reworded the comments for *extend and *zeroextend to make more sense
> relative to each other.
> 
> - I left this for smgrzeroextend():
> 
> FIXME: why both blocknum and nblocks

I guess you're suggesting that we would do an lstat() to figure out the size
instead? Or use some cached size? That'd not be trivial to add - and it just
seems unrelated to smgzerorextend(), it's just as true for smgrextend().


> Like, what happens when blocknum is not the block right after the last
> existing block?

The same as with smgrextend().


> Do you get an implicit POSIX hole between the end of the file and the
> specified block and then an explicit hole for the next nblocks?

I don't know what you mean with an explicit hole? The whole point of
posix_fallocate() is that it actually allocates blocks - I don't really see
how such a space could be described as a hole? My understanding of the
"implicit POSIX hole" semantics is that the point precisely is that there is
*no* space allocated for the region.


> We should be clear here what the designer of this API intended.
> 
> 
> The name smgrzeroextend is not great.  The "zero" isn't what distinguishes
> it from smgrextend.

I think it's a quite distinguishing feature - you can't provide initial block
contents, which you can with smgrextend() (and we largely do). And using
fallocate() is only possible because we know that we're intending to read
zeros.


Greetings,

Andres Freund



Re: smgrzeroextend clarification

From
Peter Eisentraut
Date:
I have committed some of the unrelated formatting changes separately, so 
what's left now is attached.

On 17.05.23 01:38, Andres Freund wrote:
>> - I left this for smgrzeroextend():
>>
>> FIXME: why both blocknum and nblocks
> 
> I guess you're suggesting that we would do an lstat() to figure out the size
> instead? Or use some cached size? That'd not be trivial to add - and it just
> seems unrelated to smgzerorextend(), it's just as true for smgrextend().

What smgzerorextend() does now seems sensible to me.  I'd just like one 
or two sentences that document its API.  Right now, blocknum and nblocks 
are not documented at all.  Of course, we can guess, but I'm also 
interested in edge cases.  What are you supposed to pass for blocknum? 
What happens if it's not right after the current last block?  You 
answered that, but is that just what happens to happen, or do we 
actually want to support that?  What happens if blocknum is *before* the 
currently last block?  Would that overwrite the last existing blocks 
with zero?  What if nblocks is negative?  Does that zero out blocks 
backwards?

Obviously, the answer for most of these right now is that you're not 
supposed to do that.  But as the smgrextend() + hash index case shows, 
these things tend to grow in unexpected directions.

Also, slightly unrelated to the API, did we consider COW file systems? 
Like, is explicitly allocating blocks of zeroes sensible on btrfs?

Attachment