Thread: WIP: Avoid creation of the free space map for small tables

WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
Hi all,
A while back, Robert Haas noticed that the space taken up by very
small tables is dominated by the FSM [1]. Tom suggested that we could
prevent creation of the FSM until the heap has reached a certain
threshold size [2]. Attached is a WIP patch to implement that. I've
also attached a SQL script to demonstrate the change in behavior for
various scenarios.

The behavior that allows for the simplest implementation I could think of is as follows:

-The FSM isn't created if the heap has fewer than 10 blocks (or
whatever). If the last known good block has insufficient space, try
every block before extending the heap (see the sketch after this list).

-If a heap with a FSM is truncated back to below the threshold, the
FSM stays around and can be used as usual.

-If the heap tuples are all deleted, the FSM stays but has no leaf
blocks (same as on master). Although it exists, it won't be
re-extended until the heap re-passes the threshold.
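
For illustration, here is a minimal standalone sketch of that search
order (simplified types and a toy free-space array; this is not the
patch itself):

#include <stdio.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Toy heap: free bytes per block; block 2 is the only one with room. */
static int freespace[] = {0, 0, 120, 0};
#define NBLOCKS ((BlockNumber) (sizeof(freespace) / sizeof(freespace[0])))

static BlockNumber
get_target_block(BlockNumber last_good, int len)
{
    /* First preference: the block that last accepted a tuple. */
    if (last_good != InvalidBlockNumber && freespace[last_good] >= len)
        return last_good;

    /* Fallback: try every block before extending the heap. */
    for (BlockNumber blkno = 0; blkno < NBLOCKS; blkno++)
        if (freespace[blkno] >= len)
            return blkno;

    return InvalidBlockNumber;  /* caller must extend the relation */
}

int
main(void)
{
    BlockNumber target = get_target_block(3, 100);

    if (target == InvalidBlockNumber)
        printf("no room anywhere: extend the heap\n");
    else
        printf("insert into block %u\n", target);  /* block 2 here */
    return 0;
}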

--
Some notes:

-For normal mode, I taught fsm_set_and_search() to switch to a
non-extending buffer call, but the biggest missing piece is WAL
replay. I couldn't find a non-extending equivalent of
XLogReadBufferExtended(), so I might have to create one.

-There'll need to be some performance testing to make sure there's no
regression, and to choose a good value for the threshold. I'll look
into that, but if anyone has any ideas for tests, that'll help this
effort along.

-A possible TODO item is to teach pg_upgrade not to link FSMs for
small heaps. I haven't looked into the feasibility of that, however.

-RelationGetBufferForTuple() now has two boolean variables that mean
"don't use the FSM", but with different behaviors. To avoid confusion,
I've renamed use_fsm to always_extend and revised the commentary
accordingly.

-I've only implemented this for heaps, because indexes (at least
B-tree) don't seem to be as eager to create a FSM. I haven't looked at
the code, however.

--
[1] https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us

--
I'll add this to the November commitfest.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Thomas Munro
Date:
On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:
> A while back, Robert Haas noticed that the space taken up by very
> small tables is dominated by the FSM [1]. Tom suggested that we could
> prevent creation of the FSM until the heap has reached a certain
> threshold size [2]. Attached is a WIP patch to implement that. I've
> also attached a SQL script to demonstrate the change in behavior for
> various scenarios.

Hi John,

You'll need to tweak the test in contrib/pageinspect/sql/page.sql,
because it's currently asserting that there is an FSM on a small table
so make check-world fails.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:
>> A while back, Robert Haas noticed that the space taken up by very
>> small tables is dominated by the FSM [1]. Tom suggested that we could
>> prevent creation of the FSM until the heap has reached a certain
>> threshold size [2]. Attached is a WIP patch to implement that. I've
>> also attached a SQL script to demonstrate the change in behavior for
>> various scenarios.
>
> Hi John,
>
> You'll need to tweak the test in contrib/pageinspect/sql/page.sql,
> because it's currently asserting that there is an FSM on a small table
> so make check-world fails.

Whoops, sorry about that; the attached patch passes make check-world.
While looking into that, I also found a regression: If the cached
target block is the last block in the relation and there is no free
space, that block will be tried twice. That's been fixed as well.

Thanks,
-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Tom Lane
Date:
John Naylor <jcnaylor@gmail.com> writes:
> On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:
>>> A while back, Robert Haas noticed that the space taken up by very
>>> small tables is dominated by the FSM [1]. Tom suggested that we could
>>> prevent creation of the FSM until the heap has reached a certain
>>> threshold size [2]. Attached is a WIP patch to implement that.

BTW, don't we need a similar hack for visibility maps?

            regards, tom lane


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/7/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> John Naylor <jcnaylor@gmail.com> writes:
>> On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote:
>>>> A while back, Robert Haas noticed that the space taken up by very
>>>> small tables is dominated by the FSM [1]. Tom suggested that we could
>>>> prevent creation of the FSM until the heap has reached a certain
>>>> threshold size [2]. Attached is a WIP patch to implement that.
>
> BTW, don't we need a similar hack for visibility maps?

The FSM is the bigger bang for the buck, and fairly simple to do, but
it would be nice to do something about VMs as well. I'm not sure if
simply lacking a VM would be as simple (or as free of downsides) as
for the FSM. I haven't studied the VM code in detail, however.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> -There'll need to be some performance testing to make sure there's no
> regression, and to choose a good value for the threshold. I'll look
> into that, but if anyone has any ideas for tests, that'll help this
> effort along.
>

Can you try with a COPY command which copies just enough tuples to
fill the number of pages equivalent to HEAP_FSM_EXTENSION_THRESHOLD?
It seems to me that in such a case the patch will try each of the
blocks multiple times.  It looks quite lame that we have to try, again
and again, the blocks which we have just filled ourselves, but maybe
that doesn't matter much as the threshold value is small.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> Hi all,
> A while back, Robert Haas noticed that the space taken up by very
> small tables is dominated by the FSM [1]. Tom suggested that we could
> prevent creation of the FSM until the heap has reached a certain
> threshold size [2]. Attached is a WIP patch to implement that. I've
> also attached a SQL script to demonstrate the change in behavior for
> various scenarios.
>
> The behavior that allows the simplest implementation I thought of is as follows:
>
> -The FSM isn't created if the heap has fewer than 10 blocks (or
> whatever). If the last known good block has insufficient space, try
> every block before extending the heap.
>
> -If a heap with a FSM is truncated back to below the threshold, the
> FSM stays around and can be used as usual.
>
> -If the heap tuples are all deleted, the FSM stays but has no leaf
> blocks (same as on master). Although it exists, it won't be
> re-extended until the heap re-passes the threshold.
>
> --
> Some notes:
>
> -For normal mode, I taught fsm_set_and_search() to switch to a
> non-extending buffer call, but the biggest missing piece is WAL
> replay.
>

fsm_set_and_search()
{
..
+	/*
+	 * For heaps we prevent extension of the FSM unless the number of pages
+	 * exceeds HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already
+	 * have a FSM, this will save an inode and a few kB of space.
+	 * For sane threshold values, the FSM address will be zero, so we
+	 * don't bother dealing with anything else.
+	 */
+	if (rel->rd_rel->relkind == RELKIND_RELATION
+		&& addr.logpageno == 0)

I am not sure if this is a solid way to avoid creating FSM.  What if
fsm_set_and_search gets called for a level other than 0?  Also,
when the relation has more blocks than HEAP_FSM_EXTENSION_THRESHOLD,
then the first time vacuum tries to record the free space in a
page, won't it skip recording free space for the first
HEAP_FSM_EXTENSION_THRESHOLD pages?

I think you have found a good way to avoid creating FSM, but can't we
use some simpler technique like if the FSM fork for a relation doesn't
exist, then check the heapblk number for which we try to update the
FSM and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
creating the FSM.

> I couldn't find a non-extending equivalent of
> XLogReadBufferExtended(), so I might have to create one.
>

I think it would be better if we can find a common way to avoid
creating FSM both during DO and REDO time.  It might be possible if
something like what I have said above is feasible.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote:
>> -For normal mode, I taught fsm_set_and_search() to switch to a
>> non-extending buffer call, but the biggest missing piece is WAL
>> replay.
>>
>
> fsm_set_and_search()
> {
> ..
> +	/*
> +	 * For heaps we prevent extension of the FSM unless the number of pages
> +	 * exceeds HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already
> +	 * have a FSM, this will save an inode and a few kB of space.
> +	 * For sane threshold values, the FSM address will be zero, so we
> +	 * don't bother dealing with anything else.
> +	 */
> +	if (rel->rd_rel->relkind == RELKIND_RELATION
> +		&& addr.logpageno == 0)
>
> I am not sure if this is a solid way to avoid creating FSM.  What if
> fsm_set_and_search gets called for a level other than 0?

Thanks for taking a look. As for levels other than 0, I think that
only happens when fsm_set_and_search() is called by fsm_search(),
which will not cause extension.

> Also,
> when the relation has more blocks than HEAP_FSM_EXTENSION_THRESHOLD,
> then the first time vacuum tries to record the free space in a
> page, won't it skip recording free space for the first
> HEAP_FSM_EXTENSION_THRESHOLD pages?

Hmm, that's a good point.

> I think you have found a good way to avoid creating FSM, but can't we
> use some simpler technique like if the FSM fork for a relation doesn't
> exist, then check the heapblk number for which we try to update the
> FSM and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
> creating the FSM.

I think I see what you mean, but to avoid the vacuum problem you just
mentioned, we'd need to check the relation size, too. I've attached an
unpolished revision to do this. It seems to work, but I haven't tested
the vacuum issue yet. I'll do that and some COPY performance testing
in the next day or so. There's a bit more repetition than I would
like, so I'm not sure it's simpler - perhaps RecordPageWithFreeSpace()
could be turned into a wrapper around RecordAndGetPageWithFreeSpace().

Also new in this version, some non-functional improvements to hio.c:
-debugging calls that are #ifdef'd out.
-move some code out into a function instead of adding another goto.

>> I couldn't find a non-extending equivalent of
>> XLogReadBufferExtended(), so I might have to create one.
>>
>
> I think it would be better if we can find a common way to avoid
> creating FSM both during DO and REDO time.  It might be possible if
> something like what I have said above is feasible.

That would be ideal.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
> On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think you have found a good way to avoid creating FSM, but can't we
>> use some simpler technique like if the FSM fork for a relation doesn't
>> exist, then check the heapblk number for which we try to update the
>> FSM and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
>> creating the FSM.

>> I think it would be better if we can find a common way to avoid
>> creating FSM both during DO and REDO time.  It might be possible if
>> something like what I have said above is feasible.

I've attached v4, which implements the REDO case as closely as
possible to the DO case. I've created a new function to guard against
creation of the FSM, which is called by RecordPageWithFreeSpace() and
RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace()
takes a relfilenode and not a relation, I had to reimplement that
separately, but the logic is basically the same. It works under
streaming replication.
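
For illustration, the relfilenode-based guard might look roughly like
this (a hypothetical sketch against the backend's smgr API; the
function name and exact shape are assumptions, not the patch's actual
code):

static bool
fsm_allow_writes_during_redo(RelFileNode rnode, BlockNumber heapBlk)
{
    SMgrRelation smgr = smgropen(rnode, InvalidBackendId);

    /* Past the threshold, FSM writes (and creation) are always allowed. */
    if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD)
        return true;

    /* An FSM fork that already exists can be used as usual. */
    if (smgrexists(smgr, FSM_FORKNUM))
        return true;

    /* Otherwise, allow only once the heap itself passes the threshold. */
    return smgrnblocks(smgr, MAIN_FORKNUM) > HEAP_FSM_EXTENSION_THRESHOLD;
}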

I've also attached a couple SQL scripts which, when the aforementioned
DEBUG1 calls are enabled, show what the heap insert code is doing for
different scenarios. Make check-world passes.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Oct 14, 2018 at 1:09 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > I think you have found a good way to avoid creating FSM, but can't we
> > use some simpler technique like if the FSM fork for a relation doesn't
> > exist, then check the heapblk number for which we try to update the
> > FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid
> > creating the FSM.
>
> I think I see what you mean, but to avoid the vacuum problem you just
> mentioned, we'd need to check the relation size, too.
>

Sure, but vacuum already has the relation size.  In general, I think we
should try to avoid adding more system calls in the common code path;
it can impact performance.

Few comments on your latest patch:
-
+static bool
+allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
+{
+	BlockNumber heap_nblocks;
+
+	if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
+		rel->rd_rel->relkind != RELKIND_RELATION)
+		return true;
+
+	/* XXX is this value cached? */
+	heap_nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
+		return true;
+	else
+	{
+		RelationOpenSmgr(rel);
+		return smgrexists(rel->rd_smgr, FSM_FORKNUM);
+	}
+}

I think you can avoid calling RelationGetNumberOfBlocks, if you call
smgrexists before, and for the purpose of vacuum, we can get that as an
input parameter.  I think one can argue for not changing the interface
functions like RecordPageWithFreeSpace to avoid calling
RelationGetNumberOfBlocks, but to me, it appears worthwhile to save
additional system call.

-
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
-
- /*
- * If the FSM knows nothing of the rel, try the last page before we
- * give up and extend.  This avoids one-tuple-per-page syndrome during
- * bootstrapping or in a recently-started system.
- */
  if (targetBlock == InvalidBlockNumber)
- {
- BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
-
- if (nblocks > 0)
- targetBlock = nblocks - 1;
- }
+ targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
+   &try_every_page);


Is it possible to hide the magic of trying each block within
GetPageWithFreeSpace?  It will simplify the code and in the future, if
another storage API has a different function for
RelationGetBufferForTuple, it will work seamlessly, provided they are
using the same FSM.  One such user is zheap.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few comments on your latest patch:
> -
> +static bool
> +allow_write_to_fsm(Relation rel, BlockNumber heapBlk)
> +{
> +	BlockNumber heap_nblocks;
> +
> +	if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD ||
> +		rel->rd_rel->relkind != RELKIND_RELATION)
> +		return true;
> +
> +	/* XXX is this value cached? */
> +	heap_nblocks = RelationGetNumberOfBlocks(rel);
> +
> +	if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD)
> +		return true;
> +	else
> +	{
> +		RelationOpenSmgr(rel);
> +		return smgrexists(rel->rd_smgr, FSM_FORKNUM);
> +	}
> +}
>
> I think you can avoid calling RelationGetNumberOfBlocks, if you call
> smgrexists before

Okay, I didn't know which was cheaper, but I'll check smgrexists
first. Thanks for the info.

> and for the purpose of vacuum, we can get that as an
> input parameter.  I think one can argue for not changing the interface
> functions like RecordPageWithFreeSpace to avoid calling
> RelationGetNumberOfBlocks, but to me, it appears worthwhile to save
> additional system call.

I agree, and that should be fairly straightforward. I'll include that
in the next patch.

> -
> targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> -
> - /*
> - * If the FSM knows nothing of the rel, try the last page before we
> - * give up and extend.  This avoids one-tuple-per-page syndrome during
> - * bootstrapping or in a recently-started system.
> - */
>   if (targetBlock == InvalidBlockNumber)
> - {
> - BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> -
> - if (nblocks > 0)
> - targetBlock = nblocks - 1;
> - }
> + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
> +   &try_every_page);
>
>
> Is it possible to hide the magic of trying each block within
> GetPageWithFreeSpace?  It will simplify the code and in the future, if
> another storage API has a different function for
> RelationGetBufferForTuple, it will work seamlessly, provided they are
> using the same FSM.  One such user is zheap.

Hmm, here I'm a bit more skeptical about the trade-offs. That would
mean, in effect, to put a function called get_page_no_fsm() in the FSM
code. ;-)  I'm willing to be convinced otherwise, of course, but
here's my reasoning:

For one, we'd have to pass prevBlockNumber and &try_every_block to
GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by
extension), which are irrelevant to some callers. In addition, in
hio.c, there is a call where we don't want to try any blocks that we
have already tried, much less all of them:

/*
 * Check if some other backend has extended a block for us while
 * we were waiting on the lock.
 */
targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);

By the time we get to this call, we likely wouldn't trigger the logic
to try every block, but I don't think we can guarantee that. We could
add a boolean parameter that means "consider trying every block", but
I don't think the FSM code should have so much state passed to it.

Thanks for reviewing,
-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Oct 15, 2018 at 4:09 PM John Naylor <jcnaylor@gmail.com> wrote:
> On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Few comments on your latest patch:
> > -
> > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> > -
> > - /*
> > - * If the FSM knows nothing of the rel, try the last page before we
> > - * give up and extend.  This avoids one-tuple-per-page syndrome during
> > - * bootstrapping or in a recently-started system.
> > - */
> >   if (targetBlock == InvalidBlockNumber)
> > - {
> > - BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> > -
> > - if (nblocks > 0)
> > - targetBlock = nblocks - 1;
> > - }
> > + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber,
> > +   &try_every_page);
> >
> >
> > Is it possible to hide the magic of trying each block within
> > GetPageWithFreeSpace?  It will simplify the code and in the future, if
> > another storage API has a different function for
> > RelationGetBufferForTuple, it will work seamlessly, provided they are
> > using the same FSM.  One such user is zheap.
>
> Hmm, here I'm a bit more skeptical about the trade-offs. That would
> mean, in effect, to put a function called get_page_no_fsm() in the FSM
> code. ;-)  I'm willing to be convinced otherwise, of course, but
> here's my reasoning:
>
> For one, we'd have to pass prevBlockNumber and &try_every_block to
> GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by
> extension), which are irrelevant to some callers.
>

I think we can avoid using prevBlockNumber and try_every_block, if we
maintain a small cache which tells whether a particular block has been
tried or not.  What I am envisioning is that while finding the block
with free space, if we come to know that the relation in question is
small enough that it doesn't have FSM, we can perform a local_search.
In local_search, we can enquire the cache for any block that we can try,
and if we find any block, we can try inserting into that block;
otherwise, we need to extend the relation.  One simple way to imagine
such a cache would be an array of structures, where each structure has
blkno and status fields.  After we get the usable block, we need to
clear the cache, if it exists.
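
To make that concrete, the cache might be as simple as this sketch
(hypothetical names):

typedef struct
{
    BlockNumber blkno;      /* a block of the small, FSM-less relation */
    bool        tried;      /* already tried inserting here? */
} LocalBlockEntry;

/* One entry per block, sized for the largest FSM-less relation. */
static LocalBlockEntry LocalBlockCache[HEAP_FSM_EXTENSION_THRESHOLD];
static int  LocalBlockCacheLen;     /* 0 means the cache is unused */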

> In addition, in
> hio.c, there is a call where we don't want to try any blocks that we
> have already tried, much less all of them:
>
> /*
>  * Check if some other backend has extended a block for us while
>  * we were waiting on the lock.
>  */
> targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
>
> By the time we get to this call, we likely wouldn't trigger the logic
> to try every block, but I don't think we can guarantee that.
>

I think the current code as proposed has that limitation, but if we
have a cache, then we can check whether the relation has actually been
extended after taking the lock, and if so we can try only the newly
added block(s).

I am not completely sure that the idea described above is better, but
it seems that it would be clean and could handle some of the cases
with ease.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think you can avoid calling RelationGetNumberOfBlocks, if you call
> smgrexists before

This is done in the attached v5, 0001.

> and for the purpose of vacuum, we can get that as an
> input parameter.  I think one can argue for not changing the interface
> functions like RecordPageWithFreeSpace to avoid calling
> RelationGetNumberOfBlocks, but to me, it appears worthwhile to save
> additional system call.

This is done in 0002. I also added a check for the cached value of
pg_class.relpages, since it's cheap and may help non-VACUUM callers.

> [proposal for a cache of blocks to try]

That's interesting. I'll have to do some reading elsewhere in the
codebase, and then I'll follow up.

Thanks,
-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Oct 16, 2018 at 4:27 PM John Naylor <jcnaylor@gmail.com> wrote:
> > [proposal for a cache of blocks to try]
>
> That's interesting. I'll have to do some reading elsewhere in the
> codebase, and then I'll follow up.
>

Thanks, I have changed the status of this patch as "Waiting on
Author".  Feel free to change it once you have a new patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think we can avoid using prevBlockNumber and try_every_block, if we
> maintain a small cache which tells whether a particular block has been
> tried or not.  What I am envisioning is that while finding the block
> with free space, if we come to know that the relation in question is
> small enough that it doesn't have FSM, we can perform a local_search.
> In local_search, we can enquire the cache for any block that we can try,
> and if we find any block, we can try inserting into that block;
> otherwise, we need to extend the relation.  One simple way to imagine
> such a cache would be an array of structures, where each structure has
> blkno and status fields.  After we get the usable block, we need to
> clear the cache, if it exists.

Here is the design I've implemented in the attached v6. There is more
code than v5, but there's a cleaner separation between freespace.c and
hio.c, as you preferred. I also think it's more robust. I've expended
some effort to avoid doing unnecessary system calls to get the number
of blocks.
--

For the local, in-memory map, maintain a static array of status
markers of fixed length HEAP_FSM_CREATION_THRESHOLD, indexed by block
number. This is populated every time we call GetPageWithFreeSpace() on
small tables with no FSM. The statuses are:

'zero' (beyond the relation)
'available to try'
'tried already'

Example for a 4-page heap:

01234567
AAAA0000

If we try block 3 and there is no space, we set it to 'tried' and next
time through the loop we'll try 2, etc:

01234567
AAAT0000

If we try all available blocks, we will extend the relation. As in the
master branch, first we call GetPageWithFreeSpace() again to see if
another backend extended the relation to 5 blocks while we were
waiting for the lock. If we find a new block, we will mark the new
block available and leave the rest alone:

01234567
TTTTA000

On the odd chance we still can't insert into the new block, we'll skip
checking any others and we'll redo the logic to extend the relation.

If we're about to successfully return a buffer, whether from an
existing block, or by extension, we clear the local map.
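
Here is a self-contained toy of those transitions (the status macros
are from the patch; everything else is simplified for illustration):

#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 8

#define FSM_LOCAL_ZERO  0x00    /* beyond the end of the relation */
#define FSM_LOCAL_AVAIL 0x01    /* available to try */
#define FSM_LOCAL_TRIED 0x02    /* already tried, not enough space */

static unsigned char map[HEAP_FSM_CREATION_THRESHOLD];

static void
local_map_set(unsigned int nblocks)
{
    /* Mark in-relation blocks available; leave 'tried' bits alone. */
    for (unsigned int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
        map[i] = (i < nblocks) ? (map[i] | FSM_LOCAL_AVAIL) : FSM_LOCAL_ZERO;
}

static void
print_map(void)
{
    for (unsigned int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
        putchar((map[i] & FSM_LOCAL_TRIED) ? 'T' :
                (map[i] & FSM_LOCAL_AVAIL) ? 'A' : '0');
    putchar('\n');
}

int
main(void)
{
    local_map_set(4);               /* 4-page heap */
    print_map();                    /* AAAA0000 */

    for (int i = 0; i < 4; i++)
        map[i] |= FSM_LOCAL_TRIED;  /* every block turned out full */

    local_map_set(5);               /* another backend added block 4 */
    print_map();                    /* TTTTA000 */
    return 0;
}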

Once this is in shape, I'll do some performance testing.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
I wrote:

> Once this is in shape, I'll do some performance testing.

On second thought, there's no point in waiting, especially if a
regression points to a design flaw.

I compiled patched postgres with HEAP_FSM_CREATION_THRESHOLD set to
32, then ran the attached script which populates 100 tables with
varying numbers of blocks. I wanted a test that created pages eagerly
and wrote to disk as little as possible. Config was stock, except for
fsync = off. I took the average of 10 runs after removing the slowest
and fastest run:

# blocks    master    patch
 4          36.4ms    33.9ms
 8          50.6ms    48.9ms
12          58.6ms    66.3ms
16          65.5ms    81.4ms

It seems under these circumstances a threshold of up to 8 performs
comparably to the master branch, with small block numbers possibly
faster than with the FSM, provided they're in shared buffers already.
I didn't bother testing higher values because it's clear there's a
regression starting around 10 or so, beyond which it helps to have the
FSM.

A case could be made for setting the threshold to 4, since not having
3 blocks of FSM in shared buffers exactly makes up for the 3 other
blocks of heap that are checked when free space runs out.

I can run additional tests if there's interest.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
Upthread I wrote:

> -A possible TODO item is to teach pg_upgrade not to link FSMs for
> small heaps. I haven't looked into the feasibility of that, however.

This turned out to be relatively lightweight (0002 attached). I had
to add relkind to the RelInfo struct and save the size of each heap as
it's transferred. The attached SQL script will set up a couple of test
cases to demonstrate pg_upgrade. Installations with large numbers of
small tables will be able to see space savings right away.

For 0001, I adjusted the README and docs, and also made some cosmetic
improvements to the code, mostly in the comments. I've set the
commitfest entry back to 'needs review'.

One thing I noticed is that one behavior on master hasn't changed:
System catalogs created during bootstrap still have a FSM if they have
any data. Possibly related, catalogs also have a VM even if they have
no data at all. This isn't anything to get excited about, but it would
be nice to investigate, at least so it can be documented. A cursory
dig hasn't found the cause, but I'll keep doing that as time permits.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Oct 31, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> Upthread I wrote:
>
> > -A possible TODO item is to teach pg_upgrade not to link FSMs for
> > small heaps. I haven't looked into the feasibility of that, however.
>
> This turned out to be relatively lightweight (0002 attached). I had
> to add relkind to the RelInfo struct and save the size of each heap as
> it's transferred. The attached SQL script will set up a couple of test
> cases to demonstrate pg_upgrade. Installations with large numbers of
> small tables will be able to see space savings right away.
>
> For 0001, I adjusted the README and docs, and also made some cosmetic
> improvements to the code, mostly in the comments. I've set the
> commitfest entry back to 'needs review'.
>
> One thing I noticed is that one behavior on master hasn't changed:
> System catalogs created during bootstrap still have a FSM if they have
> any data. Possibly related, catalogs also have a VM even if they have
> no data at all. This isn't anything to get excited about, but it would
> be nice to investigate, at least so it can be documented. A cursory
> dig hasn't found the cause, but I'll keep doing that as time permits.
>

Thanks for your work on this, I will try to review it during CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Robert Haas
Date:
On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote:
> A case could be made for setting the threshold to 4, since not having
> 3 blocks of FSM in shared buffers exactly makes up for the 3 other
> blocks of heap that are checked when free space runs out.

That doesn't seem like an unreasonable argument.  I'm not sure whether
the right threshold is 4 or something a little bigger, but I bet it's
not very large.  It seems important to me that before anybody thinks
about committing this, we construct some kind of destruction case
where repeated scans of the whole table are triggered as frequently as
possible, and then run that test with varying thresholds.  I might be
totally wrong, but I bet with a value as large as 32 you will be able
to find cases where it regresses in a big way.

We also need to think about what happens on the standby, where the FSM
is updated in a fairly different way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Oct 31, 2018 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote:
> > A case could be made for setting the threshold to 4, since not having
> > 3 blocks of FSM in shared buffers exactly makes up for the 3 other
> > blocks of heap that are checked when free space runs out.
>
> That doesn't seem like an unreasonable argument.  I'm not sure whether
> the right threshold is 4 or something a little bigger, but I bet it's
> not very large.  It seems important to me that before anybody thinks
> about committing this, we construct some kind of destruction case
> where repeated scans of the whole table are triggered as frequently as
> possible, and then run that test with varying thresholds.
>

Why do you think repeated scans will be a destruction case when there
is no FSM for a small table?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Robert Haas
Date:
On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > That doesn't seem like an unreasonable argument.  I'm not sure whether
> > the right threshold is 4 or something a little bigger, but I bet it's
> > not very large.  It seems important to me that before anybody thinks
> > about committing this, we construct some kind of destruction case
> > where repeated scans of the whole table are triggered as frequently as
> > possible, and then run that test with varying thresholds.
>
> Why do you think repeated scans will be a destruction case when there
> is no FSM for a small table?

That's not what I'm saying.  If we don't have the FSM, we have to
check every page of the table.  If there's a workload where that
happens a lot on a table that is just under the size threshold for
creating the FSM, then it's likely to be a worst case for this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Avoid creation of the free space map for small tables

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> That's not what I'm saying.  If we don't have the FSM, we have to
> check every page of the table.  If there's a workload where that
> happens a lot on a table that is just under the size threshold for
> creating the FSM, then it's likely to be a worst case for this patch.

Hmm, you're assuming something not in evidence: why would that be the
algorithm?  On a FSM-less table, I'd be inclined to just check the
last page and then grow the table if the tuple doesn't fit there.
This would, in many cases, soon result in a FSM being created, but
I think that's just fine.  The point of the change is to optimize
for cases where a table *never* gets more than a few inserts.  Not, IMO,
for cases where a table gets a lot of churn but never has a whole lot of
live tuples.  In the latter scenario we are far better off having a FSM.
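
As a minimal sketch of that alternative (PageHasRoomFor() is an
assumed stand-in for the real free-space check):

/* With no FSM: check only the last page, else ask for an extension. */
static BlockNumber
last_page_or_extend(BlockNumber nblocks, Size len)
{
    if (nblocks > 0 && PageHasRoomFor(nblocks - 1, len))  /* assumed helper */
        return nblocks - 1;
    return InvalidBlockNumber;  /* extend; an FSM would follow soon */
}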

            regards, tom lane


Re: WIP: Avoid creation of the free space map for small tables

From
Robert Haas
Date:
On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > That's not what I'm saying.  If we don't have the FSM, we have to
> > check every page of the table.  If there's a workload where that
> > happens a lot on a table that is just under the size threshold for
> > creating the FSM, then it's likely to be a worst case for this patch.
>
> Hmm, you're assuming something not in evidence: why would that be the
> algorithm?

I think it's in evidence, in the form of several messages mentioning a
flag called try_every_block.

Just checking the last page of the table doesn't sound like a good
idea to me.  I think that will just lead to a lot of stupid bloat.  It
seems likely that checking every page of the table is fine for npages
<= 3, and that would still be a win in a very significant number of
cases, since lots of instances have many empty or tiny tables.  I was
merely reacting to the suggestion that the approach should be used for
npages <= 32; that threshold sounds way too high.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>> > That's not what I'm saying.  If we don't have the FSM, we have to
>> > check every page of the table.  If there's a workload where that
>> > happens a lot on a table that is just under the size threshold for
>> > creating the FSM, then it's likely to be a worst case for this patch.
>>
>> Hmm, you're assuming something not in evidence: why would that be the
>> algorithm?
>
> I think it's in evidence, in the form of several messages mentioning a
> flag called try_every_block.

Correct.

> Just checking the last page of the table doesn't sound like a good
> idea to me.  I think that will just lead to a lot of stupid bloat.  It
> seems likely that checking every page of the table is fine for npages
> <= 3, and that would still be a win in a very significant number of
> cases, since lots of instances have many empty or tiny tables.  I was
> merely reacting to the suggestion that the approach should be used for
> npages <= 32; that threshold sounds way too high.

To be clear, no one suggested that. The patch has always had 8 or 10
as a starting point, and I've mentioned 4 and 8 as good possibilities
based on the COPY tests upthread. It was apparent I didn't need to
recompile a bunch of binaries with different thresholds. All I had to
do was compile with a threshold much larger than required, and then
test inserting into X number of pages, to simulate a threshold of X. I
increased X until I saw a regression. That's where the 32 came from;
sorry if that was misleading, in my head it was obvious. I'd be happy
to test other scenarios. I'm not sure how to test redo -- it seems more
difficult to get meaningful results than in the normal case.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Nov 2, 2018 at 7:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > That doesn't seem like an unreasonable argument.  I'm not sure whether
> > > the right threshold is 4 or something a little bigger, but I bet it's
> > > not very large.  It seems important to me that before anybody thinks
> > > about committing this, we construct some kind of destruction case
> > > where repeated scans of the whole table are triggered as frequently as
> > > possible, and then run that test with varying thresholds.
> >
> > Why do you think repeated scans will be a destruction case when there
> > is no FSM for a small table?
>
> That's not what I'm saying.  If we don't have the FSM, we have to
> check every page of the table.  If there's a workload where that
> happens a lot on a table that is just under the size threshold for
> creating the FSM, then it's likely to be a worst case for this patch.
>

That makes sense, and this is the first thing I was worried about
after looking at the initial patch; I suggested a test [1] which can
hit the worst case.


[1] - https://www.postgresql.org/message-id/CAA4eK1%2BhP-jGYWi25-1QMedxeM_0H01s%3D%3D4-t74oEgL2EDVicw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Nov 2, 2018 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > That's not what I'm saying.  If we don't have the FSM, we have to
> > check every page of the table.  If there's a workload where that
> > happens a lot on a table that is just under the size threshold for
> > creating the FSM, then it's likely to be a worst case for this patch.
>
> Hmm, you're assuming something not in evidence: why would that be the
> algorithm?  On a FSM-less table, I'd be inclined to just check the
> last page and then grow the table if the tuple doesn't fit there.
> This would, in many cases, soon result in a FSM being created, but
> I think that's just fine.  The point of the change is to optimize
> for cases where a table *never* gets more than a few inserts.  Not, IMO,
> for cases where a table gets a lot of churn but never has a whole lot of
> live tuples.  In the latter scenario we are far better off having a FSM.
>

In the past, you seem to have suggested an approach of trying each block
[1] for small tables which don't have an FSM.  I think if we do what you
are suggesting now, then we don't need to worry much about any
regression and the code will be somewhat simpler, but OTOH, I don't see
much harm in trying every block if we keep the threshold at no more
than 4.  That would address more cases.

[1] - https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Michael Paquier
Date:
On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote:
> I think it's in evidence, in the form of several messages mentioning a
> flag called try_every_block.
>
> Just checking the last page of the table doesn't sound like a good
> idea to me.  I think that will just lead to a lot of stupid bloat.  It
> seems likely that checking every page of the table is fine for npages
> <= 3, and that would still be a win in a very significant number of
> cases, since lots of instances have many empty or tiny tables.  I was
> merely reacting to the suggestion that the approach should be used for
> npages <= 32; that threshold sounds way too high.

It seems to me that it would be costly for schemas which have one core
table with a couple of records used in many joins with other queries.
Imagine for example a core table like that:
CREATE TABLE us_states (id serial, initials varchar(2));
INSERT INTO us_states VALUES (DEFAULT, 'CA');

If there is a workload where those initials need to be fetched a lot,
this patch could cause a loss.  It looks hard to me to put a straight
number on when not having the FSM is better than having it because that
could be environment-dependent, so there is an argument for making the
default very low, still configurable?
--
Michael

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Nov 4, 2018 at 5:56 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote:
> > I think it's in evidence, in the form of several messages mentioning a
> > flag called try_every_block.
> >
> > Just checking the last page of the table doesn't sound like a good
> > idea to me.  I think that will just lead to a lot of stupid bloat.  It
> > seems likely that checking every page of the table is fine for npages
> > <= 3, and that would still be a win in a very significant number of
> > cases, since lots of instances have many empty or tiny tables.  I was
> > merely reacting to the suggestion that the approach should be used for
> > npages <= 32; that threshold sounds way too high.
>
> It seems to me that it would be costly for schemas which have one core
> table with a couple of records used in many joins with other queries.
> Imagine for example a core table like that:
> CREATE TABLE us_states (id serial, initials varchar(2));
> INSERT INTO us_states VALUES (DEFAULT, 'CA');
>
> If there is a workload where those initials need to be fetched a lot,
> this patch could cause a loss.
>

How would fetching alone cause any loss? If it gets updated, then
there is a chance that we might see some performance impact.

>  It looks hard to me to put a straight
> number on when not having the FSM is better than having it because that
> could be environment-dependent, so there is an argument for making the
> default very low, still configurable?
>

I think 3 or 4 as the threshold should work fine (though we need to
test that thoroughly), as we will anyway avoid having three additional
pages of FSM for such tables.  I am not sure how easy it would be for
users to set this value if we make it configurable, or on what basis
they would configure it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 10/31/18, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems important to me that before anybody thinks
> about committing this, we construct some kind of destruction case
> where repeated scans of the whole table are triggered as frequently as
> possible, and then run that test with varying thresholds.  I might be
> totally wrong, but I bet with a value as large as 32 you will be able
> to find cases where it regresses in a big way.

Here's an attempt at a destruction case: Lobotomize the heap insert
logic such that it never checks the cached target block and has to
call the free space logic for every single insertion, like this:

index ff13c03083..5d5b36af29 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -377,7 +377,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
     else if (bistate && bistate->current_buf != InvalidBuffer)
         targetBlock = BufferGetBlockNumber(bistate->current_buf);
     else
-        targetBlock = RelationGetTargetBlock(relation);
+        targetBlock = InvalidBlockNumber;

     if (targetBlock == InvalidBlockNumber && use_fsm)
     {

(with the threshold patch I had to do additional work)
With the small tuples used in the attached v2 test, this means the
free space logic is called ~225 times per block. The test tables are
pre-filled with one tuple and vacuumed so that the FSMs are already
created when testing the master branch. The patch branch is compiled
with a threshold of 8, but testing inserts of 4 pages will effectively
simulate a threshold of 4, etc. As before, trimmed average of 10 runs,
loading 100 tables each:

# blocks    master    patch
 2          25.1ms    30.3ms
 4          40.7ms    48.1ms
 6          56.6ms    64.7ms
 8          73.1ms    82.0ms

Without this artificial penalty, the 8-block case was about 50ms for
both branches. So if I calculated right, of that 50ms, master is
spending ~0.10ms looking for free space, and the patch is spending
~0.15ms. So, from that perspective, the difference is trivial.
Of course, this is a single client, so not entirely realistic. I think
that shared buffer considerations are most important for deciding the
threshold.
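
(For the arithmetic: the penalty adds roughly 8 blocks x ~225 calls =
~1800 free-space lookups per table. Master's extra ~23ms over ~1800
calls is ~0.013ms per call, and a normal run makes only ~8 such calls,
hence ~0.10ms; the patch's extra ~32ms works out to ~0.018ms per call,
hence ~0.15ms.)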

> We also need to think about what happens on the standby, where the FSM
> is updated in a fairly different way.

Were you referring to performance or just functionality? Because the
threshold works on the standby, but I don't know about the performance
there.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/2/18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> On a FSM-less table, I'd be inclined to just check the
> last page and then grow the table if the tuple doesn't fit there.
> This would, in many cases, soon result in a FSM being created, but
> I think that's just fine.  The point of the change is to optimize
> for cases where a table *never* gets more than a few inserts.  Not, IMO,
> for cases where a table gets a lot of churn but never has a whole lot of
> live tuples.  In the latter scenario we are far better off having a FSM.

and...

On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote:
> Just checking the last page of the table doesn't sound like a good
> idea to me.  I think that will just lead to a lot of stupid bloat.  It
> seems likely that checking every page of the table is fine for npages
> <= 3, and that would still be win in a very significant number of
> cases,

I see the merit of both of these arguments, and it occurred to me that
there is middle ground between checking only the last page and
checking every page: Check the last 3 pages and set the threshold to
6. That way, with npages <= 3, every page will be checked. In the
unlikely case that npages = 6 and the first 3 pages are all wasted
space, that's the amount of space that would have gone to the FSM
anyway, and the relation will likely grow beyond the threshold soon,
at which point the free space will become visible again.
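
A rough sketch of that middle ground (hypothetical names, with
PageHasRoomFor() as an assumed stand-in for the real free-space
check):

/* With no FSM, walk backward over at most the last 3 pages. */
static BlockNumber
try_recent_pages(BlockNumber nblocks, Size len)
{
    BlockNumber blkno = nblocks;
    BlockNumber stop = (nblocks > 3) ? nblocks - 3 : 0;

    while (blkno > stop)
    {
        blkno--;
        if (PageHasRoomFor(blkno, len))     /* assumed helper */
            return blkno;
    }
    return InvalidBlockNumber;  /* caller extends the relation */
}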

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Oct 22, 2018 at 12:14 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think we can avoid using prevBlockNumber and try_every_block, if we
> > maintain a small cache which tells whether a particular block has been
> > tried or not.  What I am envisioning is that while finding the block
> > with free space, if we come to know that the relation in question is
> > small enough that it doesn't have FSM, we can perform a local_search.
> > In local_search, we can enquire the cache for any block that we can try,
> > and if we find any block, we can try inserting into that block;
> > otherwise, we need to extend the relation.  One simple way to imagine
> > such a cache would be an array of structures, where each structure has
> > blkno and status fields.  After we get the usable block, we need to
> > clear the cache, if it exists.
>
> Here is the design I've implemented in the attached v6. There is more
> code than v5, but there's a cleaner separation between freespace.c and
> hio.c, as you preferred.
>

This approach seems better.

> I also think it's more robust. I've expended
> some effort to avoid doing unnecessary system calls to get the number
> of blocks.
> --
>
> For the local, in-memory map, maintain a static array of status
> markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block
> number. This is populated every time we call GetPageWithFreeSpace() on
> small tables with no FSM. The statuses are
>
> 'zero' (beyond the relation)
> 'available to try'
> 'tried already'
>

+/* Status codes for the local map. */
+#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
+#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
+#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */

Instead of maintaining three states, can't we do with two states
(Available and Not Available), basically combining 0 and 2 in your case?
I think it will save some cycles in
fsm_local_set, where each time you need to initialize all the entries
in the map.  I think we can argue that it is not much overhead, but I
think it is better code-wise also if we can make it happen with fewer
states.

Some assorted comments:
1.
 <para>
-Each heap and index relation, except for hash indexes, has a Free Space Map
+Each heap relation, unless it is very small, and each index relation,
+except for hash indexes, has a Free Space Map
 (FSM) to keep track of available space in the relation. It's stored

It appears that line has ended abruptly.

2.
page = BufferGetPage(buffer);
+ targetBlock = BufferGetBlockNumber(buffer);

  if (!PageIsNew(page))
  elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
+ targetBlock,
  RelationGetRelationName(relation));

  PageInit(page, BufferGetPageSize(buffer), 0);
@@ -623,7 +641,18 @@ loop:
  * current backend to make more insertions or not, which is probably a
  * good bet most of the time.  So for now, don't add it to FSM yet.
  */
- RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
+ RelationSetTargetBlock(relation, targetBlock);

Is this related to this patch? If not, I suggest let's do it
separately if required.

3.
 static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
-    uint8 newValue, uint8 minValue);
+ uint8 newValue, uint8 minValue);

This appears to be a spurious change.

4.
@@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size len,
  * target.
  */
  targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
+
+ /*
+ * In case we used an in-memory map of available blocks, reset
+ * it for next use.
+ */
+ if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
+ ClearLocalMap();

How will you clear the local map during error?  I think you need to
clear it in the abort path, and you can name the function
FSMClearLocalMap or something like that.

5.
+/*#define TRACE_TARGETBLOCK */

Debugging leftover, do you want to retain this and related stuff
during the development of the patch?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> +/* Status codes for the local map. */
> +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
> +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
> +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */
>
> Instead of maintaining three states, can't we do with two states
> (Available and Not Available), basically combining 0 and 2 in your case?
> I think it will save some cycles in
> fsm_local_set, where each time you need to initialize all the entries
> in the map.  I think we can argue that it is not much overhead, but I
> think it is better code-wise also if we can make it happen with fewer
> states.

That'd work too, but let's consider this scenario: We have a 2-block
table that has no free space. After trying each block, the local cache
looks like

0123
TT00

Let's say we have to wait to acquire a relation extension lock,
because another backend had already started extending the heap by 1
block. We call GetPageWithFreeSpace() and now the local map looks like

0123
TTA0

By using bitwise OR to set availability, the already-tried blocks
remain as they are. With only 2 states, the map would look like this
instead:

0123
AAAN

If we assume that an insert into the newly-created block 2 will almost
always succeed, we don't have to worry about wasting time re-checking
the first 2 full blocks. Does that sound right to you?


> Some assorted comments:
> 1.
>  <para>
> -Each heap and index relation, except for hash indexes, has a Free Space
> Map
> +Each heap relation, unless it is very small, and each index relation,
> +except for hash indexes, has a Free Space Map
>  (FSM) to keep track of available space in the relation. It's stored
>
> It appears that line has ended abruptly.

Not sure what you're referring to here.

> 2.
> page = BufferGetPage(buffer);
> + targetBlock = BufferGetBlockNumber(buffer);
>
>   if (!PageIsNew(page))
>   elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> - BufferGetBlockNumber(buffer),
> + targetBlock,
>   RelationGetRelationName(relation));
>
>   PageInit(page, BufferGetPageSize(buffer), 0);
> @@ -623,7 +641,18 @@ loop:
>   * current backend to make more insertions or not, which is probably a
>   * good bet most of the time.  So for now, don't add it to FSM yet.
>   */
> - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> + RelationSetTargetBlock(relation, targetBlock);
>
> Is this related to this patch? If not, I suggest let's do it
> separately if required.

I will separate this out.

> 3.
>  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> -    uint8 newValue, uint8 minValue);
> + uint8 newValue, uint8 minValue);
>
> This appears to be a spurious change.

It was intentional, but I will include it separately as above.

> 4.
> @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size
> len,
>   * target.
>   */
>   targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> +
> + /*
> + * In case we used an in-memory map of available blocks, reset
> + * it for next use.
> + */
> + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> + ClearLocalMap();
>
> How will you clear the local map during error?  I think you need to
> clear it in the abort path, and you can name the function
> FSMClearLocalMap or something like that.

That sounds right, and I will rename the function that way. For the
abort path, were you referring to this or somewhere else?

if (!PageIsNew(page))
    elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
         targetBlock,
         RelationGetRelationName(relation));

> 5.
> +/*#define TRACE_TARGETBLOCK */
>
> Debugging leftover, do you want to retain this and related stuff
> during the development of the patch?

I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's
useful for development, but I don't particularly care whether it's in
the final version.

Also, I found an off-by-one error that caused an unnecessary
smgrexists() call in tables with threshold + 1 pages. This will be
fixed in the next version.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > +/* Status codes for the local map. */
> > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
> > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
> > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */
> >
> > Instead of maintaining three states, can't we do with two states
> > (Available and Not Available), basically combining 0 and 2 in your case?
> > I think it will save some cycles in
> > fsm_local_set, where each time you need to initialize all the entries
> > in the map.  I think we can argue that it is not much overhead, but I
> > think it is better code-wise also if we can make it happen with fewer
> > states.
>
> That'd work too, but let's consider this scenario: We have a 2-block
> table that has no free space. After trying each block, the local cache
> looks like
>
> 0123
> TT00
>
> Let's say we have to wait to acquire a relation extension lock,
> because another backend had already started extending the heap by 1
> block. We call GetPageWithFreeSpace() and now the local map looks like
>
> 0123
> TTA0
>
> By using bitwise OR to set availability, the already-tried blocks
> remain as they are. With only 2 states, the map would look like this
> instead:
>
> 0123
> AAAN
>

I expect the below part of the code to go away.
+fsm_local_set(Relation rel, BlockNumber nblocks)
{
..
+ /*
+ * If the blkno is beyond the end of the relation, the status should
+ * be zero already, but make sure it is.  If the blkno is within the
+ * relation, mark it available unless it's already been tried.
+ */
+ for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
+ {
+ if (blkno < nblocks)
+ FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL;
+ else
+ FSMLocalMap[blkno] = FSM_LOCAL_ZERO;
+ }
..
}

In my mind for such a case it should look like below:
0123
NNAN

> If we assume that an insert into the newly-created block 2 will almost
> always succeed, we don't have to worry about wasting time re-checking
> the first 2 full blocks. Does that sound right to you?
>

As explained above, such a situation won't exist.

>
> > Some assorted comments:
> > 1.
> >  <para>
> > -Each heap and index relation, except for hash indexes, has a Free Space
> > Map
> > +Each heap relation, unless it is very small, and each index relation,
> > +except for hash indexes, has a Free Space Map
> >  (FSM) to keep track of available space in the relation. It's stored
> >
> > It appears that line has ended abruptly.
>
> Not sure what you're referring to here.
>

There is a trailing space after "has a Free Space Map", so you can combine it with the next line.

> > 2.
> > page = BufferGetPage(buffer);
> > + targetBlock = BufferGetBlockNumber(buffer);
> >
> >   if (!PageIsNew(page))
> >   elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> > - BufferGetBlockNumber(buffer),
> > + targetBlock,
> >   RelationGetRelationName(relation));
> >
> >   PageInit(page, BufferGetPageSize(buffer), 0);
> > @@ -623,7 +641,18 @@ loop:
> >   * current backend to make more insertions or not, which is probably a
> >   * good bet most of the time.  So for now, don't add it to FSM yet.
> >   */
> > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> > + RelationSetTargetBlock(relation, targetBlock);
> >
> > Is this related to this patch? If not, I suggest let's do it
> > separately if required.
>
> I will separate this out.
>
> > 3.
> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> > -    uint8 newValue, uint8 minValue);
> > + uint8 newValue, uint8 minValue);
> >
> > This appears to be a spurious change.
>
> It was intentional, but I will include it separately as above.
>
> > 4.
> > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size
> > len,
> >   * target.
> >   */
> >   targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> > +
> > + /*
> > + * In case we used an in-memory map of available blocks, reset
> > + * it for next use.
> > + */
> > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> > + ClearLocalMap();
> >
> > How will you clear the local map during error?  I think you need to
> > clear it in abort path and you can name the function as
> > FSMClearLocalMap or something like that.
>
> That sounds right, and I will rename the function that way. For the
> abort path, were you referring to this or somewhere else?
>
> if (!PageIsNew(page))
>         elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
>                  targetBlock,
>                  RelationGetRelationName(relation));
>

I think an error might come from any other place between when you set
the map and before it gets cleared (like any intermediate buffer- and
pin-related APIs).

> > 5.
> > +/*#define TRACE_TARGETBLOCK */
> >
> > Debugging leftover, do you want to retain this and related stuff
> > during the development of patch?
>
> I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's
> useful for development, but I don't particularly care whether it's in
> the final version.
>

Okay, so if you want to retain it for the period of development, then
I am fine with it.  We can see at the end if it makes sense to retain
it.

> Also, I found an off-by-one error that caused an unnecessary
> smgrexists() call in tables with threshold + 1 pages. This will be
> fixed in the next version.
>

Thanks.

One other thing that slightly bothers me is the call to
RelationGetNumberOfBlocks via fsm_allow_writes.  It seems that call
will happen quite frequently in this code-path and can have some
performance impact.  As of now, I don't have any idea to avoid it or
reduce it more than what you already have in the patch, but I think we
should try some more to avoid it.  Let me know if you have any ideas
around that?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote:
>> Let's say we have to wait to acquire a relation extension lock,
>> because another backend had already started extending the heap by 1
>> block. We call GetPageWithFreeSpace() and now the local map looks like
>>
>> 0123
>> TTA0
>>
>> By using bitwise OR to set availability, the already-tried blocks
>> remain as they are. With only 2 states, the map would look like this
>> instead:
>>
>> 0123
>> AAAN
>>

> In my mind for such a case it should look like below:
> 0123
> NNAN

Okay, to retain that behavior with only 2 status codes, I have
implemented the map as a struct with 2 members: the cached number of
blocks, plus the same array I had before. This also allows a more
efficient implementation at the micro level. I just need to do some
more testing on it.
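
For illustration, the struct looks roughly like this (just a sketch;
the exact names may differ in the next version):

    #define FSM_LOCAL_NOT_AVAIL 0x00    /* beyond EOF, or already tried */
    #define FSM_LOCAL_AVAIL     0x01    /* available to try */

    typedef struct
    {
        BlockNumber nblocks;    /* cached number of blocks in the relation */
        uint8       map[HEAP_FSM_CREATION_THRESHOLD];
    } FSMLocalMap;

    static FSMLocalMap fsm_local_map;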

[ abortive states ]
> I think it might come from any other place between when you set it and
> before it got cleared (like any intermediate buffer and pin related
> API's).

Okay, I will look into that.

> One other thing that slightly bothers me is the call to
> RelationGetNumberOfBlocks via fsm_allow_writes.  It seems that call
> will happen quite frequently in this code-path and can have some
> performance impact.  As of now, I don't have any idea to avoid it or
> reduce it more than what you already have in the patch, but I think we
> should try some more to avoid it.  Let me know if you have any ideas
> around that?

FWIW, I believe that the callers of RecordPageWithFreeSpace() will
almost always avoid that call. Otherwise, there is at least one detail
that could use attention: if rel->rd_rel->relpages shows fewer pages
than the threshold, then the code doesn't trust it to be true. Might
be worth revisiting.
Aside from that, I will have to think about it.
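
(If we did revisit it, one hedged idea -- the logic here is mine, not
in the patch -- would be to trust relpages when it is small:

    /* Sketch: skip the lseek() behind RelationGetNumberOfBlocks() when
     * the catalog's cached page count is already below the threshold. */
    if (rel->rd_rel->relpages <= HEAP_FSM_CREATION_THRESHOLD)
        nblocks = rel->rd_rel->relpages;
    else
        nblocks = RelationGetNumberOfBlocks(rel);

The catch, as noted above, is whether relpages can be trusted here.)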

More generally, I have a couple ideas about performance:

1. Only mark available every other block such that visible blocks are
interleaved as the relation extends. To explain, this diagram shows a
relation extending, with A meaning marked available and N meaning
marked not-available (see the sketch after this list).

A
NA
ANA
NANA

So for a 3-block table, we never check block 1. Any free space it has
acquired will become visible when it extends to 4 blocks. For a
4-block threshold, we only check 2 blocks or less. This reduces the
number of lock/pin events but still controls bloat. We could also
check both blocks of a 2-block table.

2. During manual testing I seem to remember times that the FSM code
was invoked even though I expected the smgr entry to have a cached
target block. Perhaps VACUUM or something is clearing that away
unnecessarily. It seems worthwhile to verify and investigate, but that
seems like a separate project.
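
As a sketch of idea 1, assuming the two-state map described above (the
modulo trick is just one way to reproduce the diagram):

    /*
     * Mark the last block available and alternate from there backwards,
     * so a 3-block table gets A,N,A and a 4-block table gets N,A,N,A.
     */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if ((nblocks - 1 - blkno) % 2 == 0)
            fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
        else
            fsm_local_map.map[blkno] = FSM_LOCAL_NOT_AVAIL;
    }

    /* Per the note above, we could still check both blocks of a 2-block table. */
    if (nblocks == 2)
        fsm_local_map.map[0] = FSM_LOCAL_AVAIL;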

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Nov 19, 2018 at 4:40 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote:
> >> Let's say we have to wait to acquire a relation extension lock,
> >> because another backend had already started extending the heap by 1
> >> block. We call GetPageWithFreeSpace() and now the local map looks like
> >>
> >> 0123
> >> TTA0
> >>
> >> By using bitwise OR to set availability, the already-tried blocks
> >> remain as they are. With only 2 states, the map would look like this
> >> instead:
> >>
> >> 0123
> >> AAAN
> >>
>
> > In my mind for such a case it should look like below:
> > 0123
> > NNAN
>
> Okay, to retain that behavior with only 2 status codes, I have
> implemented the map as a struct with 2 members: the cached number of
> blocks, plus the same array I had before. This also allows a more
> efficient implementation at the micro level. I just need to do some
> more testing on it.
>

Okay.

> [ abortive states ]
> > I think it might come from any other place between when you set it and
> > before it got cleared (like any intermediate buffer and pin related
> > API's).
>
> Okay, I will look into that.
>
> > One other thing that slightly bothers me is the call to
> > RelationGetNumberOfBlocks via fsm_allow_writes.  It seems that call
> > will happen quite frequently in this code-path and can have some
> > performance impact.  As of now, I don't have any idea to avoid it or
> > reduce it more than what you already have in the patch, but I think we
> > should try some more to avoid it.  Let me know if you have any ideas
> > around that?
>
> FWIW, I believe that the callers of RecordPageWithFreeSpace() will
> almost always avoid that call. Otherwise, there is at least one detail
> that could use attention: if rel->rd_rel->relpages shows fewer pages
> than the threshold, then the code doesn't trust it to be true. Might
> be worth revisiting.
>

I think it is less of a concern when called from vacuum code path.

> Aside from that, I will have to think about it.
>
> More generally, I have a couple ideas about performance:
>
> 1. Only mark available every other block such that visible blocks are
> interleaved as the relation extends. To explain, this diagram shows a
> relation extending, with A meaning marked available and N meaning
> marked not-available.
>
> A
> NA
> ANA
> NANA
>
> So for a 3-block table, we never check block 1. Any free space it has
> acquired will become visible when it extends to 4 blocks. For a
> 4-block threshold, we only check 2 blocks or less. This reduces the
> number of lock/pin events but still controls bloat. We could also
> check both blocks of a 2-block table.
>

We can try something like this if we see there is any visible
performance hit in some scenario.

> 2. During manual testing I seem to remember times that the FSM code
> was invoked even though I expected the smgr entry to have a cached
> target block. Perhaps VACUUM or something is clearing that away
> unnecessarily. It seems worthwhile to verify and investigate, but that
> seems like a separate project.
>

makes sense, let's not get distracted by stuff that is not related to
this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
I wrote:

> On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [ abortive states ]
>> I think it might come from any other place between when you set it and
>> before it got cleared (like any intermediate buffer and pin related
>> API's).
>
> Okay, I will look into that.

LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't
raise errors at this level. I don't immediately see any additional good
places from which to clear the local map.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Nov 20, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> I wrote:
>
> > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > [ abortive states ]
> >> I think it might come from any other place between when you set it and
> >> before it got cleared (like any intermediate buffer and pin related
> >> API's).
> >
> > Okay, I will look into that.
>
> LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't
> raise errors at this level. I don't immediately see any additional good
> places from which to clear the local map.
>

LockBuffer()->LWLockAcquire() can error out.  Similarly,
ReadBuffer()->ReadBufferExtended() and calls below it can error out.
To handle them, you need to add a call to clear the local map in the
AbortTransaction code path.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:

I've attached v8, which includes the 2-state map and addresses the points below:

> Some assorted comments:
> 1.
>  <para>
> -Each heap and index relation, except for hash indexes, has a Free Space
> Map
> +Each heap relation, unless it is very small, and each index relation,
> +except for hash indexes, has a Free Space Map
>  (FSM) to keep track of available space in the relation. It's stored
>
> It appears that line has ended abruptly.

Revised.

> 2.
> page = BufferGetPage(buffer);
> + targetBlock = BufferGetBlockNumber(buffer);
>
>   if (!PageIsNew(page))
>   elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> - BufferGetBlockNumber(buffer),
> + targetBlock,
>   RelationGetRelationName(relation));
>
>   PageInit(page, BufferGetPageSize(buffer), 0);
> @@ -623,7 +641,18 @@ loop:
>   * current backend to make more insertions or not, which is probably a
>   * good bet most of the time.  So for now, don't add it to FSM yet.
>   */
> - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> + RelationSetTargetBlock(relation, targetBlock);
>
> Is this related to this patch? If not, I suggest let's do it
> separately if required.
>
> 3.
>  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> -    uint8 newValue, uint8 minValue);
> + uint8 newValue, uint8 minValue);
>
> This appears to be a spurious change.

2 and 3 are separated into 0001.

> 4.
> @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size
> len,
>   * target.
>   */
>   targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> +
> + /*
> + * In case we used an in-memory map of available blocks, reset
> + * it for next use.
> + */
> + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> + ClearLocalMap();
>
> How will you clear the local map during error?  I think you need to
> clear it in abort path and you can name the function as
> FSMClearLocalMap or something like that.

Done. I've put this call last before abort processing.

> 5.
> +/*#define TRACE_TARGETBLOCK */
>
> Debugging leftover, do you want to retain this and related stuff
> during the development of patch?

I have kept it aside as a separate patch but not attached it for now.

Also, we don't quite have a consensus on the threshold value, but I
have set it to 4 pages for v8. If this is still considered too
expensive (and basic tests show it shouldn't be), I suspect it'd be
better to interleave the available block numbers as described a couple
days ago than lower the threshold further.

I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> >
> > 3.
> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> > -    uint8 newValue, uint8 minValue);
> > + uint8 newValue, uint8 minValue);
> >
> > This appears to be a spurious change.
>
> 2 and 3 are separated into 0001.
>

Is the point 3 change related to pgindent?  I think even if you want
these, then don't prepare other patches on top of this, keep it
entirely separate.

>
> Also, we don't quite have a consensus on the threshold value, but I
> have set it to 4 pages for v8. If this is still considered too
> expensive (and basic tests show it shouldn't be), I suspect it'd be
> better to interleave the available block numbers as described a couple
> days ago than lower the threshold further.
>

Can you please repeat the copy test you have done above with
fillfactor as 20 and 30?

> I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset.
>

Cool, I also think so.

Few more comments:
-------------------------------
1. I think we can add some test(s) to test the new functionality, may
be something on the lines of what Robert has originally provided as an
example of this behavior [1].

2.
@@ -2554,6 +2555,12 @@ AbortTransaction(void)
  s->parallelModeLevel = 0;
  }

+ /*
+ * In case we aborted during RelationGetBufferForTuple(),
+ * clear the local map of heap pages.
+ */
+ FSMClearLocalMap();
+

A similar call is required in the AbortSubTransaction function as
well.  I suggest adding it after pgstat_progress_end_command in both
functions.
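
(A sketch of the corresponding AbortSubTransaction hunk, mirroring the
one above:

@@ AbortSubTransaction(void)
  pgstat_progress_end_command();

+ /*
+  * In case we aborted during RelationGetBufferForTuple(),
+  * clear the local map of heap pages.
+  */
+ FSMClearLocalMap();
)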

3.
GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
{
..
+ if (target_block == InvalidBlockNumber &&
+ rel->rd_rel->relkind == RELKIND_RELATION)
+ {
+ nblocks = RelationGetNumberOfBlocks(rel);
+
+ if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
+ {
+ /*
+ * If the FSM knows nothing of the rel, try the last page before
+ * we give up and extend.  This avoids one-tuple-per-page syndrome
+ * during bootstrapping or in a recently-started system.
+ */
+ target_block = nblocks - 1;
+ }
..
}

Moving this check inside GetPageWithFreeSpace has one disadvantage: we
will always consider the last block, which can have some inadvertent
effects.  Consider when this function gets called from
RelationGetBufferForTuple just before extension; it can consider
checking the last block even though that was already done at the
beginning when GetPageWithFreeSpace was called.  I am not completely
sure how much of a case this is to worry about, because it will help to
check the last block when one is concurrently added and the FSM is not
yet updated for it.  I am slightly worried because the unpatched code
doesn't care about such a case, and we have no intention to change this
behaviour.  What do you think?

4. You have mentioned above that "system catalogs created during
bootstrap still have a FSM if they have any data." and I can also see
this behavior, have you investigated this point further?

5. Your logic to update FSM on standby seems okay, but can you show
some tests which proves its sanity?

[1] - https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote:
>> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> >
>> > 3.
>> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16
>> > slot,
>> > -    uint8 newValue, uint8 minValue);
>> > + uint8 newValue, uint8 minValue);
>> >
>> > This appears to be a spurious change.
>>
>> 2 and 3 are separated into 0001.
>>
>
> Is the point 3 change related to pgindent?  I think even if you want
> these, then don't prepare other patches on top of this, keep it
> entirely separate.

There are some places in the codebase that have spaces after tabs for
no apparent reason (not to line up function parameters or pointer
declarations). pgindent hasn't been able to fix this. If you wish, you
are of course free to apply 0001 separately at any time.

>> Also, we don't quite have a consensus on the threshold value, but I
>> have set it to 4 pages for v8. If this is still considered too
>> expensive (and basic tests show it shouldn't be), I suspect it'd be
>> better to interleave the available block numbers as described a couple
>> days ago than lower the threshold further.
>>
>
> Can you please repeat the copy test you have done above with
> fillfactor as 20 and 30?

I did two kinds of tests. The first had a fill-factor of 10 [1], the
second had the default storage, but I prevented the backend from
caching the target block [2], to fully exercise the free space code.
Would you like me to repeat the first one with 20 and 30? And do you
think it is useful enough to test the copying of 4 blocks and not
smaller numbers?

> Few more comments:
> -------------------------------
> 1. I think we can add some test(s) to test the new functionality, may
> be something on the lines of what Robert has originally provided as an
> example of this behavior [1].

Maybe the SQL script attached to [3] (which I probably based on
Robert's report) can be cleaned up into a regression test.

> 3.
> GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
> {
> ..
> + if (target_block == InvalidBlockNumber &&
> + rel->rd_rel->relkind == RELKIND_RELATION)
> + {
> + nblocks = RelationGetNumberOfBlocks(rel);
> +
> + if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
> + {
> + /*
> + * If the FSM knows nothing of the rel, try the last page before
> + * we give up and extend.  This avoids one-tuple-per-page syndrome
> + * during bootstrapping or in a recently-started system.
> + */
> + target_block = nblocks - 1;
> + }
> ..
> }
>
> Moving this check inside GetPageWithFreeSpace has one disadvantage, we
> will always consider last block which can have some inadvertent
> effects.  Consider when this function gets called from
> RelationGetBufferForTuple just before extension, it can consider to
> check for the last block even though that is already being done in the
> beginning when GetPageWithFreeSpace was called.  I am not completely
> sure how much this is a case to worry because  it will help to check
> last block when the same is concurrently added and FSM is not updated
> for same.  I am slightly worried because the unpatched code doesn't
> care for such case and we have no intention to change this behaviour.
> What do you think?

I see what you mean. If the other backend extended by 1 block, the
intention is to keep it out of the FSM at first, and by extension, not
visible in other ways. The comment implies that's debatable, but I
agree we shouldn't change that without a reason to. One simple idea is
to add a third boolean parameter to GetPageWithFreeSpace() to control
whether it gives up if the FSM fork doesn't indicate free space, like

    if (target_block == InvalidBlockNumber &&
        rel->rd_rel->relkind == RELKIND_RELATION &&
        !check_fsm_only)
    {
        nblocks = RelationGetNumberOfBlocks(rel);
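
Spelled out in full (just a sketch; check_fsm_only is a working name),
that block would read:

    if (target_block == InvalidBlockNumber &&
        rel->rd_rel->relkind == RELKIND_RELATION &&
        !check_fsm_only)
    {
        nblocks = RelationGetNumberOfBlocks(rel);

        if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
        {
            /*
             * If the FSM knows nothing of the rel, try the last page
             * before we give up and extend.
             */
            target_block = nblocks - 1;
        }
    }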

> 4. You have mentioned above that "system catalogs created during
> bootstrap still have a FSM if they have any data." and I can also see
> this behavior, have you investigated this point further?

Code reading didn't uncover the cause. I might have to step through
with a debugger or something similar. I should find time for that next
month.

> 5. Your logic to update FSM on standby seems okay, but can you show
> some tests which proves its sanity?

I believe to convince myself it was working, I used the individual
commands in the sql file in [3], then used the size function on the
secondary. I'll redo that to verify.

--
[1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJVSVGWMXzsqYpPhO3Snz4n5y8Tq-QiviuSCKyB5czCTnq9rzA%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ%40mail.gmail.com


-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Nov 26, 2018 at 3:46 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote:
> >> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> >
> >> > 3.
> >> >  static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16
> >> > slot,
> >> > -    uint8 newValue, uint8 minValue);
> >> > + uint8 newValue, uint8 minValue);
> >> >
> >> > This appears to be a spurious change.
> >>
> >> 2 and 3 are separated into 0001.
> >>
> >
> > Is the point 3 change related to pgindent?  I think even if you want
> > these, then don't prepare other patches on top of this, keep it
> > entirely separate.
>
> There are some places in the codebase that have spaces after tabs for
> no apparent reason (not to line up function parameters or pointer
> declarations). pgindent hasn't been able to fix this. If you wish, you
> are of course free to apply 0001 separately at any time.
>

I am not sure that I am interested in generally changing the parts of
code that are not directly related to this patch, feel free to post
them separately.

> >> Also, we don't quite have a consensus on the threshold value, but I
> >> have set it to 4 pages for v8. If this is still considered too
> >> expensive (and basic tests show it shouldn't be), I suspect it'd be
> >> better to interleave the available block numbers as described a couple
> >> days ago than lower the threshold further.
> >>
> >
> > Can you please repeat the copy test you have done above with
> > fillfactor as 20 and 30?
>
> I did two kinds of tests. The first had a fill-factor of 10 [1], the
> second had the default storage, but I prevented the backend from
> caching the target block [2], to fully exercise the free space code.
> Would you like me to repeat the first one with 20 and 30?
>

Yes.

> And do you
> think it is useful enough to test the copying of 4 blocks and not
> smaller numbers?
>

You can try that, but I am mainly interested in 4 as the threshold, or
maybe higher, to see where we start losing.  I think 4 should be a
reasonable default for this patch; if later anyone wants to extend it,
we might want to provide a table-level knob.

> > Few more comments:
> > -------------------------------
> > 1. I think we can add some test(s) to test the new functionality, may
> > be something on the lines of what Robert has originally provided as an
> > example of this behavior [1].
>
> Maybe the SQL script attached to [3] (which I probably based on
> Robert's report) can be cleaned up into a regression test.
>

Yeah, something on those lines works for me.

> > 3.
> > GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
> > {
> > ..
> > + if (target_block == InvalidBlockNumber &&
> > + rel->rd_rel->relkind == RELKIND_RELATION)
> > + {
> > + nblocks = RelationGetNumberOfBlocks(rel);
> > +
> > + if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
> > + {
> > + /*
> > + * If the FSM knows nothing of the rel, try the last page before
> > + * we give up and extend.  This avoids one-tuple-per-page syndrome
> > + * during bootstrapping or in a recently-started system.
> > + */
> > + target_block = nblocks - 1;
> > + }
> > ..
> > }
> >
> > Moving this check inside GetPageWithFreeSpace has one disadvantage, we
> > will always consider last block which can have some inadvertent
> > effects.  Consider when this function gets called from
> > RelationGetBufferForTuple just before extension, it can consider to
> > check for the last block even though that is already being done in the
> > beginning when GetPageWithFreeSpace was called.  I am not completely
> > sure how much this is a case to worry because  it will help to check
> > last block when the same is concurrently added and FSM is not updated
> > for same.  I am slightly worried because the unpatched code doesn't
> > care for such case and we have no intention to change this behaviour.
> > What do you think?
>
> I see what you mean. If the other backend extended by 1 block, the
> intention is to keep it out of the FSM at first, and by extension, not
> visible in other ways. The comment implies that's debatable, but I
> agree we shouldn't change that without a reason to. One simple idea is
> add a 3rd boolean parameter to GetPageWithFreeSpace() to control
> whether it gives up if the FSM fork doesn't indicate free space, like
>
>         if (target_block == InvalidBlockNumber &&
>                 rel->rd_rel->relkind == RELKIND_RELATION &&
>                 !check_fsm_only)
>         {
>                 nblocks = RelationGetNumberOfBlocks(rel);
>

I have the exact fix in my mind, so let's do it that way.

> > 4. You have mentioned above that "system catalogs created during
> > bootstrap still have a FSM if they have any data." and I can also see
> > this behavior, have you investigated this point further?
>
> Code reading didn't uncover the cause. I might have to step through
> with a debugger or something similar. I should find time for that next
> month.
>

Okay.

> > 5. Your logic to update FSM on standby seems okay, but can you show
> > some tests which proves its sanity?
>
> I believe to convince myself it was working, I used the individual
> commands in the sql file in [3], then used the size function on the
> secondary. I'll redo that to verify.
>

Okay.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
> Is the point 3 change related to pgindent?  I think even if you want
> these, then don't prepare other patches on top of this, keep it
> entirely separate.

Both removed.

>> Also, we don't quite have a consensus on the threshold value, but I
>> have set it to 4 pages for v8. If this is still considered too
>> expensive (and basic tests show it shouldn't be), I suspect it'd be
>> better to interleave the available block numbers as described a couple
>> days ago than lower the threshold further.
>>
>
> Can you please repeat the copy test you have done above with
> fillfactor as 20 and 30?

I will send the results in a separate email soon.

> Few more comments:
> -------------------------------
> 1. I think we can add some test(s) to test the new functionality, may
> be something on the lines of what Robert has originally provided as an
> example of this behavior [1].

Done. I tried adding it to several schedules, but for some reason
vacuuming an empty table failed to truncate the heap to 0 blocks.
Putting the test in its own group fixed the problem, but that doesn't
seem ideal.

> 2.
> The similar call is required in AbortSubTransaction function as well.
> I suggest to add it after pgstat_progress_end_command in both
> functions.

Done.

> 3.
>> agree we shouldn't change that without a reason to. One simple idea is
>> add a 3rd boolean parameter to GetPageWithFreeSpace() to control
>> whether it gives up if the FSM fork doesn't indicate free space, like
> I have the exact fix in my mind, so let's do it that way.

Done. This also reverts comments and variable names that referred to
updating the local map after relation extension.

While at it, I changed a couple conditionals to check the locally
cached nblocks rather than the threshold. No functional change, but
looks more precise. Might save a few cycles as well.
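
(For instance, a check like the one from an earlier version might
change along these lines -- shown as a sketch, not the exact hunk:

- if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
+ if (targetBlock < fsm_local_map.nblocks)
      FSMClearLocalMap();

which reflects the actual cached relation size rather than the
compile-time maximum.)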

>> > 5. Your logic to update FSM on standby seems okay, but can you show
>> > some tests which proves its sanity?
>>
>> I believe to convince myself it was working, I used the individual
>> commands in the sql file in [3], then used the size function on the
>> secondary. I'll redo that to verify.

I've verified the standby behaves precisely as the primary, as far as
the aforementioned script goes.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
> > Few more comments:
> > -------------------------------
> > 1. I think we can add some test(s) to test the new functionality, may
> > be something on the lines of what Robert has originally provided as an
> > example of this behavior [1].
>
> Done. I tried adding it to several schedules, but for some reason
> vacuuming an empty table failed to truncate the heap to 0 blocks.
> Putting the test in its own group fixed the problem, but that doesn't
> seem ideal.
>

It might be because it fails the should_attempt_truncation() check.
See below code:

if (should_attempt_truncation(vacrelstats))
    lazy_truncate_heap(onerel, vacrelstats, vac_strategy);


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
>> Done. I tried adding it to several schedules, but for some reason
>> vacuuming an empty table failed to truncate the heap to 0 blocks.
>> Putting the test in its own group fixed the problem, but that doesn't
>> seem ideal.
>>
>
> It might be because it fails the should_attempt_truncation() check.
> See below code:
>
> if (should_attempt_truncation(vacrelstats))
>     lazy_truncate_heap(onerel, vacrelstats, vac_strategy);

I see. I think truncating the FSM is not essential to show either the
old or new behavior -- I could skip that portion to enable running the
test in a parallel group.

>> Can you please repeat the copy test you have done above with
>> fillfactor as 20 and 30?
>
> I will send the results in a separate email soon.

I ran the attached scripts, which populate 100 tables with either 4 or
8 blocks. The test tables were pre-filled with one tuple and vacuumed
so that the FSMs were already created when testing the master branch.
The patch branch was compiled with a threshold of 8, but testing
inserts of 4 pages effectively simulates a threshold of 4. Config was
stock, except for fsync = off. I took the average of 40 runs (2
complete tests of 20 runs each) after removing the 10% highest and
lowest:

fillfactor=20
# blocks    master        patch
4            19.1ms        17.5ms
8            33.4ms        30.9ms

fillfactor=30
# blocks    master        patch
4            20.1ms        19.7ms
8            34.7ms        34.9ms

It seems the patch might be a bit faster with fillfactor=20, but I'm
at a loss as to why that would be. Previous testing with a higher
threshold showed a significant performance penalty starting around 10
blocks [1], but that used truncation rather than deletion, and had a
fill-factor of 10.

--
[1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Dec 1, 2018 at 12:42 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
> >> Done. I tried adding it to several schedules, but for some reason
> >> vacuuming an empty table failed to truncate the heap to 0 blocks.
> >> Putting the test in its own group fixed the problem, but that doesn't
> >> seem ideal.
> >>
> >
> > It might be because it fails the should_attempt_truncation() check.
> > See below code:
> >
> > if (should_attempt_truncation(vacrelstats))
> >     lazy_truncate_heap(onerel, vacrelstats, vac_strategy);
>
> I see. I think truncating the FSM is not essential to show either the
> old or new behavior -- I could skip that portion to enable running the
> test in a parallel group.
>
> >> Can you please repeat the copy test you have done above with
> >> fillfactor as 20 and 30?
> >
> > I will send the results in a separate email soon.
>
> I ran the attached scripts which populates 100 tables with either 4 or
> 8 blocks. The test tables were pre-filled with one tuple and vacuumed
> so that the FSMs were already created when testing the master branch.
> The patch branch was compiled with a threshold of 8, but testing
> inserts of 4 pages effectively simulates a threshold of 4. Config was
> stock, except for fsync = off. I took the average of 40 runs (2
> complete tests of 20 runs each) after removing the 10% highest and
> lowest:
>
> fillfactor=20
> # blocks        master          patch
> 4                       19.1ms          17.5ms
> 8                       33.4ms          30.9ms
>
> fillfactor=30
> # blocks        master          patch
> 4                       20.1ms          19.7ms
> 8                       34.7ms          34.9ms
>
> It seems the patch might be a bit faster with fillfactor=20, but I'm
> at a loss as to why that would be.
>

I see that in your previous tests with the patch, the performance was
also slightly better.  One probable reason could be that for small
tables the total number of pages accessed via shared buffers is higher
without the patch (probably 3 FSM pages + 4 relation pages).  With the
patch, you only need to access the 4 relation pages.  The other
overhead of the patch (retrying each page) seems to be compensated by
skipping the FSM search.  I think you can check perf profiles once to
confirm the same.

> Previous testing with a higher
> threshold showed a significant performance penalty starting around 10
> blocks [1], but that used truncation rather than deletion, and had a
> fill-factor of 10.
>

Can you check whether the number of pages after the test is the same
with and without the patch in this setup?




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/1/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Can you check whether the number of pages after test are the same with
> and without a patch in this setup?

I did verify that the number of pages was as intended.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
>

- * Copy/link any fsm and vm files, if they exist
+ * Copy/link any fsm and vm files, if they exist and if they would
+ * be created in the new cluster.
   */
- transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
+ if (maps[mapnum].relkind != RELKIND_RELATION ||
+     first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ ||
+     GET_MAJOR_VERSION(new_cluster.major_version) <= 1100)
+     (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);

> During pg_upgrade, skip transfer of FSMs if they wouldn't have been created on the new cluster.

I think in some cases it won't be consistent with HEAD's behavior.
After truncate, we leave the FSM as it is, so in the case where the
relation was truncated before the upgrade, we won't create the FSM in
the new cluster, and that will be inconsistent with the behavior of
HEAD.  I think a similar anomaly will be there when we delete rows from
the table such that after deletion the size of the relation becomes
smaller than HEAP_FSM_CREATION_THRESHOLD.

I am not sure if it is a good idea to *not* transfer FSM files during
upgrade unless we ensure that we remove the FSM whenever the relation
size falls below HEAP_FSM_CREATION_THRESHOLD.  What do you think?  BTW,
what is your reasoning for not removing the FSM on truncate?

Does anybody else have an opinion on this matter?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
> >
>
v8 code:
+fsm_local_set(Relation rel, BlockNumber new_nblocks)
+{
+ BlockNumber blkno,
+ cached_target_block;
+
+ /*
+ * Mark blocks available starting after the last block number we have
+ * cached, and ending at the current last block in the relation.
+ * When we first set the map, this will flag all blocks as available
+ * to try.  If we reset the map while waiting for a relation
+ * extension lock, this will only flag new blocks as available,
+ * if any were created by another backend.
+ */
+ for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++)
+ fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;

v9 code:
+static void
+fsm_local_set(Relation rel, BlockNumber nblocks)
+{
+ BlockNumber blkno,
+ cached_target_block;
+
+ for (blkno = 0; blkno < nblocks; blkno++)
+ fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;

What is the reason for the above code change in the latest patch version?

It would be good if you add a few comments atop the functions
GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and
RecordPageWithFreeSpace about their interaction with the local map.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> During pg_upgrade, skip transfer of FSMs if they wouldn't have been
>> created on the new cluster.
>
> I think in some cases, it won't be consistent with HEAD's behavior.
> After truncate, we leave the FSM as it is, so the case where before
> upgrade the relation was truncated, we won't create the FSM in new
> cluster and that will be inconsistent with the behavior of HEAD.

To be precise, with the TRUNCATE statement, the FSM (everything but
the main relation fork, I think) is deleted, but using DELETE to
remove all rows from the table will preserve the forks. In the latter
case, when VACUUM truncates the FSM, it removes all leaf pages leaving
behind the root page and one mid-level page. I haven't changed this
behavior.

> I think similar anomaly will be there when we delete rows from the table
> such that after deletion size of relation becomes smaller than
> HEAP_FSM_CREATION_THRESHOLD.

Yes, in that case there will be inconsistency, but I'm comfortable
with it. Others may not be.

> I am not sure if it is a good idea to *not* transfer FSM files during
> upgrade unless we ensure that we remove FSM whenever the relation size
> falls below HEAP_FSM_CREATION_THRESHOLD.  What do you think?  BTW,
> what is your reasoning for not removing FSM on truncate?

My reasoning is that if we ever went past the threshold, it's likely
we'll do so again, so I didn't feel it was worth the extra code
complexity to remove the FSM. In the pg_upgrade case, however, it is
simple to not copy the FSM.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
>> >
>>
> v8 code:
> +fsm_local_set(Relation rel, BlockNumber new_nblocks)
> +{
> + BlockNumber blkno,
> + cached_target_block;
> +
> + /*
> + * Mark blocks available starting after the last block number we have
> + * cached, and ending at the current last block in the relation.
> + * When we first set the map, this will flag all blocks as available
> + * to try.  If we reset the map while waiting for a relation
> + * extension lock, this will only flag new blocks as available,
> + * if any were created by another backend.
> + */
> + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++)
> + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
>
> v9 code:
> +static void
> +fsm_local_set(Relation rel, BlockNumber nblocks)
> +{
> + BlockNumber blkno,
> + cached_target_block;
> +
> + for (blkno = 0; blkno < nblocks; blkno++)
> + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
>
> What is the reason for the above code change in the latest patch version?

Per your recent comment, we no longer check relation size if we waited
on a relation extension lock, so this is essentially a reversion to an
earlier version. Keeping v8 would have the advantage that it'd be
simple to change our minds about this. Do you have an opinion about
that?

> It would be good if you add few comments atop functions
> GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and
> RecordPageWithFreeSpace about their interaction with local map.

Good idea, will do.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote:
> >> >
> >>
> > v8 code:
> > +fsm_local_set(Relation rel, BlockNumber new_nblocks)
> > +{
> > + BlockNumber blkno,
> > + cached_target_block;
> > +
> > + /*
> > + * Mark blocks available starting after the last block number we have
> > + * cached, and ending at the current last block in the relation.
> > + * When we first set the map, this will flag all blocks as available
> > + * to try.  If we reset the map while waiting for a relation
> > + * extension lock, this will only flag new blocks as available,
> > + * if any were created by another backend.
> > + */
> > + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++)
> > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> >
> > v9 code:
> > +static void
> > +fsm_local_set(Relation rel, BlockNumber nblocks)
> > +{
> > + BlockNumber blkno,
> > + cached_target_block;
> > +
> > + for (blkno = 0; blkno < nblocks; blkno++)
> > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> >
> > What is the reason for the above code change in the latest patch version?
>
> Per your recent comment, we no longer check relation size if we waited
> on a relation extension lock, so this is essentially a reversion to an
> earlier version.
>

fsm_local_set is being called from RecordAndGetPageWithFreeSpace and
GetPageWithFreeSpace whereas the change we have discussed was specific
to GetPageWithFreeSpace, so not sure if we need any change in
fsm_local_set.

> Keeping v8 would have the advantage that it'd be
> simple to change our minds about this. Do you have an opinion about
> that?
>
> > It would be good if you add few comments atop functions
> > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and
> > RecordPageWithFreeSpace about their interaction with local map.
>
> Good idea, will do.
>

Thanks.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote:
>> Per your recent comment, we no longer check relation size if we waited
>> on a relation extension lock, so this is essentially a reversion to an
>> earlier version.
>>
>
> fsm_local_set is being called from RecordAndGetPageWithFreeSpace and
> GetPageWithFreeSpace whereas the change we have discussed was specific
> to GetPageWithFreeSpace, so not sure if we need any change in
> fsm_local_set.

Not needed, but I wrongly assumed you'd think it unclear otherwise.
I've now restored the generality and updated the comments to be closer
to v8.

> It would be good if you add few comments atop functions
> GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and
> RecordPageWithFreeSpace about their interaction with local map.

Done. Also additional minor comment editing.

I've added an additional regression test for finding the right block
and removed a test I thought was redundant. I've kept the test file in
its own schedule.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> I've added an additional regression test for finding the right block
> and removed a test I thought was redundant. I've kept the test file in
> its own schedule.
>

+# ----------
+# fsm does a vacuum, and running it in parallel seems to prevent heap truncation.
+# ----------
+test: fsm
+

It is not clear to me from the comment why running it in parallel
prevents heap truncation; can you explain what behavior you are seeing
and what makes you think that running it in parallel caused it?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
>>
>> I've added an additional regression test for finding the right block
>> and removed a test I thought was redundant. I've kept the test file in
>> its own schedule.
>>
>
> +# ----------
> +# fsm does a vacuum, and running it in parallel seems to prevent heap truncation.
> +# ----------
> +test: fsm
> +
>
> It is not clear to me from the comment why running it in parallel
> prevents heap truncation, can you explain what behavior are you seeing
> and what makes you think that running it in parallel caused it?

One of the tests deletes all records from the relation and vacuums. In
serial schedule, the heap and FSM are truncated; in parallel they are
not. Make check fails, since the tests measure relation size.
Taking a closer look, I'm even more alarmed to discover that vacuum
doesn't even seem to remove deleted rows in parallel schedule (that
was in the last test I added), which makes no sense and causes that
test to fail. I looked in vacuum.sql for possible clues, but didn't
see any. I'll have to investigate further.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
> >>
> >> I've added an additional regression test for finding the right block
> >> and removed a test I thought was redundant. I've kept the test file in
> >> its own schedule.
> >>
> >
> > +# ----------
> > +# fsm does a vacuum, and running it in parallel seems to prevent heap truncation.
> > +# ----------
> > +test: fsm
> > +
> >
> > It is not clear to me from the comment why running it in parallel
> > prevents heap truncation, can you explain what behavior are you seeing
> > and what makes you think that running it in parallel caused it?
>
> One of the tests deletes all records from the relation and vacuums. In
> serial schedule, the heap and FSM are truncated; in parallel they are
> not. Make check fails, since the tests measure relation size.
> Taking a closer look, I'm even more alarmed to discover that vacuum
> doesn't even seem to remove deleted rows in parallel schedule (that
> was in the last test I added), which makes no sense and causes that
> test to fail. I looked in vacuum.sql for possible clues, but didn't
> see any.
>

I couldn't resist the temptation to figure out what's going on here.
The newly added tests have deletes followed by vacuum and then you
check whether the vacuum has removed the data by checking the heap
and/or FSM size.  Now, when you run such a test in parallel, the vacuum can
sometimes skip removing the rows because there are parallel
transactions open which can see the deleted rows.  You can easily
verify this phenomenon by running the newly added tests in one session
in psql when there is another parallel session which has an open
transaction.  For example:

Session-1
Begin;
Insert into foo values(1);

Session-2
\i fsm.sql

Now, you should see results similar to what you saw when you ran the
fsm test by adding it to one of the parallel groups.  Can you test this
at your end and confirm whether my analysis is correct or not?

So, you can keep the test as you have in parallel_schedule, but
comment needs to be changed.  Also, you need to add the new test in
serial_schedule.  I have done both the changes in the attached patch,
kindly confirm if this looks correct to you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/8/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote:
> I couldn't resist the temptation to figure out what's going on here.
> The newly added tests have deletes followed by vacuum and then you
> check whether the vacuum has removed the data by checking heap and or
> FSM size.  Now, when you run such a test in parallel, the vacuum can
> sometimes skip removing the rows because there are parallel
> transactions open which can see the deleted rows.

Ah yes, of course.

> You can easily
> verify this phenomenon by running the newly added tests in one session
> in psql when there is another parallel session which has an open
> transaction.  For example:
>
> Session-1
> Begin;
> Insert into foo values(1);
>
> Session-2
> \i fsm.sql
>
> Now, you should see the results similar to what you are seeing when
> you ran the fsm test by adding it to one of the parallel group.  Can
> you test this at your end and confirm whether my analysis is correct
> or not.

Yes, I see the same behavior.

> So, you can keep the test as you have in parallel_schedule, but
> comment needs to be changed.  Also, you need to add the new test in
> serial_schedule.  I have done both the changes in the attached patch,
> kindly confirm if this looks correct to you.

Looks good to me. I'll just note that the new line in the serial
schedule has an extra space at the end. Thanks for looking into this.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 4. You have mentioned above that "system catalogs created during
> bootstrap still have a FSM if they have any data." and I can also see
> this behavior, have you investigated this point further?

I found the cause of this. There is some special-case code in md.c to
create any file if it's opened in bootstrap mode. I removed this and a
similar special case (attached), and make check still passes. After
digging through the history, I'm guessing this has been useless code
since about 2001, when certain special catalogs were removed.
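
(Roughly, the special case is of this shape -- a paraphrase, not the
exact md.c code:

    fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
    if (fd < 0)
    {
        /* During bootstrap, create the file if it doesn't exist yet. */
        if (IsBootstrapProcessingMode())
            fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
    }

so any fork, including the FSM, gets created on first open in bootstrap
mode.)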

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Dec 13, 2018 at 3:18 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 4. You have mentioned above that "system catalogs created during
> > bootstrap still have a FSM if they have any data." and I can also see
> > this behavior, have you investigated this point further?
>
> I found the cause of this. There is some special-case code in md.c to
> create any file if it's opened in bootstrap mode. I removed this and a
> similar special case (attached), and make check still passes. After
> digging through the history, I'm guessing this has been useless code
> since about 2001, when certain special catalogs were removed.
>

Good finding, but I think it is better to discuss this part
separately.  I have started a new thread for this issue [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1KsET6sotf%2BrzOTQfb83pzVEzVhbQi1nxGFYVstVWXUGw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Mithun Cy
Date:
On Sat, Dec 8, 2018 at 6:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote:
> >
> > On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
> > >>
> > >> I've added an additional regression test for finding the right block

I ran some performance tests on the latest patch v11, and I see a small regression in the execution time of the COPY statement. The tests I used are the same as provided in [1], except that I ran them for fill factors 20 and 70. Here are my results.

Machine: cthulhu (Intel-based, 8-node NUMA machine)
Server settings are default, configured with HEAP_FSM_CREATION_THRESHOLD = 4.
The entire data directory was on HDD.

Results are the execution time (in ms) taken by the COPY statement when the number of records is exactly the number that fits in HEAP_FSM_CREATION_THRESHOLD = 4 pages. For fill factor 20 that is up to tid (3, 43), and for fill factor 70 up to tid (3, 157). Each result is the median of 10 runs.

Fill factor 20
Tables    Base       Patch      % increase in execution time
500       121.97     125.315    2.7424776584
1000      246.592    253.789    2.9185861666

Fill factor 70
Tables    Base       Patch      % increase in execution time
500       211.502    217.128    2.6600221275
1000      420.309    432.606    2.9257046601

So there is a consistent 2-3% regression, and on every run the execution time for patch v11 is slightly higher than base. I also tried inserting more records, up to 8 pages, and the same regression is observed! So I guess even HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!

[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com

--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Results are execution time(unit ms) taken by copy statement when number of
> records  equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4
> pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70
> till tid (3, 157). Result is taken as a median of 10 runs.

> So 2-3% consistent regression, And on every run I can see for patch v11
> execution time is slightly more than base.

Thanks for testing!

> I also tried to insert more
> records till 8 pages and same regression is observed! So I guess even
> HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!

That's curious, because once the table exceeds the threshold, it would
be allowed to update the FSM, and in the process write 3 pages that it
didn't have to in the 4 page test. The master branch has the FSM
already, so I would expect the 8 page case to regress more.

What I can do later is provide a supplementary patch to go on top of
mine that only checks the last block. If that improves performance,
I'll alter my patch to only check every other page.

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote:
>
> On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > Results are execution time(unit ms) taken by copy statement when number of
> > records  equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4
> > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70
> > till tid (3, 157). Result is taken as a median of 10 runs.
>
> > So 2-3% consistent regression, And on every run I can see for patch v11
> > execution time is slightly more than base.
>

Have you by any chance checked at fill factor 80 or 100?

> Thanks for testing!
>
> > I also tried to insert more
> > records till 8 pages and same regression is observed! So I guess even
> > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!
>
> That's curious, because once the table exceeds the threshold, it would
> be allowed to update the FSM, and in the process write 3 pages that it
> didn't have to in the 4 page test. The master branch has the FSM
> already, so I would expect the 8 page case to regress more.
>

It is not clear to me why you think there should be a regression at 8
pages when HEAP_FSM_CREATION_THRESHOLD is 4.  Basically, once the FSM
starts getting updated, we should be the same as HEAD, as it won't
take any special path?

> What I can do later is provide a supplementary patch to go on top of
> mine that only checks the last block. If that improves performance,
> I'll alter my patch to only check every other page.
>

Sure, but I guess first we should try to see what exactly is slowing
things down via a perf report.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Mithun Cy
Date:
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
> On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and
> > GetPageWithFreeSpace whereas the change we have discussed was specific
> > to GetPageWithFreeSpace, so not sure if we need any change in
> > fsm_local_set.

I have some minor comments on the pg_upgrade patch.
1. Now we stat() the main fork file in transfer_relfile():
+        sret = stat(old_file, &statbuf);

+        /* Save the size of the first segment of the main fork. */
+        if (type_suffix[0] == '\0' && segno == 0)
+            first_seg_size = statbuf.st_size;

But we do not handle the case where stat() returns an error!

2. src/bin/pg_upgrade/pg_upgrade.h

     char       *relname;
+
+    char        relkind;        /* relation relkind -- see pg_class.h */

I think we can remove the added empty line.

-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Mithun Cy
Date:
Thanks,

On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote:
> On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > Results are execution time(unit ms) taken by copy statement when number of
> > records  equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4
> > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70
> > till tid (3, 157). Result is taken as a median of 10 runs.
>
> > So 2-3% consistent regression, And on every run I can see for patch v11
> > execution time is slightly more than base.
>
> Thanks for testing!
>
> > I also tried to insert more
> > records till 8 pages and same regression is observed! So I guess even
> > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!
>
> That's curious, because once the table exceeds the threshold, it would
> be allowed to update the FSM, and in the process write 3 pages that it
> didn't have to in the 4 page test. The master branch has the FSM
> already, so I would expect the 8 page case to regress more.

I tested with HEAP_FSM_CREATION_THRESHOLD = 4 and simply inserted up
to 8 blocks to see whether the regression carries over to further
inserts.

> What I can do later is provide a supplementary patch to go on top of
> mine that only checks the last block. If that improves performance,
> I'll alter my patch to only check every other page.

Running callgrind for the same test shows the stats below.

Before patch
============
Number of calls    Function name
2000               heap_multi_insert
2000               RelationGetBufferForTuple
3500               ReadBufferBI

After patch
===========
Number of calls    Function name
2000               heap_multi_insert
2000               RelationGetBufferForTuple
5000               ReadBufferBI

I guess the increase in ReadBufferBI() calls might be what is causing
the regression. Sorry, I have not investigated it further. I will
check the same with your next patch!

-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com

Attachment


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Jan 4, 2019 at 8:23 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> Thanks,
>
> On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote:
> > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > > Results are execution time(unit ms) taken by copy statement when number of
> > > records  equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4
> > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70
> > > till tid (3, 157). Result is taken as a median of 10 runs.
> >
> > > So 2-3% consistent regression, And on every run I can see for patch v11
> > > execution time is slightly more than base.
> >
> > Thanks for testing!
> >
> > > I also tried to insert more
> > > records till 8 pages and same regression is observed! So I guess even
> > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!
> >
> > That's curious, because once the table exceeds the threshold, it would
> > be allowed to update the FSM, and in the process write 3 pages that it
> > didn't have to in the 4 page test. The master branch has the FSM
> > already, so I would expect the 8 page case to regress more.
>
> I tested with configuration HEAP_FSM_CREATION_THRESHOLD = 4 and just
> tried to insert till 8 blocks to see if regression is carried on with
> further inserts.
>
> > What I can do later is provide a supplementary patch to go on top of
> > mine that only checks the last block. If that improves performance,
> > I'll alter my patch to only check every other page.
>
> Running callgrind for same test shows below stats
> Before patch
> ==========
> Number of calls                function_name
> 2000                                 heap_multi_insert
> 2000                                 RelationGetBufferForTuple
> 3500                                 ReadBufferBI
>
> After Patch
> =========
> Number of calls                function_name
> 2000                                 heap_multi_insert
> 2000                                 RelationGetBufferForTuple
> 5000                                 ReadBufferBI
>
> I guess Increase in ReadBufferBI() calls might be the reason which is
> causing regression. Sorry I have not investigated it.
>

I think the reason is that we are checking each block when the
relation has no more than HEAP_FSM_CREATION_THRESHOLD blocks.  Even
though all the blocks are in memory, there is some cost to checking
them all.  OTOH, without the patch, even though it accesses the FSM,
it won't have to make so many in-memory block reads.
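
To illustrate the cost, here is a minimal sketch of that kind of
per-block probing (not the patch's actual code; I am only using the
generic buffer-manager APIs here):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* Probe every heap block directly when there is no FSM. */
static BlockNumber
probe_blocks_without_fsm(Relation rel, Size spaceNeeded)
{
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer      buf = ReadBuffer(rel, blkno);
        Size        freespace;

        /* Each iteration pays for a buffer lookup and pin, even on a cache hit. */
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        freespace = PageGetFreeSpace(BufferGetPage(buf));
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buf);

        if (freespace >= spaceNeeded)
            return blkno;
    }

    return InvalidBlockNumber;  /* no usable block; caller extends */
}

That per-block buffer traffic is consistent with the extra
ReadBufferBI() calls in your callgrind profile.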

BTW, have you checked fill factor 80 or 100 as suggested last time?

> I will check
> same with your next patch!
>

Yeah, that makes sense.  John, can you provide a patch on top of the
current patch where we check either the last block or every other
block?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 1/3/19, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yeah, that makes sense, John, can you provide a patch on top of the
> current patch where we check either the last block or every other
> block.

I've attached two patches for testing. Each one applies on top of the
current patch.

Mithun, I'll respond to your other review comments later this week.

-John Naylor

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Sun, Dec 30, 2018 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote:
> > > I also tried to insert more
> > > records till 8 pages and same regression is observed! So I guess even
> > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!
> >
> > That's curious, because once the table exceeds the threshold, it would
> > be allowed to update the FSM, and in the process write 3 pages that it
> > didn't have to in the 4 page test. The master branch has the FSM
> > already, so I would expect the 8 page case to regress more.
> >
>
> It is not clear to me why you think there should be regression at 8
> pages when HEAP_FSM_CREATION_THRESHOLD is 4.  Basically, once FSM
> starts getting updated, we should be same as HEAD as it won't take any
> special path?

In this particular test, the FSM is already created ahead of time for
the master branch, so we can compare accessing FSM versus checking
every page. My reasoning is that passing the threshold would take some
time to create 3 FSM pages with the patch, leading to a larger
regression. It seems we don't observe this, however.

On Sun, Dec 30, 2018 at 10:59 PM Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> I have some minor comments for pg_upgrade patch
> 1. Now we call stat main fork file in transfer_relfile()
> +        sret = stat(old_file, &statbuf);
>
> +        /* Save the size of the first segment of the main fork. */
> +        if (type_suffix[0] == '\0' && segno == 0)
> +            first_seg_size = statbuf.st_size;
>
> But we do not handle the case if stat has returned any error!

How about this:

/* Did file open fail? */
if (stat(old_file, &statbuf) != 0)
{
    /* Extent, fsm, or vm does not exist?  That's OK, just return */
    if (errno == ENOENT &&
        (type_suffix[0] != '\0' || segno != 0))
        return first_seg_size;
    else
        pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\" to \"%s\"): %s\n",
                 map->nspname, map->relname, old_file, new_file,
                 strerror(errno));
}

/* Save the size of the first segment of the main fork. */
else if (type_suffix[0] == '\0' && segno == 0)
    first_seg_size = statbuf.st_size;

/* If extent, fsm, or vm is empty, just return */
else if (statbuf.st_size == 0)
    return first_seg_size;

> 2. src/bin/pg_upgrade/pg_upgrade.h
>
>      char       *relname;
> +
> +    char        relkind;        /* relation relkind -- see pg_class.h */
>
> I think we can remove the added empty line.

In the full context:

-    /* the rest are used only for logging and error reporting */
+
+    /* These are used only for logging and error reporting. */
     char       *nspname;        /* namespaces */
     char       *relname;
+
+    char        relkind;        /* relation relkind -- see pg_class.h */

Relkind is not used for logging or error reporting, so the space sets
it apart from the previous members. I could instead put relkind before
those other two...

-John Naylor


Re: WIP: Avoid creation of the free space map for small tables

From
Mithun Cy
Date:
Hi John Naylor,
On Tue, Jan 8, 2019 at 2:27 AM John Naylor <jcnaylor@gmail.com> wrote:
> I've attached two patches for testing. Each one applies on top of the
> current patch.

Thanks for the patch. I did a quick test of both patches, using the
same tests as in [1], now for fill factors 20, 70, and 100. (Note:
with HEAP_FSM_CREATION_THRESHOLD = 4, the highest tid inserted is
(3, 43) for fill factor 20, (3, 157) for fill factor 70, and (3, 225)
for fill factor 100, so exactly 4 pages are used.)

Machine: cthulhu, same as before [2], with default server settings.
Test: COPY command as in [1], for 500 tables.

Fill factor 20
                        execution time in ms    % increase in execution time
Base                    119.238
v11-all-pages           121.974                 2.2945705228
v11-Every-other-page    114.455                 -4.0113051209
v11-last-page           113.573                 -4.7510021973

Fill factor 70
                        execution time in ms    % increase in execution time
Base                    209.991
v11-all-pages           211.076                 0.5166888105
v11-Every-other-page    206.476                 -1.6738812616
v11-last-page           203.591                 -3.0477496655

Fill factor 100
                        execution time in ms    % increase in execution time
Base                    269.691
v11-all-pages           270.078                 0.1434975583
v11-Every-other-page    262.691                 -2.5955630703
v11-last-page           260.293                 -3.4847288193

Observations:
1. Execution times of both base and the v11-all-pages patch have
improved compared to my earlier results [2], but v11-all-pages is
still slightly behind base.
2. The v11-Every-other-page and v11-last-page patches improve
performance over base.
3. IMHO, v11-Every-other-page would be the ideal one to consider: it
improves performance, and to an extent it also avoids relation
extension when space is already available.

[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD__Ouj%3Dat4hy2wYidK90v92qSRLjU%2BQe4y-PwfjLLeGkhc6ZA%40mail.gmail.com

-- 
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> 2. v11-Every-other-page and v11-last-page patches improve the
> performance from base.
> 3. IMHO v11-Every-other-page would be ideal to consider it improves
> the performance and also to an extent avoid expansion if space is
> already available.

Good to hear. I'll clean up the every-other-page patch and include it
in my next version.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 9, 2019 at 9:00 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > 2. v11-Every-other-page and v11-last-page patches improve the
> > performance from base.
> > 3. IMHO v11-Every-other-page would be ideal to consider it improves
> > the performance and also to an extent avoid expansion if space is
> > already available.
>

Thanks, Mithun, for the performance testing; it really helps us choose
the right strategy here.  Once John provides the next version, it
would be good to see the results of regular pgbench (read-write) runs
(say at scale factors 50 and 300) and of a large COPY.  I don't think
there will be any problem, but we should double-check that.

> Good to hear. I'll clean up the every-other-page patch and include it
> in my next version.
>

+1.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Thanks, Mithun for performance testing, it really helps us to choose
> the right strategy here.  Once John provides next version, it would be
> good to see the results of regular pgbench (read-write) runs (say at
> 50 and 300 scale factor) and the results of large copy.  I don't think
> there will be any problem, but we should just double check that.

Attached is v12 using the alternating-page strategy. I've updated the
comments and README as needed. In addition, I've

-handled a possible stat() call failure during pg_upgrade
-added one more assertion
-moved the new README material into a separate paragraph
-added a comment to FSMClearLocalMap() about transaction abort
-corrected an outdated comment that erroneously referred to extension
rather than creation
-fleshed out the draft commit messages

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Mithun Cy
Date:
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks, Mithun for performance testing, it really helps us to choose
> > the right strategy here.  Once John provides next version, it would be
> > good to see the results of regular pgbench (read-write) runs (say at
> > 50 and 300 scale factor) and the results of large copy.  I don't think
> > there will be any problem, but we should just double check that.
>
> Attached is v12 using the alternating-page strategy. I've updated the
> comments and README as needed. In addition, I've

Below are my performance tests and numbers
Machine : cthulhu

Tests and setups
Server settings:
max_connections = 200
shared_buffers=8GB
checkpoint_timeout =15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
min_wal_size=15GB and max_wal_size=20GB.

pgbench settings:
-----------------------
read-write settings (TPCB like tests)
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres

scale factor 50 -- median of 3 TPS
clients    v12-patch       base patch      % diff
1          826.081588      834.328238      -0.9884179421
16         10805.807081    10800.662805    0.0476292621
32         19722.277019    19641.546628    0.4110185034
64         30232.681889    30263.616073    -0.1022157561

scale factor 300 -- median of 3 TPS
clients    v12-patch       base patch      % diff
1          813.646062      822.18648       -1.038744641
16         11379.028702    11277.05586     0.9042505709
32         21688.084093    21613.044463    0.3471960192
64         36288.85711     36348.6178      -0.1644098005


Copy command
Test: setup
 ./psql -d postgres -c "COPY pgbench_accounts TO '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv"
 ./psql -d postgres -c "CREATE UNLOGGED TABLE pgbench_accounts_ulg (LIKE pgbench_accounts) WITH (fillfactor = 100);"
Test run:
TRUNCATE TABLE pgbench_accounts_ulg;
\timing
COPY pgbench_accounts_ulg FROM '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv;
\timing

execution time in ms (scale factor indicates the size of pgbench_accounts)
scale factor    v12-patch    base patch    % diff
300             77166.407    77862.041     -0.8934186557
50              13329.233    13284.583     0.3361038882

So the large-table tests do not show any considerable performance variance from the base code!



--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks, Mithun for performance testing, it really helps us to choose
> > the right strategy here.  Once John provides next version, it would be
> > good to see the results of regular pgbench (read-write) runs (say at
> > 50 and 300 scale factor) and the results of large copy.  I don't think
> > there will be any problem, but we should just double check that.
>
> Attached is v12 using the alternating-page strategy. I've updated the
> comments and README as needed. In addition, I've
>

Few comments:
---------------------------
1.
Commit message:
> Any pages with wasted free space become visible at next relation extension, so we still control table bloat.

I think the free space will be available after the next pass of
vacuum, no?  How can relation extension make it available?

2.
+2. For very small heap relations, the FSM would be relatively large and
+wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for
+heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space
+and to improve performance.  To locate free space in this case, we simply
+iterate over the heap, trying alternating pages in turn.  There may be some
+wasted free space in this case, but it becomes visible again upon next
+relation extension.

a. Again, how does space become available at next relation extension?
b. I think there is no use in mentioning the version number in the
above comment; this code will be present from PG 12 onwards, so one
can find out in which version this optimization was added.


3.
BlockNumber
 RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
    Size oldSpaceAvail, Size spaceNeeded)
{
..
+ /* First try the local map, if it exists. */
+ if (oldPage < fsm_local_map.nblocks)
+ {
..
}

The comment doesn't appear to be completely in sync with the code.
Can't we just check whether "fsm_local_map.nblocks > 0"?  If so, we
can use a macro for it.  I have changed this in the attached patch;
see what you think about it.  I have used it in a few other places as
well.
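
For concreteness, the macro can be as simple as the below sketch (the
exact spelling is discussed further down-thread):

#define FSM_LOCAL_MAP_EXISTS    (fsm_local_map.nblocks > 0)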

4.
+ * When we initialize the map, the whole heap is potentially available to
+ * try.  If a caller wanted to reset the map after another backend extends
+ * the relation, this will only flag new blocks as available.  No callers
+ * do this currently, however.
+ */
+static void
+fsm_local_set(Relation rel, BlockNumber curr_nblocks)
{
..
+ if (blkno >= fsm_local_map.nblocks + 2)
..
}


The way you have tried to support the case quoted in the comment
"If a caller wanted to reset the map after another backend extends .."
doesn't appear to be solid, and I am not sure if it is correct either.
We don't have any way to test it, so I suggest we simplify the case
w.r.t. the current requirement of this API.  I think we should use
some simple logic to try every other block, like:

+ blkno = cur_nblocks - 1;
+ while (true)
+ {
+     fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
+     if (blkno >= 2)
+         blkno -= 2;
+     else
+         break;
+ }

I have changed this in the attached patch.

5.
+/*
+ * Search the local map for an available block to try, in descending order.
+ *
+ * For use when there is no FSM.
+ */
+static BlockNumber
+fsm_local_search(void)

We should give a brief explanation as to why we try in descending
order.  I have added some explanation in the attached patch; see what
you think about it.
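
As a rough sketch of that search (simplified, not the exact patch
code; it assumes the map was built with the loop above):

static BlockNumber
fsm_local_search(void)
{
    BlockNumber blkno = fsm_local_map.nblocks;

    /* Walk the map from the highest block number down to block 0. */
    while (blkno > 0)
    {
        blkno--;
        if (fsm_local_map.map[blkno] == FSM_LOCAL_AVAIL)
            return blkno;
    }

    return InvalidBlockNumber;  /* nothing left to try; caller extends */
}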

Apart from the above, I have modified a few comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 16, 2019 at 9:25 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>
> On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Thanks, Mithun for performance testing, it really helps us to choose
> > > the right strategy here.  Once John provides next version, it would be
> > > good to see the results of regular pgbench (read-write) runs (say at
> > > 50 and 300 scale factor) and the results of large copy.  I don't think
> > > there will be any problem, but we should just double check that.
> >
> > Attached is v12 using the alternating-page strategy. I've updated the
> > comments and README as needed. In addition, I've
>
>
> execution time in ms. (scale factor indicates size of pgbench_accounts)
> scale factor       v12-patch        base patch       % diff
> 300                   77166.407       77862.041     -0.8934186557
> 50                     13329.233      13284.583       0.3361038882
>
> So for large table tests do not show any considerable performance variance from base code!
>

I think with these results, we can conclude this patch doesn't seem to
have any noticeable regression for all the tests we have done, right?
Thanks a lot for doing various performance tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> 1.
> Commit message:
> > Any pages with wasted free space become visible at next relation extension, so we still control table bloat.
>
> I think the free space will be available after the next pass of
> vacuum, no?  How can relation extension make it available?

To explain, this diagram shows the map as it looks for different small
table sizes:

block:   0123
1 page:  A
2 pages: NA
3 pages: ANA
4 pages: NANA

(A = block is in the local map and will be tried, N = block is skipped)

So for a 3-block table, the alternating strategy never checks block 1.
Any free space block 1 has acquired via delete-and-vacuum will become
visible if it extends to 4 blocks. We are accepting a small amount of
bloat for improved performance, as discussed. Would it help to include
this diagram in the README?
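
For reference, the every-other-block loop from your earlier mail
produces exactly this pattern. A simplified sketch (not the exact
patch code; it assumes cur_nblocks >= 1 and a cleared map):

static void
fsm_local_set_sketch(BlockNumber cur_nblocks)
{
    BlockNumber blkno = cur_nblocks - 1;

    /* Mark the last block, then every other block below it. */
    for (;;)
    {
        fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
        if (blkno >= 2)
            blkno -= 2;
        else
            break;
    }

    fsm_local_map.nblocks = cur_nblocks;
}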

> 2.
> +2. For very small heap relations, the FSM would be relatively large and
> +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for
> +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space
> +and to improve performance.  To locate free space in this case, we simply
> +iterate over the heap, trying alternating pages in turn.  There may be some
> +wasted free space in this case, but it becomes visible again upon next
> +relation extension.
>
> a. Again, how space becomes available at next relation extension.
> b. I think there is no use of mentioning the version number in the
> above comment, this code will be present from PG-12, so one can find
> out from which version this optimization is added.

It fits with the reference to PG 8.4 earlier in the document. I chose
to be consistent, but to be honest, I'm not much in favor of a lot of
version references in code/READMEs.

> 3.
> BlockNumber
>  RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
>     Size oldSpaceAvail, Size spaceNeeded)
> {
> ..
> + /* First try the local map, if it exists. */
> + if (oldPage < fsm_local_map.nblocks)
> + {
> ..
> }
>
> The comment doesn't appear to be completely in sync with the code.
> Can't we just check whether "fsm_local_map.nblocks >  0", if so, we
> can use a macro for the same? I have changed this in the attached
> patch, see what you think about it.  I have used it at a few other
> places as well.

The macro adds clarity, so I'm in favor of using it.

> 4.
> + * When we initialize the map, the whole heap is potentially available to
> + * try.  If a caller wanted to reset the map after another backend extends
> + * the relation, this will only flag new blocks as available.  No callers
> + * do this currently, however.
> + */
> +static void
> +fsm_local_set(Relation rel, BlockNumber curr_nblocks)
> {
> ..
> + if (blkno >= fsm_local_map.nblocks + 2)
> ..
> }
>
>
> The way you have tried to support the case as quoted in the comment
> "If a caller wanted to reset the map after another backend extends .."
> doesn't appear to be solid and I am not sure if it is correct either.

I removed this case in v9 and you objected to that as unnecessary, so
I reverted it for v10.

> We don't have any way to test the same, so I suggest let's try to
> simplify the case w.r.t current requirement of this API.  I think we
> should
> some simple logic to try every other block like:
>
> + blkno = cur_nblocks - 1;
> + while (true)
> + {
> + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> + if (blkno >= 2)
> + blkno -= 2;
> + else
> + break;
> + }
>
> I have changed this in the attached patch.

Fine by me.

> 5.
> +/*
> + * Search the local map for an available block to try, in descending order.
> + *
> + * For use when there is no FSM.
> + */
> +static BlockNumber
> +fsm_local_search(void)
>
> We should give a brief explanation as to why we try in descending
> order.  I have added some explanation in the attached patch, see what
> you think about it?
>
> Apart from the above, I have modified a few comments.

I'll include these with some grammar corrections in the next version.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 16, 2019 at 11:40 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
> On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > can use a macro for the same? I have changed this in the attached
> > patch, see what you think about it.  I have used it at a few other
> > places as well.
>
> The macro adds clarity, so I'm in favor of using it.

It just occurred to me that the style FSM_LOCAL_MAP_EXISTS seems more
common for macros that refer to constants, and FSMLocalMapExists for
expressions, but I've only seen a small amount of the code base. Do we
have a style preference here, or is it more a matter of matching the
surrounding code?

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Tom Lane
Date:
John Naylor <john.naylor@2ndquadrant.com> writes:
> It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more
> common for macros that refer to constants, and FSMLocalMapExists for
> expressions, but I've only seen a small amount of the code base. Do we
> have a style preference here, or is it more a matter of matching the
> surrounding code?

I believe there's a pretty longstanding tradition in C coding to use
all-caps names for macros representing constants.  Some people think
that goes for all macros period, but I'm not on board with that for
function-like macros.

Different parts of the PG code base make different choices between
camel-case and underscore-separation for multiword function names.
For that, I'd say match the style of nearby code.

            regards, tom lane


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 16, 2019 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> John Naylor <john.naylor@2ndquadrant.com> writes:
> > It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more
> > common for macros that refer to constants, and FSMLocalMapExists for
> > expressions, but I've only seen a small amount of the code base. Do we
> > have a style preference here, or is it more a matter of matching the
> > surrounding code?
>

I am fine with the style (FSMLocalMapExists) you are suggesting, but
see the similar macros in nearby code like:

#define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4)

I think the above is not an exact match.  So, I have looked around and
found a few other macros which serve a somewhat similar purpose; see
below:

#define ATT_IS_PACKABLE(att) \
((att)->attlen == -1 && (att)->attstorage != 'p')

#define VARLENA_ATT_IS_PACKABLE(att) \
((att)->attstorage != 'p')

#define CHECK_REL_PROCEDURE(pname)

#define SPTEST(f, x, y) \
DatumGetBool(DirectFunctionCall2(f, PointPGetDatum(x), PointPGetDatum(y)))

> I believe there's a pretty longstanding tradition in C coding to use
> all-caps names for macros representing constants.  Some people think
> that goes for all macros period, but I'm not on board with that for
> function-like macros.
>
> Different parts of the PG code base make different choices between
> camel-case and underscore-separation for multiword function names.
> For that, I'd say match the style of nearby code.
>

Yes, that is what we normally do.  However, in some cases, we might
need to refer to other places as well, which I think is the case here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 16, 2019 at 10:10 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > 1.
> > Commit message:
> > > Any pages with wasted free space become visible at next relation extension, so we still control table bloat.
> >
> > I think the free space will be available after the next pass of
> > vacuum, no?  How can relation extension make it available?
>
> To explain, this diagram shows the map as it looks for different small
> table sizes:
>
> 0123
> A
> NA
> ANA
> NANA
>
> So for a 3-block table, the alternating strategy never checks block 1.
> Any free space block 1 has acquired via delete-and-vacuum will become
> visible if it extends to 4 blocks. We are accepting a small amount of
> bloat for improved performance, as discussed. Would it help to include
> this diagram in the README?
>

Yes, I think it would be good if you could explain the concept of the
local map with the help of this example.

> > 2.
> > +2. For very small heap relations, the FSM would be relatively large and
> > +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for
> > +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space
> > +and to improve performance.  To locate free space in this case, we simply
> > +iterate over the heap, trying alternating pages in turn.  There may be some
> > +wasted free space in this case, but it becomes visible again upon next
> > +relation extension.
> >
> > a. Again, how space becomes available at next relation extension.
> > b. I think there is no use of mentioning the version number in the
> > above comment, this code will be present from PG-12, so one can find
> > out from which version this optimization is added.
>
> It fits with the reference to PG 8.4 earlier in the document. I chose
> to be consistent, but to be honest, I'm not much in favor of a lot of
> version references in code/READMEs.
>

Then let's not add a reference to the version number in this case.  I
don't see much advantage in adding the version number here.

>
> I'll include these with some grammar corrections in the next version.
>

Okay, thanks!

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Yes, I think it would be good if you can explain the concept of
> local-map with the help of this example.

> Then let's not add a reference to the version number in this case.  I

Okay, done in v14. I kept your spelling of the new macro. One minor
detail added: use uint8 rather than char for the local map array. This
seems to be preferred, especially in this file.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 17, 2019 at 11:13 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yes, I think it would be good if you can explain the concept of
> > local-map with the help of this example.
>
> > Then let's not add a reference to the version number in this case.  I
>
> Okay, done in v14. I kept your spelling of the new macro. One minor
> detail added: use uint8 rather than char for the local map array. This
> seems to be preferred, especially in this file.
>

I am fine with your change.

Few more comments:
1.
I think we should not allow creation of the FSM for toast tables
either, until their size reaches HEAP_FSM_CREATION_THRESHOLD.  If you
try the test below, you can see that the FSM will be created for the
toast table even if the toast relation is just 1 page.

CREATE OR REPLACE FUNCTION random_text(length INTEGER)
RETURNS TEXT
LANGUAGE SQL
AS $$ SELECT string_agg(chr(32 + (random() * 96)::int), '') FROM generate_series(1, length); $$;

CREATE TABLE tt(c1 int, c2 text);
INSERT INTO tt VALUES (1, random_text(2500));
VACUUM tt;

I have fixed this in the attached patch; kindly verify it and see if
you can add a test for it as well.

2.
-CREATE TABLE test1 (a int, b int);
-INSERT INTO test1 VALUES (16777217, 131584);
+CREATE TABLE test_rel_forks (a int);
+-- Make sure there are enough blocks in the heap for the FSM to be created.
+INSERT INTO test_rel_forks SELECT g from generate_series(1,10000) g;

-VACUUM test1;  -- set up FSM
+-- set up FSM and VM
+VACUUM test_rel_forks;

This test will create 45 pages instead of 1.  I know that to create
the FSM we now need more than 4 pages, but 45 seems to be on the high
side.  I think we should not unnecessarily populate more data if there
is no particular need for it; let's restrict the number of pages to 5
if possible.

3.
-SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1;
- fsm_1
--------
-  8192
-(1 row)
-
-SELECT octet_length(get_raw_page('test1', 'vm', 0)) AS vm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
+ERROR:  block number 10 is out of range for relation "test_rel_forks"

Why have you changed the test definition here?  Previously the test
checked an existing FSM page, but now it tries to access an
out-of-range page.

Apart from the above, I have changed one sentence in README.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Sat, Jan 19, 2019 at 8:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 17, 2019 at 11:13 PM John Naylor
> Few more comments:
> 1.
> I think we should not allow to create FSM for toast tables as well
> till there size reaches HEAP_FSM_CREATION_THRESHOLD.  If you try below
> test, you can see that FSM will be created for the toast table even if
> the size of toast relation is 1 page.
...
> I have fixed this in the attached patch, kindly verify it once and see
> if you can add the test for same as well.

Works for me. For v16, I've added and tested similar logic to
pg_upgrade and verified that toast tables work the same as normal
tables in recovery. I used a slightly different method to generate the
long random string to avoid creating a function. Also, some cosmetic
adjustments -- I changed the regression test to use 'i' instead of 'g'
to match the use of generate_series in most other tests, and made
capitalization more consistent.

> 2.
> -CREATE TABLE test1 (a int, b int);
> -INSERT INTO test1 VALUES (16777217, 131584);
> +CREATE TABLE test_rel_forks (a
> int);
> +-- Make sure there are enough blocks in the heap for the FSM to be created.
> +INSERT INTO test_rel_forks SELECT g
> from generate_series(1,10000) g;
>
> -VACUUM test1;  -- set up FSM
> +-- set up FSM and VM
> +VACUUM test_rel_forks;
>
> This test will create 45 pages instead of 1.  I know that to create
> FSM, we now need more than 4 pages, but 45 seems to be on the higher
> side.  I think we should not unnecessarily populate more data if there
> is no particular need for it, let's restrict the number of pages to 5
> if possible.

Good idea, done here and in the fsm regression test.

> 3.
> -SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1;
> - fsm_1
> --------
> -  8192
> -(1 row)
> -
> -SELECT octet_length
> (get_raw_page('test1', 'vm', 0)) AS vm_0;
> +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> +ERROR:  block number 10 is out of range for relation "test_rel_forks"
>
> Why have you changed the test definition here?  Previously test checks
> the existing FSM page, but now it tries to access out of range page.

The patch is hard to read here, but I still have a test for the
existing FSM page:

-SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0;
+SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
+ERROR:  block number 100 is out of range for relation "test_rel_forks"
+SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
  fsm_0
 -------
   8192
 (1 row)

I have a test for in-range and out-of-range for each relation fork.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>

Review of v16-0002-During-pg_upgrade-conditionally-skip-transfer-of:

- * Copy/link any fsm and vm files, if they exist
+ * Copy/link any fsm and vm files, if they exist and if they would
+ * be created in the new cluster.
   */
- transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
+ if ((maps[mapnum].relkind != RELKIND_RELATION &&
+      maps[mapnum].relkind != RELKIND_TOASTVALUE) ||
+     first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ ||
+     GET_MAJOR_VERSION(new_cluster.major_version) <= 1100)
+     (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);

So we won't allow transfer of FSM files if their size is below
HEAP_FSM_CREATION_THRESHOLD.  What will the behavior be in link mode?
It seems that the old files will remain there.  Will that create any
problem when we try to create the files via the new server?  Can you
test this case?

Also, another case to think about in this regard is the upgrade of
standby servers; if you read the paragraph below from the user manual
[1], you will see what I am worried about:

"What this does is to record the links created by pg_upgrade's link
mode that connect files in the old and new clusters on the primary
server. It then finds matching files in the standby's old cluster and
creates links for them in the standby's new cluster. Files that were
not linked on the primary are copied from the primary to the standby.
(They are usually small.)"

[1] - https://www.postgresql.org/docs/devel/pgupgrade.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> So we won't allow transfer of FSM files if their size is below
> HEAP_FSM_CREATION_THRESHOLD.  What will be its behavior in link mode?
> It seems that the old files will remain there. Will it create any
> problem when we try to create the files via the new server, can you
> once test this case?

I tried upgrading in --link mode, and on the new cluster, enlarging
the table past the threshold causes a new FSM to be created as
expected.

> Also, another case to think in this regard is the upgrade for standby
> servers, if you read below paragraph from the user manual [1], you
> will see what I am worried about?
>
> "What this does is to record the links created by pg_upgrade's link
> mode that connect files in the old and new clusters on the primary
> server. It then finds matching files in the standby's old cluster and
> creates links for them in the standby's new cluster. Files that were
> not linked on the primary are copied from the primary to the standby.
> (They are usually small.)"
>
> [1] - https://www.postgresql.org/docs/devel/pgupgrade.html

Trying this, I ran into a couple of problems. I'm probably doing
something wrong, but I can't help but think there's a pg_upgrade
bug/feature I'm unaware of:

I set up my test with the primary's data directory in data1 and the
standby's in standby/data1. I instructed pg_upgrade to upgrade data1
into data1u and tried the rsync recipe in the docs quoted above, but
the upgraded standby wouldn't go into recovery. While debugging that,
I found, surprisingly, that pg_upgrade also went further and upgraded
standby/data1 into standby/data1u. I tried deleting standby/data1u
before running the rsync command and still got nothing. Because the
upgraded secondary is non-functional, I can't really answer your
question.

Not sure if this is normal, but the pg_upgraded new cluster no longer
had the replication slot. Re-adding it didn't allow my upgraded
secondary to go into recovery, either. (I made sure to copy the
recovery settings, so that can't be the problem)


--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> I have a test for in-range and out-of-range for each relation fork.
>

I think the first two patches, (a) removal of dead code in bootstrap
and (b) the core patch to avoid creation of the FSM file for small
tables, are good now.  I have prepared the patches along with commit
messages.  There is no change except for some changes in the README
and the commit message of the second patch.  Kindly let me know what
you think about them.

I think these two patches can go in even without the upgrade patch
(during pg_upgrade, conditionally skip transfer of FSMs), which is
still under discussion.  However, I am not in a hurry if you or others
think that the upgrade patch must be committed along with the second
patch.  I think the upgrade patch is generally on track but might need
some more review.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think the first two patches (a) removal of dead code in bootstrap
> and (b) the core patch to avoid creation of FSM file for the small
> table are good now.  I have prepared the patches along with commit
> message.  There is no change except for some changes in README and
> commit message of the second patch.  Kindly let me know what you think
> about them?

Good to hear! The additional language is fine. In "Once the FSM is
created for heap", I would just change that to "...for a heap".

> I think these two patches can go even without the upgrade patch
> (during pg_upgrade, conditionally skip transfer of FSMs.) which is
> still under discussion.  However, I am not in a hurry if you or other
> thinks that upgrade patch must be committed along with the second
> patch.  I think the upgrade patch is generally going on track but
> might need some more review.

The pg_upgrade piece is a nice-to-have feature and not essential, so
it can go in later. Additional review is also welcome.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Also, another case to think in this regard is the upgrade for standby
> servers, if you read below paragraph from the user manual [1], you
> will see what I am worried about?
>
> "What this does is to record the links created by pg_upgrade's link
> mode that connect files in the old and new clusters on the primary
> server. It then finds matching files in the standby's old cluster and
> creates links for them in the standby's new cluster. Files that were
> not linked on the primary are copied from the primary to the standby.
> (They are usually small.)"
>
> [1] - https://www.postgresql.org/docs/devel/pgupgrade.html

I am still not able to get the upgraded standby to go into recovery
without resorting to pg_basebackup, but in another attempt to
investigate your question I tried the following (data1 = old cluster,
data2 = new cluster):


mkdir -p data1 data2 standby

echo 'heap' > data1/foo
echo 'fsm' > data1/foo_fsm

# simulate streaming replication
rsync --archive data1 standby

# simulate pg_upgrade, skipping FSM
ln data1/foo -t data2/

rsync --archive --delete --hard-links --size-only --no-inc-recursive data1 data2 standby

# result
ls standby/data1
ls standby/data2


The result is that foo_fsm is not copied to standby/data2, contrary to
what the docs above imply for other unlinked files. Can anyone shed
light on this?

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 23, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think the first two patches (a) removal of dead code in bootstrap
> > and (b) the core patch to avoid creation of FSM file for the small
> > table are good now.  I have prepared the patches along with commit
> > message.  There is no change except for some changes in README and
> > commit message of the second patch.  Kindly let me know what you think
> > about them?
>
> Good to hear! The additional language is fine. In "Once the FSM is
> created for heap", I would just change that to "...for a heap".
>

Sure.  Apart from this, I have run pgindent on the patches and made
some changes accordingly.  Latest patches attached (only the second
patch has changes).  I will take one more pass on Monday morning (28th
Jan) and will commit unless you or others see any problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Also, another case to think in this regard is the upgrade for standby
> > servers, if you read below paragraph from the user manual [1], you
> > will see what I am worried about?
> >
> > "What this does is to record the links created by pg_upgrade's link
> > mode that connect files in the old and new clusters on the primary
> > server. It then finds matching files in the standby's old cluster and
> > creates links for them in the standby's new cluster. Files that were
> > not linked on the primary are copied from the primary to the standby.
> > (They are usually small.)"
> >
> > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html
>
> I am still not able to get the upgraded standby to go into recovery
> without resorting to pg_basebackup, but in another attempt to
> investigate your question I tried the following (data1 = old cluster,
> data2 = new cluster):
>
>
> mkdir -p data1 data2 standby
>
> echo 'heap' > data1/foo
> echo 'fsm' > data1/foo_fsm
>
> # simulate streaming replication
> rsync --archive data1 standby
>
> # simulate pg_upgrade, skipping FSM
> ln data1/foo -t data2/
>
> rsync --archive --delete --hard-links --size-only --no-inc-recursive
> data1 data2 standby
>
> # result
> ls standby/data1
> ls standby/data2
>
>
> The result is that foo_fsm is not copied to standby/data2, contrary to
> what the docs above imply for other unlinked files. Can anyone shed
> light on this?
>

Is foo_fsm present in standby/data1?  I think what the doc means to
say is that it copies any unlinked files present in the primary's new
cluster (which in your case will be data2).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 24, 2019 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >

Few comments related to pg_upgrade patch:

1.
+ if ((maps[mapnum].relkind != RELKIND_RELATION &&
+ maps[mapnum].relkind != RELKIND_TOASTVALUE) ||
+ first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ ||
+ GET_MAJOR_VERSION(new_cluster.major_version) <= 1100)
+ (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);

I think this check will needlessly be performed for future versions as
well, say when one wants to upgrade from PG12 to PG13.  That might not
create any problem, but let's try to be more precise.  Can you try to
rewrite this check?  You might want to encapsulate it inside a
function.  I have thought of doing something similar to what we do for
vm, see the checks related to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I
guess for this patch it is not important to check the catalog version,
even if someone tries to upgrade to the same version.

2.
transfer_relfile()
{
..
- /* Is it an extent, fsm, or vm file? */
- if (type_suffix[0] != '\0' || segno != 0)
+ /* Did file open fail? */
+ if (stat(old_file, &statbuf) != 0)
..
}

So from now onwards, we will call stat even for the 0th segment, which
means there is one additional system call for each relation.  I'm not
sure if that matters, but I think there is no harm in testing once with
a large number of relations, say 10K to 50K relations which have a FSM.
The other alternative is that we can fetch pg_class.relpages and rely
on that to take this decision, but again, if that is not updated, we
might take the wrong decision.

Does anyone else have any thoughts on this point?

3.
-static void
+static Size
 transfer_relfile(FileNameMap *map, const char *type_suffix, bool
vm_must_add_frozenbit)

If we decide to go with the approach proposed by you, we should add
some comments atop this function about the return value change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > mkdir -p data1 data2 standby
> >
> > echo 'heap' > data1/foo
> > echo 'fsm' > data1/foo_fsm
> >
> > # simulate streaming replication
> > rsync --archive data1 standby
> >
> > # simulate pg_upgrade, skipping FSM
> > ln data1/foo -t data2/
> >
> > rsync --archive --delete --hard-links --size-only --no-inc-recursive
> > data1 data2 standby
> >
> > # result
> > ls standby/data1
> > ls standby/data2
> >
> >
> > The result is that foo_fsm is not copied to standby/data2, contrary to
> > what the docs above imply for other unlinked files. Can anyone shed
> > light on this?
> >
>
> Is foo_fsm present in standby/data1?

Yes it is.

> I think what doc means to say is
> that it copies any unlinked files present in primary's new cluster
> (which in your case will be data2).

In that case, I'm still confused why that doc says, "Unfortunately,
rsync needlessly copies files associated with temporary and unlogged
tables because these files don't normally exist on standby servers."
I fail to see why the primary's new cluster would have these if they
weren't linked. And in the case we're discussing here, the skipped
FSMs won't be on data2, so won't end up in standby/data2.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think what doc means to say is
> > that it copies any unlinked files present in primary's new cluster
> > (which in your case will be data2).
>
> In that case, I'm still confused why that doc says, "Unfortunately,
> rsync needlessly copies files associated with temporary and unlogged
> tables because these files don't normally exist on standby servers."
> I fail to see why the primary's new cluster would have these if they
> weren't linked.
>

Why won't unlogged files be in the primary's new cluster?  After the
upgrade, they should be present in the new cluster if they were present
in the old cluster.

> And in the case we're discussing here, the skipped
> FSMs won't be on data2, so won't end up in standby/data2.
>

Right.  I think we are safe with respect to rsync because I have seen
that we do rewrite the vm files in link mode and rsync will copy them
from the primary's new cluster.

I think you can try to address my other comments on your pg_upgrade
patch.   Once we agree on the code, we need to test below scenarios:
(a) upgrade from all supported versions to the latest version
(b) upgrade standby with and without using rsync.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 1.
> + if ((maps[mapnum].relkind != RELKIND_RELATION &&
> + maps[mapnum].relkind != RELKIND_TOASTVALUE) ||
> + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ ||
> + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100)
> + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
>
> I think this check will needlessly be performed for future versions as
> well, say when wants to upgrade from PG12 to PG13.  That might not
> create any problem, but let's try to be more precise.  Can you try to
> rewrite this check?  You might want to encapsulate it inside a
> function.  I have thought of doing something similar to what we do for
> vm, see checks relate to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I
> guess for this patch it is not important to check catalog version as
> even if someone tries to upgrade to the same version.

Agreed, done for v19 (I've only attached the pg_upgrade patch).

> 2.
> transfer_relfile()
> {
> ..
> - /* Is it an extent, fsm, or vm file? */
> - if (type_suffix[0] != '\0' || segno != 0)
> + /* Did file open fail? */
> + if (stat(old_file, &statbuf) != 0)
> ..
> }
>
> So from now onwards, we will call stat for even 0th segment which
> means there is one additional system call for each relation, not sure
> if that matters, but I think there is no harm in once testing with a
> large number of relations say 10K to 50K relations which have FSM.

Performance testing is probably a good idea anyway, but I went ahead
and implemented your next idea:

> The other alternative is we can fetch pg_class.relpages and rely on
> that to take this decision, but again if that is not updated, we might
> take the wrong decision.

We can think of it this way: Which is worse,
1. Transferring a FSM we don't need, or
2. Skipping a FSM we need

I'd say #2 is worse. So, in v19 we check pg_class.relpages, and if it's
a heap and less than or equal to the threshold, we call stat on the 0th
segment to verify. In the common case, the cost of the stat call is
offset by not linking the FSM. Despite needing another pg_class field,
I think this code is actually easier to read than my earlier versions.

> 3.
> -static void
> +static Size
>  transfer_relfile(FileNameMap *map, const char *type_suffix, bool
> vm_must_add_frozenbit)
>
> If we decide to go with the approach proposed by you, we should add
> some comments atop this function for return value change?

Done, as well as other comment edits.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 24, 2019 at 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I think what doc means to say is
> > > that it copies any unlinked files present in primary's new cluster
> > > (which in your case will be data2).
> >
> > In that case, I'm still confused why that doc says, "Unfortunately,
> > rsync needlessly copies files associated with temporary and unlogged
> > tables because these files don't normally exist on standby servers."
> > I fail to see why the primary's new cluster would have these if they
> > weren't linked.
> >
>
> Why unlogged files won't be in primary's new cluster?  After the
> upgrade, they should be present in a new cluster if they were present
> in the old cluster.

I assume they would be linked, however (I haven't checked this). I did
think rewritten VM files would fall under this, but I was confused
about unlogged files.

> > And in the case we're discussing here, the skipped
> > FSMs won't be on data2, so won't end up in standby/data2.
> >
>
> Right.  I think we are safe with respect to rsync because I have seen
> that we do rewrite the vm files in link mode and rsync will copy them
> from primary's new cluster.

Okay.

> I think you can try to address my other comments on your pg_upgrade
> patch.   Once we agree on the code, we need to test below scenarios:
> (a) upgrade from all supported versions to the latest version
> (b) upgrade standby with and without using rsync.

Sounds good.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> Performance testing is probably a good idea anyway, but I went ahead
> and implemented your next idea:
>
> > The other alternative is we can fetch pg_class.relpages and rely on
> > that to take this decision, but again if that is not updated, we might
> > take the wrong decision.
>
> We can think of it this way: Which is worse,
> 1. Transferring a FSM we don't need, or
> 2. Skipping a FSM we need
>
> I'd say #2 is worse.
>

Agreed.

> So, in v19 we check pg_class.relpages and if it's
> a heap and less than or equal the threshold we call stat on the 0th
> segment to verify.
>

Okay, but the way the logic is implemented appears clumsy to me.

@@ -234,16 +243,40 @@ transfer_relfile(FileNameMap *map, const char
*type_suffix, bool vm_must_add_fro
  {
  /* File does not exist?  That's OK, just return */
  if (errno == ENOENT)
- return;
+ return first_seg_size;
  else
- pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\"
to \"%s\"): %s\n",
- map->nspname, map->relname, old_file, new_file,
- strerror(errno));
+ goto fatal;
  }

  /* If file is empty, just return */
  if (statbuf.st_size == 0)
- return;
+ return first_seg_size;
+ }
+
+ /* Save size of the first segment of the main fork. */
+
+ else if (map->relpages <= HEAP_FSM_CREATION_THRESHOLD &&
+ (map->relkind == RELKIND_RELATION ||
+   map->relkind == RELKIND_TOASTVALUE))
+ {
+ /*
+ * In this case, if pg_class.relpages is wrong, it's possible
+ * that a FSM will be skipped when we actually need it.  To guard
+ * against this, we verify the size of the first segment.
+ */
+ if (stat(old_file, &statbuf) != 0)
+ goto fatal;
+ else
+ first_seg_size = statbuf.st_size;
+ }
+ else
+ {
+ /*
+ * For indexes etc., we don't care if pg_class.relpages is wrong,
+ * since we always transfer their FSMs.  For heaps, we might
+ * transfer a FSM when we don't need to, but this is harmless.
+ */
+ first_seg_size = Min(map->relpages, RELSEG_SIZE) * BLCKSZ;
  }

The function transfer_relfile has no clue about the skipping of FSM
files, but it contains comments about it.  The check "if (map->relpages <=
HEAP_FSM_CREATION_THRESHOLD ..." will needlessly be executed for each
segment.  I think there is some value in using the information from
this function to skip fsm files, but the code doesn't appear to fit
well; how about moving this check to a new function,
new_cluster_needs_fsm()?


> In the common case, the cost of the stat call is
> offset by not linking the FSM.
>

Agreed.

> Despite needing another pg_class field,
> I think this code is actually easier to read than my earlier versions.
>

Yeah, the code appears cleaner than the last version, but I think we
can do more in that regard.

One more minor comment:
snprintf(query + strlen(query), sizeof(query) - strlen(query),
  "SELECT all_rels.*, n.nspname, c.relname, "
- "  c.relfilenode, c.reltablespace, %s "
+ "  c.relfilenode, c.reltablespace, c.relpages, c.relkind, %s "
  "FROM (SELECT * FROM regular_heap "
  "      UNION ALL "
  "      SELECT * FROM toast_heap "
@@ -525,6 +530,8 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo)
  i_relname = PQfnumber(res, "relname");
  i_relfilenode = PQfnumber(res, "relfilenode");
  i_reltablespace = PQfnumber(res, "reltablespace");
+ i_relpages = PQfnumber(res, "relpages");
+ i_relkind = PQfnumber(res, "relkind");
  i_spclocation = PQfnumber(res, "spclocation");

The order in which relkind and relpages are used in the above code is
different from the order in which they are mentioned in the query.  It
won't matter, but keeping the order consistent will make the code
easier to follow.  I have made this and some more minor code
adjustments in the attached patch.  If you like those, you can include
them in the next version of your patch.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > So, in v19 we check pg_class.relpages and if it's
> > a heap and less than or equal the threshold we call stat on the 0th
> > segment to verify.
> >
>
> Okay, but the way logic is implemented appears clumsy to me.

> The function transfer_relfile has no clue about skipping of FSM stuff,
> but it contains comments about it.

Yeah, I wasn't entirely happy with how that turned out.

> I think there is some value in using the information from
> this function to skip fsm files, but the code doesn't appear to fit
> well, how about moving this check to new function
> new_cluster_needs_fsm()?

For v21, new_cluster_needs_fsm() has all responsibility for obtaining
the info it needs. I think this is much cleaner, but there is a small
bit of code duplication since it now has to form the file name. One
thing we could do is form the base old/new file names in
transfer_single_new_db() and pass those to transfer_relfile(), which
will only add suffixes and segment numbers. We could then pass the
base old file name to new_cluster_needs_fsm() and use it as is. Not
sure if that's worthwhile, though.

> The order in which relkind and relpages is used in the above code is
> different from the order in which it is mentioned in the query, it
> won't matter, but keeping in order will make look code consistent.  I
> have made this and some more minor code adjustments in the attached
> patch.  If you like those, you can include them in the next version of
> your patch.

Okay, done.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Sure, apart from this I have run pgindent on the patches and make some
> changes accordingly.  Latest patches attached (only second patch has
> some changes).  I will take one more pass on Monday morning (28th Jan)
> and will commit unless you or others see any problem.
>

Pushed these two patches.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Sure, apart from this I have run pgindent on the patches and make some
> > changes accordingly.  Latest patches attached (only second patch has
> > some changes).  I will take one more pass on Monday morning (28th Jan)
> > and will commit unless you or others see any problem.
>
> Pushed these two patches.

Thank you for your input and detailed review! Thank you Mithun for testing!

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 9:16 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > Sure, apart from this I have run pgindent on the patches and make some
> > > changes accordingly.  Latest patches attached (only second patch has
> > > some changes).  I will take one more pass on Monday morning (28th Jan)
> > > and will commit unless you or others see any problem.
> >
> > Pushed these two patches.
>
> Thank you for your input and detailed review! Thank you Mithun for testing!
>

There are a few buildfarm failures due to this commit; see my email on
pgsql-committers.  If you have time, please also take a look into
those.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> There are a few buildfarm failures due to this commit, see my email on
> pgsql-committers.  If you have time, you can also once look into
> those.

I didn't see anything in common with the configs of the failed
members. None have a non-default BLCKSZ that I can see.

Looking at this typical example from woodlouse:

================== pgsql.build/src/test/regress/regression.diffs
==================
--- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out
2019-01-28 04:43:09.031456700 +0100
+++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out
2019-01-28 05:06:20.351100400 +0100
@@ -26,7 +26,7 @@
 pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
  heap_size | fsm_size
 -----------+----------
-     24576 |        0
+     32768 |        0
 (1 row)

***It seems like the relation extended when the new records should
have gone into block 0.

 -- Extend table with enough blocks to exceed the FSM threshold
@@ -56,7 +56,7 @@
 SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
  fsm_size
 ----------
-    16384
+    24576
 (1 row)

***And here it seems vacuum didn't truncate the FSM. I wonder if the
heap didn't get truncated either.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 10:03 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > There are a few buildfarm failures due to this commit, see my email on
> > pgsql-committers.  If you have time, you can also once look into
> > those.
>
> I didn't see anything in common with the configs of the failed
> members. None have a non-default BLCKSZ that I can see.
>
> Looking at this typical example from woodlouse:
>
> ================== pgsql.build/src/test/regress/regression.diffs
> ==================
> --- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out
> 2019-01-28 04:43:09.031456700 +0100
> +++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out
> 2019-01-28 05:06:20.351100400 +0100
> @@ -26,7 +26,7 @@
>  pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
>   heap_size | fsm_size
>  -----------+----------
> -     24576 |        0
> +     32768 |        0
>  (1 row)
>
> ***It seems like the relation extended when the new records should
> have gone into block 0.
>
>  -- Extend table with enough blocks to exceed the FSM threshold
> @@ -56,7 +56,7 @@
>  SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
>   fsm_size
>  ----------
> -    16384
> +    24576
>  (1 row)
>
> ***And here it seems vacuum didn't truncate the FSM. I wonder if the
> heap didn't get truncated either.
>

Yeah, it seems to me that vacuum is not able to truncate the relation;
see my latest reply on another thread [1].

[1] - https://www.postgresql.org/message-id/CAA4eK1JntHd7X6dLJVPGYV917HejjhbMKXn9m_RnnCE162LbLA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 10:03 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > There are a few buildfarm failures due to this commit, see my email on
> > pgsql-committers.  If you have time, you can also once look into
> > those.
>
> I didn't see anything in common with the configs of the failed
> members. None have a non-default BLCKSZ that I can see.
>

I have done an analysis of the different failures on the buildfarm.

1.
@@ -26,7 +26,7 @@
 pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
  heap_size | fsm_size
 -----------+----------
-     24576 |        0
+     32768 |        0
 (1 row)

 -- Extend table with enough blocks to exceed the FSM threshold
@@ -56,7 +56,7 @@
 SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
  fsm_size
 ----------
-    16384
+    24576
 (1 row)


As discussed on another thread, this seems to be because a concurrent
auto-analyze doesn't allow vacuum to remove dead-row versions.  To fix
this, I think we should avoid having a dependency on vacuum to remove
dead rows.

2.
@@ -15,13 +15,9 @@
 SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
 ERROR:  block number 100 is out of range for relation "test_rel_forks"
 SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
- fsm_0
--------
-  8192
-(1 row)
-
+ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
 SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
-ERROR:  block number 10 is out of range for relation "test_rel_forks"
+ERROR:  could not open file "base/50769/50798_fsm": No such file or directory

This indicates that even though vacuum is executed, the FSM doesn't
get created.  This could be due to a different BLCKSZ, but the failed
machines don't seem to have a non-default value for it.  I am not sure
why this could happen; maybe we need to check in the failed regression
database to see the size of the relation?

3.  Failure on 'mantid'
2019-01-28 00:13:55.191 EST [123979] 001_pgbench_with_server.pl LOG:
statement: CREATE UNLOGGED TABLE insert_tbl (id
serial primary key);
2019-01-28 00:13:55.218 EST [123982] 001_pgbench_with_server.pl LOG:
execute P0_0: INSERT INTO insert_tbl SELECT
FROM generate_series(1,1000);
2019-01-28 00:13:55.219 EST [123983] 001_pgbench_with_server.pl LOG:
execute P0_0: INSERT INTO insert_tbl SELECT
FROM generate_series(1,1000);
2019-01-28 00:13:55.220 EST [123984] 001_pgbench_with_server.pl LOG:
execute P0_0: INSERT INTO insert_tbl SELECT
FROM generate_series(1,1000);
..
..
TRAP: FailedAssertion("!((rel->rd_rel->relkind == 'r' ||
rel->rd_rel->relkind == 't') && fsm_local_map.map[oldPage] == 0x01)",
File: "freespace.c", Line: 223)

I think this can happen if we forget to clear the local map after we
get a block with space in RelationGetBufferForTuple().  I see a race
condition in the code where that can happen.  Say we tried all the
blocks in the local map and then tried to extend the relation, but we
didn't get ConditionalLockRelationForExtension; in the meantime,
another backend has extended the relation and updated the FSM (via
RelationAddExtraBlocks).  Now the backend that didn't get the extension
lock will get the target block from the FSM, and that block number will
be greater than HEAP_FSM_CREATION_THRESHOLD.  Next, it will find that
the block can be used to insert a new row and return the buffer, but
it won't clear the local map due to the condition below:

@@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
+
+ /*
+ * In case we used an in-memory map of available blocks, reset it
+ * for next use.
+ */
+ if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
+ FSMClearLocalMap();
+

I think here you need to clear the map if it exists, or clear it
unconditionally; the earlier one would be better.

This test gets executed concurrently by 5 clients, so it can hit the
above race condition.

4.  Failure on jacana:
--- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out
2018-09-26
17:53:33 -0400
+++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out
2019-01-27 23:14:35
-0500
@@ -252,332 +252,7 @@
     ('(0,100)(0,infinity)'),
     ('(-infinity,0)(0,infinity)'),
     ('(-infinity,-infinity)(infinity,infinity)');
-SET enable_seqscan = false;
-SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
..
..
TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
"c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c",
Line:
1118)
..
2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG:  server process
(PID 14388) exited with exit code 3
2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL:  Failed process
was running: INSERT INTO box_temp
VALUES (NULL),

I think the reason for this failure is the same as the previous one
(as mentioned in point 3), but it can happen in a different way.  Say
we have searched the local map and then try to extend a relation 'X',
and in the meantime another backend has extended it such that it
creates a FSM.  Now we will reuse that page and won't clear the local
map.  Next, say we try to insert into relation 'Y', which doesn't have
a FSM.  It will try to set the local map and will find that it already
exists, so it will fail.  Now, the question is how this can happen in
the box.sql test.  I guess it is happening for some system table which
is being populated by the CREATE INDEX statement executed just before
the failing INSERT.

I think both 3 and 4 are timing issues, so we didn't catch them in our
local regression runs.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > There are a few buildfarm failures due to this commit, see my email on
> > > pgsql-committers.  If you have time, you can also once look into
> > > those.
> >
> > I didn't see anything in common with the configs of the failed
> > members. None have a non-default BLCKSZ that I can see.
> >
>
> I have done an analysis of the different failures on buildfarm.
>
>
> 2.
> @@ -15,13 +15,9 @@
>  SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
>  ERROR:  block number 100 is out of range for relation "test_rel_forks"
>  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> - fsm_0
> --------
> -  8192
> -(1 row)
> -
> +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
>  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> -ERROR:  block number 10 is out of range for relation "test_rel_forks"
> +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
>
> This indicates that even though the Vacuum is executed, but the FSM
> doesn't get created.  This could be due to different BLCKSZ, but the
> failed machines don't seem to have a non-default value of it.  I am
> not sure why this could happen, maybe we need to check once in the
> failed regression database to see the size of relation?
>

This symptom shows up on the buildfarm critters below:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-01-28%2005%3A05%3A22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-01-28%2003%3A20%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-01-28%2003%3A13%3A47
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-01-28%2003%3A07%3A39

All of these seem to run with fsync=off.  Is it possible that vacuum
has updated the FSM, but it was not synced to disk, so when we tried to
read it we didn't get the required page?  This is just a guess.

I have checked all the buildfarm failures, and I see only the 4
symptoms for which I have sent some initial analysis.  Please
cross-verify the same when you get a chance.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: WIP: Avoid creation of the free space map for small tables

From
Andrew Gierth
Date:
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes:

 Amit> All of these seems to run with fsync=off. Is it possible that
 Amit> vacuum has updated FSM, but the same is not synced to disk and
 Amit> when we try to read it, we didn't get the required page?

No.

fsync never affects what programs see while the system is running, only
what happens after an OS crash.

-- 
Andrew (irc:RhodiumToad)


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> 1.
> @@ -26,7 +26,7 @@
>  pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
>   heap_size | fsm_size
>  -----------+----------
> -     24576 |        0
> +     32768 |        0
>  (1 row)
>
>  -- Extend table with enough blocks to exceed the FSM threshold
> @@ -56,7 +56,7 @@
>  SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;
>   fsm_size
>  ----------
> -    16384
> +    24576
>  (1 row)
>
>
> As discussed on another thread, this seems to be due to the reason
> that a parallel auto-analyze doesn't allow vacuum to remove dead-row
> versions.  To fix this, I think we should avoid having a dependency on
> vacuum to remove dead rows.

Ok, to make the first test here more reliable I will try Andrew's idea
to use fillfactor to save free space. As I said earlier, I think that
second test isn't helpful and can be dropped.

> 2.
> @@ -15,13 +15,9 @@
>  SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
>  ERROR:  block number 100 is out of range for relation "test_rel_forks"
>  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> - fsm_0
> --------
> -  8192
> -(1 row)
> -
> +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
>  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> -ERROR:  block number 10 is out of range for relation "test_rel_forks"
> +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
>
> This indicates that even though the Vacuum is executed, but the FSM
> doesn't get created.  This could be due to different BLCKSZ, but the
> failed machines don't seem to have a non-default value of it.  I am
> not sure why this could happen, maybe we need to check once in the
> failed regression database to see the size of relation?

I'm also having a hard time imagining why this failed. Just in case,
we could return ctid in a plpgsql loop and stop as soon as we see the
5th block. I've done that for some tests during development, and it's
a safer method anyway.
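
Something like this, untested (the cast through point is just a quick
way to pull the block number out of a ctid):

DO $$
DECLARE
  curtid tid;
BEGIN
  LOOP
    INSERT INTO test_rel_forks VALUES (1) RETURNING ctid INTO curtid;
    -- stop once a row lands in block 4, i.e. the heap has 5 blocks
    EXIT WHEN (curtid::text::point)[0]::int >= 4;
  END LOOP;
END;
$$;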

> <timing failures in 3 and 4>
>
> @@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
> +
> + /*
> + * In case we used an in-memory map of available blocks, reset it
> + * for next use.
> + */
> + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> + FSMClearLocalMap();
> +
>
> I think here you need to clear the map if it exists or clear it
> unconditionally, the earlier one would be better.

Ok, maybe all callers should call it unconditionally, but within the
function, check "if (FSM_LOCAL_MAP_EXISTS)"?

Thanks for investigating the failures -- I'm a bit pressed for time this week.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Jan 29, 2019 at 12:37 AM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > 2.
> > @@ -15,13 +15,9 @@
> >  SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
> >  ERROR:  block number 100 is out of range for relation "test_rel_forks"
> >  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> > - fsm_0
> > --------
> > -  8192
> > -(1 row)
> > -
> > +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
> >  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> > -ERROR:  block number 10 is out of range for relation "test_rel_forks"
> > +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
> >
> > This indicates that even though the Vacuum is executed, but the FSM
> > doesn't get created.  This could be due to different BLCKSZ, but the
> > failed machines don't seem to have a non-default value of it.  I am
> > not sure why this could happen, maybe we need to check once in the
> > failed regression database to see the size of relation?
>
> I'm also having a hard time imagining why this failed. Just in case,
> we could return ctid in a plpgsql loop and stop as soon as we see the
> 5th block. I've done that for some tests during development and is a
> safer method anyway.
>

I think we can devise some concrete way, but it is better to first try
to understand why it failed; otherwise there is always a chance that
we will repeat the mistake in some other case.  I think we have no
other choice but to request the buildfarm owners to either give us
access to see what happens or help us in investigating the problem.
The four buildfarm members where it failed were lapwing, locust,
dromedary, and prairiedog.  Among these, the owner of the last two is
Tom Lane; the others I don't recognize.  Tom, Andrew, can you help us
in getting access to one of those four?  Yet another alternative is
that the owner can apply the attached patch (this is the same as what
got committed) or reset to commit ac88d2962a, execute the statements
below, and share the results:

CREATE EXTENSION pageinspect;

CREATE TABLE test_rel_forks (a int);
INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i;
VACUUM test_rel_forks;
SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0;
SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;

SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;

SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0;
SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1;

If the above statements give the error "ERROR:  could not open file ...", then run:
Analyze test_rel_forks;
Select oid, relname, relpages, reltuples from pg_class where relname
like 'test%';

The result of the above tests will tell us whether there are 5 pages
in the table or not.  If the table contains 5 pages and still throws an
error, then there is some bug in our code; otherwise, there is
something specific to those systems such that the above insert doesn't
result in 5 pages.

> > I think here you need to clear the map if it exists or clear it
> > unconditionally, the earlier one would be better.
>
> Ok, maybe all callers should call it unconditionally, but within the
> function, check "if (FSM_LOCAL_MAP_EXISTS)"?
>

Sounds sensible.  I think we should try to reproduce these failures;
for example, for the pgbench failure, we can try the same test with
more clients.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Jan 29, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 12:37 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> > > I think here you need to clear the map if it exists or clear it
> > > unconditionally, the earlier one would be better.
> >
> > Ok, maybe all callers should call it unconditionally, but within the
> > function, check "if (FSM_LOCAL_MAP_EXISTS)"?
> >
>
> Sounds sensible.  I think we should try to reproduce these failures,
> for ex. for pgbench failure, we can try the same test with more
> clients.
>

I am able to reproduce this by changing the pgbench test as below:

--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -56,9 +56,9 @@ $node->safe_psql('postgres',
  'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); ');

 pgbench(
- '--no-vacuum --client=5 --protocol=prepared --transactions=25',
+ '--no-vacuum --client=10 --protocol=prepared --transactions=25',
  0,
- [qr{processed: 125/125}],
+ [qr{processed: 250/250}],

You can find this change in the attached patch.  Then, I ran make
check in src/bin/pgbench multiple times using test_conc_insert.sh.
You can vary the number of times the test runs if you are not able
to reproduce it with this.

The attached patch (clear_local_map_if_exists_1.patch) atop the main
patch fixes the issue for me.  Kindly verify the same.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Masahiko Sawada
Date:
On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 12:37 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > 2.
> > > @@ -15,13 +15,9 @@
> > >  SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
> > >  ERROR:  block number 100 is out of range for relation "test_rel_forks"
> > >  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> > > - fsm_0
> > > --------
> > > -  8192
> > > -(1 row)
> > > -
> > > +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
> > >  SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> > > -ERROR:  block number 10 is out of range for relation "test_rel_forks"
> > > +ERROR:  could not open file "base/50769/50798_fsm": No such file or directory
> > >
> > > This indicates that even though the Vacuum is executed, but the FSM
> > > doesn't get created.  This could be due to different BLCKSZ, but the
> > > failed machines don't seem to have a non-default value of it.  I am
> > > not sure why this could happen, maybe we need to check once in the
> > > failed regression database to see the size of relation?
> >
> > I'm also having a hard time imagining why this failed. Just in case,
> > we could return ctid in a plpgsql loop and stop as soon as we see the
> > 5th block. I've done that for some tests during development and is a
> > safer method anyway.
> >
>
> I think we can devise some concrete way, but it is better first we try
> to understand why it failed, otherwise there is always a chance that
> we will repeat the mistake in some other case.  I think we have no
> other choice, but to request the buildfarm owners to either give us
> the access to see what happens or help us in investigating the
> problem. The four buildfarms where it failed were lapwing, locust,
> dromedary, prairiedog.   Among these, the owner of last two is Tom
> Lane and others I don't recognize.  Tom, Andrew, can you help us in
> getting the access of one of those four?  Yet another alternative is
> the owner can apply the patch attached (this is same what got
> committed) or reset to commit ac88d2962a and execute below statements
> and share the results:
>
> CREATE EXTENSION pageinspect;
>
> CREATE TABLE test_rel_forks (a int);
> INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i;
> VACUUM test_rel_forks;
> SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0;
> SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
>
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
>
> SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0;
> SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1;
>
> If the above statements give error: "ERROR:  could not open file ...", then run:
> Analyze test_rel_forks;
> Select oid, relname, relpages, reltuples from pg_class where relname
> like 'test%';
>
> The result of the above tests will tell us whether there are 5 pages
> in the table or not.  If the table contains 5 pages and throws an
> error, then there is some bug in our code, otherwise, there is
> something specific to those systems where the above insert doesn't
> result in 5 pages.

I'd suspect the alignment of integers. In my environment, the actual
tuple size is 28 bytes but the aligned size is 32 bytes (=
MAXALIGN(28)), so we can store 226 tuples in a single page. But if
MAXALIGN(28) = 28, then we can store 255 tuples, and 1000 tuples fit
within 4 pages. The MAXALIGN of the four buildfarms seems to be 4
according to the configure script, so MAXALIGN(28) might be 28 on
these buildfarms.

configure:16816: checking alignment of short
configure:16839: result: 2
configure:16851: checking alignment of int
configure:16874: result: 4
configure:16886: checking alignment of long
configure:16909: result: 4
configure:16922: checking alignment of long long int
configure:16945: result: 4
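
To spell out the arithmetic (assuming the default 8192-byte block with
a 24-byte page header, leaving 8168 usable bytes): each row needs
MAXALIGN(28) bytes of tuple data plus a 4-byte line pointer. With
MAXALIGN 8 that is 32 + 4 = 36 bytes, so floor(8168 / 36) = 226 tuples
per page, and 1000 tuples need 5 pages. With MAXALIGN 4 it is 28 + 4 =
32 bytes, so floor(8168 / 32) = 255 tuples per page, and 1000 tuples
fit in 4 pages, which does not exceed the FSM creation threshold.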

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Jan 29, 2019 at 5:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> I'd suspect the alignment of integer. In my environment, the tuple
> actual size is 28 bytes but the aligned size is 32 bytes (=
> MAXALIGN(28)), so we can store 226 tuples to single page. But if
> MAXALIGN(28) = 28 then we can store 255 tuples and 1000 tuples fits
> within 4 pages. The MAXALIGN of four buildfarms seem 4 according to
> the configure script so MAXALIGN(28) might be 28 on these buildfarms.
>

Good finding.  I was also wondering along these lines and wanted to
verify.  Thanks a lot.  So, this clearly explains the second failure
in my email above [1].  I think this means that for the fsm test we
also have to be careful when relying on the number of pages in the
test.  I think we have now found the reasons and solutions for the
first three problems mentioned in my email [1].  For problem 4 (the
failure on jacana), I have speculated on a theory, but I am not sure
how we can confirm it.  Can we try the patch on jacana before
considering the patch for commit?  Is there any other way we can
replicate that error?

[1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

> You can find this change in attached patch.  Then, I ran the make
> check in src/bin/pgbench multiple times using test_conc_insert.sh.
> You can vary the number of times the test should run, if you are not
> able to reproduce it with this.
>
> The attached patch (clear_local_map_if_exists_1.patch) atop the main
> patch fixes the issue for me.  Kindly verify the same.

I got one failure in 50 runs. With the new patch, I didn't get any
failures in 300 runs.


-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > You can find this change in attached patch.  Then, I ran the make
> > check in src/bin/pgbench multiple times using test_conc_insert.sh.
> > You can vary the number of times the test should run, if you are not
> > able to reproduce it with this.
> >
> > The attached patch (clear_local_map_if_exists_1.patch) atop the main
> > patch fixes the issue for me.  Kindly verify the same.
>
> I got one failure in 50 runs. With the new patch, I didn't get any
> failures in 300 runs.
>

Thanks for the verification.  I have included it in the attached
patch, and I have also modified the page.sql test to have enough pages
in the relation so that the FSM will get created irrespective of
alignment boundaries.  Masahiko-san, can you verify if this now works
for you?

There are two more failures which we need to do something about.
1. Make fsm.sql independent of vacuum without losing much coverage of
the newly added code.  John, I guess you have an idea; see if you can
take care of it, otherwise I will see what I can do.
2. I still could not figure out how to verify whether the failure on
jacana will be fixed.  I have posted a theory above, and the attached
patch has a solution for it, but I think it would be better if we find
some way to verify it.

Note - you might see some cosmetic changes in freespace.c due to pgindent.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> There are two more failures which we need to do something about.
> 1. Make fsm.sql independent of vacuum without much losing on coverage
> of newly added code.  John, I guess you have an idea, see if you can
> take care of it, otherwise, I will see what I can do for it.

I've attached a patch that applies on top of v19 and uses Andrew
Gierth's idea of using fillfactor to control free space. I've also
removed tests that relied on truncation and weren't very useful to
begin with.
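
To give an idea of the shape of the fillfactor approach, here is a
simplified sketch (not the actual patch contents; the fillfactor value
and row counts are invented):

-- with a low fillfactor, every page keeps guaranteed free space, so a
-- later small insert can reuse an existing block without vacuum's help
CREATE TABLE fsm_check_size (num int, str text) WITH (fillfactor = 80);
INSERT INTO fsm_check_size
  SELECT i, rpad('', 1024, 'a') FROM generate_series(1, 7) i;
INSERT INTO fsm_check_size VALUES (8, 'b');  -- should not extend the heap
SELECT pg_relation_size('fsm_check_size', 'main') AS heap_size,
       pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;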

--
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 30, 2019 at 3:26 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > There are two more failures which we need to do something about.
> > 1. Make fsm.sql independent of vacuum without much losing on coverage
> > of newly added code.  John, I guess you have an idea, see if you can
> > take care of it, otherwise, I will see what I can do for it.
>
> I've attached a patch that applies on top of v19 that uses Andrew
> Gierth's idea to use fillfactor to control free space. I've also
> removed tests that relied on truncation and weren't very useful to
> begin with.
>

This is much better than the earlier version of the test, and there is
no dependency on vacuum.  However, I feel there is still some
dependency on how the rows will fit in a page, and we have seen some
related failures due to alignment.  Looking at the test, I can't
envision any such problem, but how about we just write some simple
tests where we check that the FSM won't be created for a very small
number of records, say one or two, and that the FSM gets created when
we increase the number of records?  Here, if we want, we can even use
vacuum to ensure the FSM gets created.  Once we are sure that the main
patch passes all the buildfarm tests, we can extend the test to
something more advanced, as you are proposing now.  I think that will
reduce the chances of failure; what do you think?
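
In other words, the initial test could be as plain as this (a sketch
of the idea; the row counts are invented):

CREATE TABLE fsm_check_size (num int, str text);
INSERT INTO fsm_check_size VALUES (1, 'a');
VACUUM fsm_check_size;
SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;  -- expect 0
INSERT INTO fsm_check_size
  SELECT i, rpad('', 1024, 'a') FROM generate_series(2, 100) i;
VACUUM fsm_check_size;
SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size;  -- expect > 0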

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is much better than the earlier version of test and there is no
> dependency on the vacuum.  However, I feel still there is some
> dependency on how the rows will fit in a page and we have seen some
> related failures due to alignment stuff.  By looking at the test, I
> can't envision any such problem, but how about if we just write some
> simple tests where we can check that the FSM won't be created for very
> small number of records say one or two and then when we increase the
> records FSM gets created, here if we want, we can even use vacuum to
> ensure FSM gets created.  Once we are sure that the main patch passes
> all the buildfarm tests, we can extend the test to something advanced
> as you are proposing now.  I think that will reduce the chances of
> failure, what do you think?

That's probably a good idea to limit risk. I have just very basic tests
now, and vacuum before every relation size check to make sure any FSM
extension (whether desired or not) is invoked. Also, in my last patch
I forgot to implement explicit checks of the block number instead of
assuming how many rows will fit on a page. I've used a plpgsql code
block to do this.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Masahiko Sawada
Date:
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > > You can find this change in attached patch.  Then, I ran the make
> > > check in src/bin/pgbench multiple times using test_conc_insert.sh.
> > > You can vary the number of times the test should run, if you are not
> > > able to reproduce it with this.
> > >
> > > The attached patch (clear_local_map_if_exists_1.patch) atop the main
> > > patch fixes the issue for me.  Kindly verify the same.
> >
> > I got one failure in 50 runs. With the new patch, I didn't get any
> > failures in 300 runs.
> >
>
> Thanks for verification.  I have included it in the attached patch and
> I have also modified the page.sql test to have enough number of pages
> in relation so that FSM will get created irrespective of alignment
> boundaries.  Masahiko San, can you verify if this now works for you?
>

Thank you for updating the patch!

Could the modified page.sql test fail if the block size is more than
8kB? We can ensure the number of pages is more than 4 by checking it
and adding more data if there is not enough, but I'm really not sure
we should care about the bigger-block-size cases. However, maybe it's
good to check the number of pages after insertion so that we can
narrow down the issue in case the test fails again.
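
For example, something like this right after the INSERT (just a
sketch):

-- confirm the heap really has more than 4 blocks before testing the FSM
SELECT pg_relation_size('test_rel_forks') /
       current_setting('block_size')::int > 4 AS has_enough_pages;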

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This is much better than the earlier version of test and there is no
> > dependency on the vacuum.  However, I feel still there is some
> > dependency on how the rows will fit in a page and we have seen some
> > related failures due to alignment stuff.  By looking at the test, I
> > can't envision any such problem, but how about if we just write some
> > simple tests where we can check that the FSM won't be created for very
> > small number of records say one or two and then when we increase the
> > records FSM gets created, here if we want, we can even use vacuum to
> > ensure FSM gets created.  Once we are sure that the main patch passes
> > all the buildfarm tests, we can extend the test to something advanced
> > as you are proposing now.  I think that will reduce the chances of
> > failure, what do you think?
>
> That's probably a good idea to limit risk. I have just very basic tests
> now, and vacuum before every relation size check to make sure any FSM
> extension (whether desired or not) is invoked. Also, in my last patch
> I forgot to implement explicit checks of the block number instead of
> assuming how many rows will fit on a page. I've used a plpgsql code
> block to do this.
>

-- Extend table with enough blocks to exceed the FSM threshold
 -- FSM is created and extended to 3 blocks

The second comment line seems redundant to me, so I have removed it
and integrated the change into the main patch.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > > You can find this change in attached patch.  Then, I ran the make
> > > > check in src/bin/pgbench multiple times using test_conc_insert.sh.
> > > > You can vary the number of times the test should run, if you are not
> > > > able to reproduce it with this.
> > > >
> > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main
> > > > patch fixes the issue for me.  Kindly verify the same.
> > >
> > > I got one failure in 50 runs. With the new patch, I didn't get any
> > > failures in 300 runs.
> > >
> >
> > Thanks for verification.  I have included it in the attached patch and
> > I have also modified the page.sql test to have enough number of pages
> > in relation so that FSM will get created irrespective of alignment
> > boundaries.  Masahiko San, can you verify if this now works for you?
> >
>
> Thank you for updating the patch!
>
> The modified page.sql test could fail if the block size is more than
> 8kB?

That's right, but I don't think the current regression tests will work
for block sizes greater than 8KB.  I tried with 16 and 32 as the block
size; there were a few failures on head itself.

> We can ensure the number of pages are more than 4 by checking it
> and adding more data if no enough but I'm really not sure we should
> care the bigger-block size cases.
>

Yeah, I am not sure either.  I think as this is an existing test, we
should not try to change it too much.  However, if both you and John
feel it is better to change it, we can go with that.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > That's probably a good idea to limit risk. I have just very basic tests
> > now, and vacuum before every relation size check to make sure any FSM
> > extension (whether desired or not) is invoked. Also, in my last patch
> > I forgot to implement explicit checks of the block number instead of
> > assuming how many rows will fit on a page. I've used a plpgsql code
> > block to do this.
> >
>
> -- Extend table with enough blocks to exceed the FSM threshold
>  -- FSM is created and extended to 3 blocks
>
> The second comment line seems redundant to me, so I have removed it
> and integrated the change into the main patch.

FYI, the second comment is still present in v20.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > The modified page.sql test could fail if the block size is more than
> > 8kB?
>
> > That's right, but I don't think the current regression tests work for
> > block sizes greater than 8KB.  I tried with 16 and 32 as the block
> > size, and there were a few failures on head itself.
>
> > We can ensure the number of pages is more than 4 by checking it
> > and adding more data if not enough, but I'm really not sure we should
> > care about the bigger-block-size cases.
> >
>
> Yeah, I am not sure either.  I think as this is an existing test, we
> should not try to change it too much.  However, if both you and John
> > feel it is better to change it, we can go with that.

I have an idea -- instead of adding a bunch of records and hoping that
the relation size and free space are consistent across platforms, how
about we revert to the original test input, and add a BRIN index? That
should have a FSM even with one record.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > >
> > > That's probably a good idea to limit risk. I just run very basic tests
> > > now, and vacuum before every relation size check to make sure any FSM
> > > extension (whether desired or not) is invoked. Also, in my last patch
> > > I forgot to implement explicit checks of the block number instead of
> > > assuming how many rows will fit on a page. I've used a plpgsql code
> > > block to do this.
> > >
> >
> > -- Extend table with enough blocks to exceed the FSM threshold
> >  -- FSM is created and extended to 3 blocks
> >
> > The second comment line seems redundant to me, so I have removed it
> > and integrated the change into the main patch.
>
> FYI, the second comment is still present in v20.
>

Oops, I forgot to include it in the commit after making the change; done now.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 31, 2019 at 2:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > The modified page.sql test could fail if the block size is more than
> > > 8kB?
> >
> > That's right, but I don't think the current regression tests work for
> > block sizes greater than 8KB.  I tried with 16 and 32 as the block
> > size, and there were a few failures on head itself.
> >
> > > We can ensure the number of pages is more than 4 by checking it
> > > and adding more data if not enough, but I'm really not sure we should
> > > care about the bigger-block-size cases.
> > >
> >
> > Yeah, I am not sure either.  I think as this is an existing test, we
> > should not try to change it too much.  However, if both you and John
> > > feel it is better to change it, we can go with that.
>
> I have an idea -- instead of adding a bunch of records and hoping that
> the relation size and free space are consistent across platforms, how
> about we revert to the original test input, and add a BRIN index? That
> should have a FSM even with one record.
>

Why would a BRIN index allow having a FSM for the heap relation?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have an idea -- instead of adding a bunch of records and hoping that
> > the relation size and free space are consistent across platforms, how
> > about we revert to the original test input, and add a BRIN index? That
> > should have a FSM even with one record.
> >
>
> > Why would a BRIN index allow having a FSM for the heap relation?

Oops, I forgot this file is for testing heaps only. That said, we
could possibly put most of the FSM tests such as

SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));

into brin.sql since we know a non-empty BRIN index will have a FSM.
And in page.sql we could just have a test that the table has no FSM.
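
For the brin.sql side, a minimal sketch of what I have in mind (the
table and index names here are invented, and I haven't run this):

CREATE TABLE brin_fsm_test (a int);
INSERT INTO brin_fsm_test VALUES (1);
CREATE INDEX brin_fsm_idx ON brin_fsm_test USING brin (a);
-- a non-empty BRIN index should have a FSM even with one heap record
SELECT * FROM fsm_page_contents(get_raw_page('brin_fsm_idx', 'fsm', 0));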

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I have an idea -- instead of adding a bunch of records and hoping that
> > > the relation size and free space are consistent across platforms, how
> > > about we revert to the original test input, and add a BRIN index? That
> > > should have a FSM even with one record.
> > >
> >
> > Why would a BRIN index allow having a FSM for the heap relation?
>
> Oops, I forgot this file is for testing heaps only. That said, we
> could possibly put most of the FSM tests such as
>
> SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
>
> into brin.sql since we know a non-empty BRIN index will have a FSM.

As in the attached. Applies on top of v20. First to revert to HEAD,
second to move FSM tests to brin.sql. This is a much less invasive and
more readable patch, in addition to being hopefully more portable.

> And in page.sql we could just have a test that the table has no FSM.

This is not possible, since we don't know the relfilenode for the
error text, and it's not important. Better to have everything in
brin.sql.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Masahiko Sawada
Date:
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > > >
> > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > > You can find this change in the attached patch.  Then, I ran make
> > > > > check in src/bin/pgbench multiple times using test_conc_insert.sh.
> > > > > You can vary the number of times the test should run, if you are not
> > > > > able to reproduce it with this.
> > > > >
> > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main
> > > > > patch fixes the issue for me.  Kindly verify the same.
> > > >
> > > > I got one failure in 50 runs. With the new patch, I didn't get any
> > > > failures in 300 runs.
> > > >
> > >
> > > Thanks for the verification.  I have included it in the attached patch
> > > and also modified the page.sql test to have enough pages in the
> > > relation so that the FSM will get created irrespective of alignment
> > > boundaries.  Masahiko-san, can you verify if this now works for you?
> > >
> >
> > Thank you for updating the patch!
> >
> > The modified page.sql test could fail if the block size is more than
> > 8kB?
>
> That's right, but I don't think the current regression tests work for
> block sizes greater than 8KB.  I tried with 16 and 32 as the block
> size, and there were a few failures on head itself.

Understood.  That means no buildfarm animal is configured with a block size other than 8kB.

>
> > We can ensure the number of pages is more than 4 by checking it
> > and adding more data if not enough, but I'm really not sure we should
> > care about the bigger-block-size cases.
> >
>
> Yeah, I am not sure either.  I think as this is an existing test, we
> should not try to change it too much.  However, if both you and John
> feel it is better to change it, we can go with that.
>

So the patch you proposed looks good to me, but how about adding a
check that the table is more than 4 pages?  For example:

SELECT (pg_relation_size('test_rel_forks') /
current_setting('block_size')::int) > 4;
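
And if it is not, we could top it up along these lines (a rough
sketch, assuming test_rel_forks has a single int column):

DO $$
BEGIN
    -- keep inserting until the heap has at least 5 blocks
    WHILE pg_relation_size('test_rel_forks') <
          5 * current_setting('block_size')::int LOOP
        INSERT INTO test_rel_forks SELECT g FROM generate_series(1, 100) g;
    END LOOP;
END
$$;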


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 31, 2019 at 7:23 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > I have an idea -- instead of adding a bunch of records and hoping that
> > > > the relation size and free space are consistent across platforms, how
> > > > about we revert to the original test input, and add a BRIN index? That
> > > > should have a FSM even with one record.
> > > >
> > >
> > > Why would a BRIN index allow having a FSM for the heap relation?
> >
> > Oops, I forgot this file is for testing heaps only. That said, we
> > could possibly put most of the FSM tests such as
> >
> > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
> >
> > into brin.sql since we know a non-empty BRIN index will have a FSM.
>
> As in the attached. Applies on top of v20. First to revert to HEAD,
> second to move FSM tests to brin.sql. This is a much less invasive and
> more readable patch, in addition to being hopefully more portable.
>

I don't think that moving the fsm tests to brin.sql would be a good
approach; we want to have a separate test for each access method.  If
we want to do something to avoid portability issues, maybe we can do
what Masahiko-san has just suggested.  OTOH, I think we are just good
w.r.t. this issue with the last patch I sent.  Unless we see some
problem here, we should put energy into having a reproducible test for
the fourth problem mentioned in my mail upthread [1].  Do you think it
makes sense to run make check in a loop multiple times, or do you have
any idea how we can have a reproducible test?

[1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't think that moving the fsm tests to brin.sql would be a good
> approach; we want to have a separate test for each access method.  If
> we want to do something to avoid portability issues, maybe we can do
> what Masahiko-san has just suggested.

We could also use the same plpgsql loop as in fsm.sql to check the ctid, right?

> OTOH, I think we are just good w.r.t. this issue with the last patch
> I sent.  Unless we see some problem here, we should put energy into
> having a reproducible test for the fourth problem mentioned in my mail
> upthread [1].  Do you think it makes sense to run make check in a loop
> multiple times, or do you have any idea how we can have a reproducible
> test?

Okay. Earlier I tried running make installcheck with
force_parallel_mode='regress', but didn't get a failure. I may not
have run enough times, though. I'll have to think about how to induce
it.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 31, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't think that moving fsm tests to brin would be a good approach.
> > We want to have a separate test for each access method.  I think if we
> > want to do something to avoid portability issues, maybe we can do what
> > Masahiko San has just suggested.
>
> We could also use the same plpgsql loop as in fsm.sql to check the ctid, right?
>

Yes; however, I feel we should leave it as it is for now unless we see
any risk of portability issues.  The only reason to do it that way is
to avoid failures for bigger block sizes (say BLCKSZ of 16KB or 32KB).
Does anyone else have an opinion on whether we should try to write
tests which care about bigger block sizes?  I see that existing
regression tests fail if we configure with a bigger block size, so I'm
not sure if we should try to avoid that here.  In an ideal scenario, I
think it would be good if we could write tests which pass on all kinds
of block sizes; that would make life easier if tomorrow someone wants
to set up a buildfarm animal or do testing with bigger block sizes.

> > OTOH, I think we are just good w.r.t. this issue with the last patch
> > I sent.  Unless we see some problem here, we should put energy into
> > having a reproducible test for the fourth problem mentioned in my mail
> > upthread [1].  Do you think it makes sense to run make check in a loop
> > multiple times, or do you have any idea how we can have a reproducible
> > test?
>
> Okay. Earlier I tried running make installcheck with
> force_parallel_mode='regress', but didn't get a failure.
>

AFAICS, this failure was not for force_parallel_mode='regress'.  See
the config at [1].

> I may not
> have run enough times, though.
>

Yeah, probably running make check or make installcheck many times
would help, but I'm not sure.

> I'll have to think about how to induce
> it.
>

Thanks!

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-01-28%2004%3A00%3A23

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > There are a few buildfarm failures due to this commit, see my email on
> > > pgsql-committers.  If you have time, you can also once look into
> > > those.
> >
> > I didn't see anything in common with the configs of the failed
> > members. None have a non-default BLCKSZ that I can see.
> >
>
> I have done an analysis of the different failures on buildfarm.
>

In the past few days, we have done further analysis of each problem
and tried to reproduce it.  We succeeded in generating some form of
reproducer for 3 out of 4 problems, in the same way as they failed in
the buildfarm.  For the fourth symptom, we tried a lot (Andrew Dunstan
even helped us run the regression tests with the faulty commit on
Jacana for many hours, but it didn't get reproduced) but were not able
to regenerate a failure in a similar way.  However, I have a theory,
as explained below, of why the particular test could fail, and the fix
for it is in the patch.  I am planning to push the latest version of
the patch [1], which has fixes for all the symptoms.  Does anybody
have any opinion here?

>
> 4.  Failure on jacana:
> --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out
> 2018-09-26
> 17:53:33 -0400
> +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out
> 2019-01-27 23:14:35
> -0500
> @@ -252,332 +252,7 @@
>      ('(0,100)(0,infinity)'),
>      ('(-infinity,0)(0,infinity)'),
>      ('(-infinity,-infinity)(infinity,infinity)');
> -SET enable_seqscan = false;
> -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
> ..
> ..
> TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
> "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c",
> Line:
> 1118)
> ..
> 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG:  server process
> (PID 14388) exited with exit code 3
> 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL:  Failed process
> was running: INSERT INTO box_temp
> VALUES (NULL),
>
> I think the reason for this failure is the same as the previous one (as
> mentioned in point-3), but this can happen in a different way.  Say we
> have searched the local map and then try to extend a relation 'X', and
> in the meantime another backend has extended it such that it creates a
> FSM.  Now we will reuse that page and won't clear the local map.  Next,
> say we try to insert into relation 'Y', which doesn't have a FSM.  It
> will try to set the local map and will find that it already exists, so
> it will fail.  Now, the question is how this can happen in the box.sql
> test.  I guess it is happening for some system table which is being
> populated by the Create Index statement executed just before the
> failing Insert.
>
> I think both 3 and 4 are timing issues, so we didn't hit them in our
> local regression runs.
>

[1] - https://www.postgresql.org/message-id/CAA4eK1%2B3ajhRPC0jvUi6p_aMrTUpB568OBH10LrbHtvOLNTgqQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Feb 2, 2019 at 7:30 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> > <john.naylor@2ndquadrant.com> wrote:
>
> In the past few days, we have done further analysis of each problem
> and tried to reproduce it.  We succeeded in generating some form of
> reproducer for 3 out of 4 problems, in the same way as they failed in
> the buildfarm.  For the fourth symptom, we tried a lot (Andrew Dunstan
> even helped us run the regression tests with the faulty commit on
> Jacana for many hours, but it didn't get reproduced) but were not able
> to regenerate a failure in a similar way.  However, I have a theory,
> as explained below, of why the particular test could fail, and the fix
> for it is in the patch.  I am planning to push the latest version of
> the patch [1], which has fixes for all the symptoms.
>

Today, I spent some more time generating a test which can reproduce
the failure, though with the help of breakpoints.  See below:

> >
> > 4.  Failure on jacana:
> > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out
> > 2018-09-26
> > 17:53:33 -0400
> > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out
> > 2019-01-27 23:14:35
> > -0500
> > @@ -252,332 +252,7 @@
> >      ('(0,100)(0,infinity)'),
> >      ('(-infinity,0)(0,infinity)'),
> >      ('(-infinity,-infinity)(infinity,infinity)');
> > -SET enable_seqscan = false;
> > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)';
> > ..
> > ..
> > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
> > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c",
> > Line:
> > 1118)
> > ..
> > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG:  server process
> > (PID 14388) exited with exit code 3
> > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL:  Failed process
> > was running: INSERT INTO box_temp
> > VALUES (NULL),
> >
> > I think the reason for this failure is the same as the previous one (as
> > mentioned in point-3), but this can happen in a different way.  Say we
> > have searched the local map and then try to extend a relation 'X', and
> > in the meantime another backend has extended it such that it creates a
> > FSM.  Now we will reuse that page and won't clear the local map.  Next,
> > say we try to insert into relation 'Y', which doesn't have a FSM.  It
> > will try to set the local map and will find that it already exists, so
> > it will fail.  Now, the question is how this can happen in the box.sql
> > test.  I guess it is happening for some system table which is being
> > populated by the Create Index statement executed just before the
> > failing Insert.
> >

Based on the above theory, the test is as below:

Session-1
---------------
postgres=# create table test_1(c1 int, c2 char(1500));
CREATE TABLE
postgres=# create table test_2(c1 int, c2 char(1500));
CREATE TABLE
postgres=# insert into test_1 values(generate_series(1,20),'aaaa');
INSERT 0 20
postgres=# insert into test_2 values(1,'aaaa');
INSERT 0 1

Session-2
----------------
postgres=# analyze test_1;
ANALYZE
postgres=# analyze test_2;
ANALYZE
postgres=# select oid, relname, relpages from pg_class where relname
like 'test%';
  oid  |    relname     | relpages
-------+----------------+----------
 41835 | test_1         |        4
 41838 | test_2         |        1

Up to here we can see that test_1 has 4 pages and test_2 has 1 page.
At this stage, even one more record inserted into test_1 will create a
new page.  So now we have to hit the scenario with the help of a
debugger.  For session-1, attach a debugger and put a breakpoint in
RelationGetBufferForTuple().

Session-1
----------------
postgres=# insert into test_1 values(21,'aaaa');

It will hit the breakpoint in RelationGetBufferForTuple().  Now add
one more breakpoint at line 542 in hio.c, i.e., the line below:
RelationGetBufferForTuple
{
..
else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))

Press continue and stop the debugger at line 542.  Attach a debugger
to session-2 and add a breakpoint at line 580 in hio.c, i.e., the line
below:
RelationGetBufferForTuple
{
..
buffer = ReadBufferBI(relation, P_NEW, bistate);

Session-2
---------------
postgres=# insert into test_1 values(22,'aaaa');

It will hit the breakpoint in RelationGetBufferForTuple().  Now step
session-1 forward by one step in the debugger; this is to ensure that
session-1 doesn't get the conditional lock.  Then continue the
debugger in session-2, and after that run vacuum test_1 from
session-2; this will ensure that the FSM is created for relation
test_1.  So the state of session-2 will be as below:

Session-2
----------------
postgres=# insert into test_1 values(22,'aaaa');
INSERT 0 1
postgres=# vacuum test_1;
VACUUM
postgres=# select oid, relname, relpages from pg_class where relname
like 'test%';
  oid  |    relname     | relpages
-------+----------------+----------
 41835 | test_1         |        5
 41838 | test_2         |        1

Continue the debugger in session-1.  Now insert one row in test_2 from
session-1 and kaboom:

Session-1
---------------
postgres=# insert into test_2 values(2,'aaaa');
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Server logs is as below:
TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File:
"..\postgresql\src\backend\storage\freespace\freespace.c",
Line: 1118)
2019-02-02 16:48:06.216 IST [4044] LOG:  server process (PID 3540)
exited with exit code 3
2019-02-02 16:48:06.216 IST [4044] DETAIL:  Failed process was
running: insert into test_2 values(2,'aaaa');

This looks exactly the same as the failure in Jacana.

The patch fixes this case.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> >
> > FYI, the second comment is still present in v20.
> >
>
> Oops, I forgot to include it in the commit after making the change; done now.
>

This doesn't apply cleanly after recent commit 0d1fe9f74e.  Attached
is a rebased version.  I have checked that the changes done by
0d1fe9f74e don't impact this patch.  John, see if you can also confirm
whether the recent commit (0d1fe9f74e) has any impact.  I am planning
to push this tomorrow morning (IST) unless you or anyone else sees a
problem with it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This doesn't apply cleanly after recent commit 0d1fe9f74e.  Attached
> is a rebased version.  I have checked that the changes done by
> 0d1fe9f74e don't impact this patch.  John, see if you can also confirm
> whether the recent commit (0d1fe9f74e) has any impact.  I am planning
> to push this tomorrow morning (IST) unless you or anyone else sees a
> problem with it.

Since that commit changes RelationAddExtraBlocks(), which can be
induced by your pgbench adjustment upthread, I ran make check with
that adjustment in the pgbench dir 300 times without triggering
asserts.

I also tried to follow the logic in 0d1fe9f74e, and I believe it will
be correct without a FSM.

[1] https://www.postgresql.org/message-id/CAA4eK1KRByXY03qR2JvUjUxKBzpBnCSO5H19oAC%3D_v4r5dzTwQ%40mail.gmail.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This doesn't apply cleanly after recent commit 0d1fe9f74e.  Attached
> > is a rebased version.  I have checked that the changes done by
> > 0d1fe9f74e don't impact this patch.  John, see if you can also confirm
> > whether the recent commit (0d1fe9f74e) has any impact.  I am planning
> > to push this tomorrow morning (IST) unless you or anyone else sees a
> > problem with it.
>
> Since that commit changes RelationAddExtraBlocks(), which can be
> > induced by your pgbench adjustment upthread, I ran make check with
> that adjustment in the pgbench dir 300 times without triggering
> asserts.
>
> I also tried to follow the logic in 0d1fe9f74e, and I believe it will
> be correct without a FSM.
>

I have just pushed it, and the buildfarm has shown two failures:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26

--- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out
2019-02-03 21:27:29.000000000 -0500
+++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out
2019-02-03 21:41:32.000000000 -0500
@@ -38,19 +38,19 @@
 SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
  fsm_page_contents
 -------------------
- 0: 39            +
- 1: 39            +
- 3: 39            +
- 7: 39            +
- 15: 39           +
- 31: 39           +
- 63: 39           +
..

This one seems to be a FSM test portability issue (due to different
page contents, maybe).  I'm looking into it; John, see if you are
around and have some thoughts on it.

2.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2019-02-04%2002%3A30%3A25

select explain_parallel_append('execute ab_q5 (33, 44, 55)');
-                            explain_parallel_append
--------------------------------------------------------------------------------
- Finalize Aggregate (actual rows=1 loops=1)
-   ->  Gather (actual rows=3 loops=1)
-         Workers Planned: 2
-         Workers Launched: 2
-         ->  Partial Aggregate (actual rows=1 loops=3)
-               ->  Parallel Append (actual rows=0 loops=N)
-                     Subplans Removed: 8
-                     ->  Parallel Seq Scan on ab_a1_b1 (never executed)
-                           Filter: ((b < 4) AND (a = ANY (ARRAY[$1, $2, $3])))
-(9 rows)
-
+ERROR:  lost connection to parallel worker
+CONTEXT:  PL/pgSQL function explain_parallel_append(text) line 5 at
FOR over EXECUTE statement
 -- Test Parallel Append with PARAM_EXEC Params
 select explain_parallel_append('select count(*) from ab where (a =
(select 1) or a = (select 3)) and b = 2');
                          explain_parallel_append

The failure is something like:

2019-02-03 21:44:42.456 EST [2812:327] pg_regress/partition_prune LOG:
 statement: select explain_parallel_append('execute ab_q5 (1, 1, 1)');
2019-02-03 21:44:42.493 EST [2812:328] pg_regress/partition_prune LOG:
 statement: select explain_parallel_append('execute ab_q5 (2, 3, 3)');
2019-02-03 21:44:42.531 EST [2812:329] pg_regress/partition_prune LOG:
 statement: select explain_parallel_append('execute ab_q5 (33, 44,
55)');
2019-02-04 02:44:42.552 GMT [4172] FATAL:  could not reattach to
shared memory (key=00000000000001B4, addr=0000000001980000): error
code 487
2019-02-03 21:44:42.555 EST [5116:6] LOG:  background worker "parallel
worker" (PID 4172) exited with exit code 1
2019-02-03 21:44:42.560 EST [2812:330] pg_regress/partition_prune
ERROR:  lost connection to parallel worker

I don't think this is related to this commit.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This doesn't apply cleanly after recent commit 0d1fe9f74e.  Attached
> > > is a rebased version.  I have checked that the changes done by
> > > 0d1fe9f74e don't impact this patch.  John, see if you can also confirm
> > > whether the recent commit (0d1fe9f74e) has any impact.  I am planning
> > > to push this tomorrow morning (IST) unless you or anyone else sees a
> > > problem with it.
> >
> > Since that commit changes RelationAddExtraBlocks(), which can be
> > induced by your pgbench adjustment upthread, I ran make check with
> > that adjustment in the pgbench dir 300 times without triggering
> > asserts.
> >
> > I also tried to follow the logic in 0d1fe9f74e, and I believe it will
> > be correct without a FSM.
> >
>
> I have just pushed it and buildfarm has shown two failures:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26
>
> --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out
> 2019-02-03 21:27:29.000000000 -0500
> +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out
> 2019-02-03 21:41:32.000000000 -0500
> @@ -38,19 +38,19 @@
>  SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));
>   fsm_page_contents
>  -------------------
> - 0: 39            +
> - 1: 39            +
> - 3: 39            +
> - 7: 39            +
> - 15: 39           +
> - 31: 39           +
> - 63: 39           +
> ..
>
> This one seems to be a FSM test portability issue (due to different
> page contents, maybe).  I'm looking into it; John, see if you are
> around and have some thoughts on it.
>

One more similar failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01

So, basically, this is due to a difference in the number of tuples
that can fit on a page.  The free space recorded in the FSM for a page
is shown differently because of the available space on that particular
page, which can vary due to alignment.  It seems to me we can't rely
on FSM contents if there are many tuples in a relation.  One idea is
to get rid of the dependency on FSM contents in this test; can you
think of any better way to have consistent FSM contents across
different platforms?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 9:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> One more similar failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01
>
> So, basically, this is due to a difference in the number of tuples
> that can fit on a page.  The free space recorded in the FSM for a page
> is shown differently because of the available space on that particular
> page, which can vary due to alignment.  It seems to me we can't rely
> on FSM contents if there are many tuples in a relation.  One idea is
> to get rid of the dependency on FSM contents in this test; can you
> think of any better way to have consistent FSM contents across
> different platforms?
>

One more idea could be that we devise a test (say with a char/varchar)
such that it always consumes the same space in a page irrespective of
alignment.  Yet another way could be to use explain (costs off,
analyze on, timing off, summary off) ...; this will ensure that we
still have test coverage for the function fsm_page_contents, but don't
rely on its contents.  What do you think?  I will go with the last
option to stabilize the buildfarm tests unless anyone thinks otherwise
or has a better idea.  I will probably wait for 20 minutes or so to
see if anyone has inputs.
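
To be concrete about the last option, I mean something like the below
(an untested sketch; only the plan shape, not the FSM contents, would
then appear in the expected output):

EXPLAIN (COSTS OFF, ANALYZE ON, TIMING OFF, SUMMARY OFF)
SELECT fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));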

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> This one seems to be a FSM test portability issue (due to different
> page contents, maybe).  I'm looking into it; John, see if you are
> around and have some thoughts on it.

Maybe we can use the same plpgsql loop as in fsm.sql, which exits
after 1 tuple has been inserted into the 5th page.
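
An untested sketch of what I mean (assuming the table has a single int
column; the 5th page is block 4):

DO $$
DECLARE
    blkno int := 0;
    i int := 0;
BEGIN
    -- insert until the new tuple's ctid lands on block 4
    WHILE blkno < 4 LOOP
        i := i + 1;
        INSERT INTO test_rel_forks VALUES (i)
            RETURNING (ctid::text::point)[0]::int INTO blkno;
    END LOOP;
END
$$;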

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This one seems to be a FSM test portability issue (due to different
> > page contents, maybe).  I'm looking into it; John, see if you are
> > around and have some thoughts on it.
>
> Maybe we can use the same plpgsql loop as in fsm.sql, which exits
> after 1 tuple has been inserted into the 5th page.
>

Yeah, that can also work, but we still need to be careful about the
alignment of that one tuple; otherwise there could be different free
space on the fifth page.  A probably easier way could be to use an
even number of integers in the table, say (int, int).  Anyway, for
now, I have avoided the dependency on FSM contents without losing test
coverage.  I have pushed my latest suggestion from the previous
email.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This one seems to be a FSM test portability issue (due to different
> > > page contents, maybe).  I'm looking into it; John, see if you are
> > > around and have some thoughts on it.
> >
> > Maybe we can use the same plpgsql loop as in fsm.sql, which exits
> > after 1 tuple has been inserted into the 5th page.
> >
>
> Yeah, that can also work, but we still need to be careful about the
> alignment of that one tuple; otherwise there could be different free
> space on the fifth page.  A probably easier way could be to use an
> even number of integers in the table, say (int, int).  Anyway, for
> now, I have avoided the dependency on FSM contents without losing test
> coverage.  I have pushed my latest suggestion from the previous
> email.
>

The change seems to have worked.  All the buildfarm machines that were
showing the failure are passing now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, that can also work, but we still need to be careful about the
> > alignment of that one tuple; otherwise there could be different free
> > space on the fifth page.  A probably easier way could be to use an
> > even number of integers in the table, say (int, int).  Anyway, for
> > now, I have avoided the dependency on FSM contents without losing test
> > coverage.  I have pushed my latest suggestion from the previous
> > email.
> >
>
> The change seems to have worked.  All the buildfarm machines that were
> showing the failure are passing now.

Excellent!

Now that the buildfarm is green as far as this patch goes, I will
touch on a few details to round out development in this area:

1. Earlier, I had a test to ensure that free space towards the front
of the relation was visible with no FSM. In [1], I rewrote it without
using vacuum, so we can consider adding it back now if desired. I can
prepare a patch for this.

2. As a follow-on, since we don't rely on vacuum to remove dead rows,
we could try putting the fsm.sql test in some existing group in the
parallel schedule, rather than its own group as it is now.

3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
patch's effects on GetRecordedFreeSpace(), which will return zero for
tables with no FSM. The other callers are in:
contrib/pg_freespacemap/pg_freespacemap.c
contrib/pgstattuple/pgstatapprox.c

For pg_freespacemap, this doesn't matter, since it's just reporting
the facts. For pgstattuple_approx(), it might under-estimate the free
space and over-estimate the number of live tuples. This might be fine,
since it is approximate after all, but maybe a comment would be
helpful. If this is a problem, we could tweak it to be more precise
for tables without FSMs. Thoughts?

4. The latest patch for the pg_upgrade piece was in [2]

Anything else?

[1] https://www.postgresql.org/message-id/CACPNZCvEXLUx10pFvNcOs88RvqemMEjOv7D9MhL3ac86EzjAOA%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The change seems to have worked.  All the buildfarm machines that were
> > showing the failure are passing now.
>
> Excellent!
>
> Now that the buildfarm is green as far as this patch goes,
>

There is still one recent failure which I don't think is related to the commit of this patch:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-02-04%2016%3A38%3A48

================== pgsql.build/src/bin/pg_ctl/tmp_check/log/004_logrotate_primary.log ===================
TRAP: FailedAssertion("!(UsedShmemSegAddr != ((void *)0))", File: "g:\prog\bf\root\head\pgsql.build\src\backend\port\win32_shmem.c", Line: 513)

I think we need to do something about this random failure, but not as part of this thread/patch.


> I will
> touch on a few details to round out development in this area:
>
> 1. Earlier, I had a test to ensure that free space towards the front
> of the relation was visible with no FSM. In [1], I rewrote it without
> using vacuum, so we can consider adding it back now if desired. I can
> prepare a patch for this.
>

Yes, this is required.  It is generally a good practice to add a test (unless it takes a lot of time) which covers new code/functionality.

> 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> we could try putting the fsm.sql test in some existing group in the
> parallel schedule, rather than its own group as it is now.
>

+1.

> 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> patch's effects on GetRecordedFreeSpace(), which will return zero for
> tables with no FSM.
>

Right, but what exactly do we want to do about it?  Do you want to add a comment atop this function?

> The other callers are in:
> contrib/pg_freespacemap/pg_freespacemap.c
> contrib/pgstattuple/pgstatapprox.c
>
> For pg_freespacemap, this doesn't matter, since it's just reporting
> the facts. For pgstattuple_approx(), it might under-estimate the free
> space and over-estimate the number of live tuples.
>

Sure, but it can do so without the patch also, if vacuum hasn't updated the free space map.

> This might be fine,
> since it is approximate after all, but maybe a comment would be
> helpful. If this is a problem, we could tweak it to be more precise
> for tables without FSMs.
>

Sounds reasonable to me.

> Thoughts?
>
> 4. The latest patch for the pg_upgrade piece was in [2]
>

It will be good if we get this one as well.  I will look into it once we are done with the other points you have mentioned.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > 1. Earlier, I had a test to ensure that free space towards the front
> > of the relation was visible with no FSM. In [1], I rewrote it without
> > using vacuum, so we can consider adding it back now if desired. I can
> > prepare a patch for this.
> >
>
> > Yes, this is required.  It is generally a good practice to add a test
> > (unless it takes a lot of time) which covers new code/functionality.
>
> > 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> > we could try putting the fsm.sql test in some existing group in the
> > > parallel schedule, rather than its own group as it is now.
> >
>
> +1.

This is done in 0001.

> > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> > patch's effects on GetRecordedFreeSpace(), which will return zero for
> > tables with no FSM.
> >
>
> Right, but what exactly do we want to do about it?  Do you want to add a comment atop this function?

Hmm, the comment already says "according to the FSM", so maybe it's
already obvious. I was thinking more about maybe commenting the
callsite where it's helpful, as in 0002.

> > The other callers are in:
> > contrib/pg_freespacemap/pg_freespacemap.c
> > contrib/pgstattuple/pgstatapprox.c
> >
> > For pg_freespacemap, this doesn't matter, since it's just reporting
> > the facts. For pgstattuple_approx(), it might under-estimate the free
> > space and over-estimate the number of live tuples.
> >
>
> Sure, but it can do so without the patch also, if vacuum hasn't updated the free space map.

Okay, then maybe we don't need to do anything else here.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> > >
> > > 1. Earlier, I had a test to ensure that free space towards the front
> > > of the relation was visible with no FSM. In [1], I rewrote it without
> > > using vacuum, so we can consider adding it back now if desired. I can
> > > prepare a patch for this.
> > >
> >
> > Yes, this is required.  It is generally a good practice to add a test
> > (unless it takes a lot of time) which covers new code/functionality.
> >
> > > 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> > > we could try putting the fsm.sql test in some existing group in the
> > > parallel schedule, rather than its own group as it is now.
> > >
> >
> > +1.
>
> This is done in 0001.
>

This is certainly a good test w.r.t. code coverage of the new code,
but I have a few comments:
1. The size of records in the test still depends on alignment
(MAXALIGN).  Though it doesn't seem to be a problematic case, I still
suggest we avoid using records whose size depends on alignment.  If
you change the schema to CREATE TABLE fsm_check_size (num1 int, num2
int, str text);, then you can avoid alignment-related issues for the
records used in the test.
2.
+-- Fill most of the last block
..
+-- Make sure records can go into any block but the last one
..
+-- Insert large record and make sure it does not cause the relation to extend

The comments in some parts of the test seem too focused on the
algorithm used for the in-memory map.  I think we can keep these if we
want, but we should also write a more generic comment stating the
actual motive of the additional tests (basically we are testing the
functionality of the in-memory map (LSM) for the heap, so we should
write about that).


> > > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> > > patch's effects on GetRecordedFreeSpace(), which will return zero for
> > > tables with no FSM.
> > >
> >
> > Right, but what exactly do we want to do about it?  Do you want to add a comment atop this function?
>
> Hmm, the comment already says "according to the FSM", so maybe it's
> already obvious. I was thinking more about maybe commenting the
> callsite where it's helpful, as in 0002.
>
> > > The other callers are in:
> > > contrib/pg_freespacemap/pg_freespacemap.c
> > > contrib/pgstattuple/pgstatapprox.c
> > >
> > > For pg_freespacemap, this doesn't matter, since it's just reporting
> > > the facts. For pgstattuple_approx(), it might under-estimate the free
> > > space and over-estimate the number of live tuples.
> > >
> >
> > Sure, but it can do so without the patch also, if vacuum hasn't updated the free space map.
>
> Okay, then maybe we don't need to do anything else here.
>

Shall we add a note to the docs of pg_freespacemap and
pgstattuple_approx indicating that for small relations, the FSM won't
be created, so these functions won't give an appropriate value?
Another possibility could be to return an error if the block number is
less than the threshold value, but I'm not sure that is a good
alternative, as that can happen today also if vacuum hasn't run on
the table.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com>
> wrote:
>>
>> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
> This is certainly a good test w.r.t. code coverage of the new code,
> but I have a few comments:
> 1. The size of records in the test still depends on alignment
> (MAXALIGN).  Though it doesn't seem to be a problematic case, I still
> suggest we avoid using records whose size depends on alignment.  If
> you change the schema to CREATE TABLE fsm_check_size (num1 int, num2
> int, str text);, then you can avoid alignment-related issues for the
> records used in the test.

Done.

> 2.
> +-- Fill most of the last block
> ..
> +-- Make sure records can go into any block but the last one
> ..
> +-- Insert large record and make sure it does not cause the relation to
> extend
>
> The comments in some parts of the test seem too focused on the
> algorithm used for the in-memory map.  I think we can keep these if we
> want, but we should also write a more generic comment stating the
> actual motive of the additional tests (basically we are testing the
> functionality of the in-memory map (LSM) for the heap, so we should
> write about that).

Done.

> Shall we add a note to the docs of pg_freespacemap and
> pgstattuple_approx indicating that for small relations, the FSM won't
> be created, so these functions won't give an appropriate value?

I've given this a try in 0002.

> Another
> possibility could be to return an error if the block number is
> less than the threshold value, but I'm not sure that is a good
> alternative, as that can happen today also if vacuum hasn't run on
> the table.

Yeah, an error doesn't seem helpful.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 11, 2019 at 10:48 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Shall we add a note to the docs of pg_freespacemap and
> > pgstattuple_approx indicating that for small relations, the FSM won't
> > be created, so these functions won't give an appropriate value?
>
> I've given this a try in 0002.
>

This looks mostly correct, but I have a few observations:
1.
-      tuples.
+      tuples. Small tables don't have a free space map, so in that case
+      this function will report zero free space, likewise inflating the
+      estimated number of live tuples.

The last part of the sentence, "likewise inflating the estimated
number of live tuples", seems incorrect to me, because live tuples are
computed based on the pages scanned, the live tuples in them, and the
total blocks in the relation.  So I think it should be "likewise
inflating the approximate tuple length".

2.
+   In addition, small tables don't have a free space map, so this function
+   will return zero even if free space is available.

Actually, the paragraph you have modified applies to both the
functions mentioned on that page.  So instead of saying "this function
..", we can say "these functions .."

3.
* space from the FSM and move on.
+ * Note: If a relation has no FSM, GetRecordedFreeSpace() will report
+ * zero free space.  This is fine for the purposes of approximation.
  */

It is better to have an empty line before Note: ...

I have modified the patch for the above observations and added a
commit message as well; see if it looks okay to you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Feb 20, 2019 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have modified the patch for the above observations and added a
> commit message as well, see if it looks okay to you.

Looks good to me, thanks.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 11, 2019 at 10:48 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com>
> > wrote:
> >>
> >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> > This is certainly a good test w.r.t. code coverage of the new code,
> > but I have a few comments:
> > 1. The size of records in the test still depends on alignment
> > (MAXALIGN).  Though it doesn't seem to be a problematic case, I still
> > suggest we avoid using records whose size depends on alignment.  If
> > you change the schema to CREATE TABLE fsm_check_size (num1 int, num2
> > int, str text);, then you can avoid alignment-related issues for the
> > records used in the test.
>
> Done.
>
> > 2.
> > +-- Fill most of the last block
> > ..
> > +-- Make sure records can go into any block but the last one
> > ..
> > +-- Insert large record and make sure it does not cause the relation to
> > extend
> >
> > The comments in some parts of the test seem too focused on the
> > algorithm used for the in-memory map.  I think we can keep these if we
> > want, but we should also write a more generic comment stating the
> > actual motive of the additional tests (basically we are testing the
> > functionality of the in-memory map (LSM) for the heap, so we should
> > write about that).
>
> Done.
>

Thanks, the modification looks good.  I have slightly changed the
commit message in the attached patch.  I will spend some more time
tomorrow morning on this and will commit unless I see any new problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
Please remember to keep serial_schedule in sync.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Please remember to keep serial_schedule in sync.
>

I don't understand what you mean by this.  It is already present in
serial_schedule; in parallel_schedule, we are just moving this test
to one of the parallel groups.  Do we need to take care of something
else?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
On 2019-Feb-21, Amit Kapila wrote:

> On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > Please remember to keep serial_schedule in sync.
> 
> I don't understand what you mean by this?  It is already present in
> serial_schedule.  In parallel_schedule, we are just moving this test
> to one of the parallel groups.  Do we need to take care of something
> else?

Just to make sure it's in the same relative position.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Feb 11, 2019 at 10:48 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com>
> > wrote:
> >>
> >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> > This is certainly a good test w.r.t. code coverage of the new code,
> > but I have a few comments:
> > 1. The size of records in the test still depends on alignment
> > (MAXALIGN).  Though it doesn't seem to be a problematic case, I still
> > suggest we avoid using records whose size depends on alignment.  If
> > you change the schema to CREATE TABLE fsm_check_size (num1 int, num2
> > int, str text);, then you can avoid alignment-related issues for the
> > records used in the test.
>
> Done.
>

Oops, on again carefully studying the test, I realized my above
comment was wrong.  Let me explain with a test this time:
CREATE TABLE fsm_check_size (num int, str text);
INSERT INTO fsm_check_size SELECT i, rpad('', 1024, 'a') FROM
generate_series(1,3) i;

So here you are inserting a 4-byte integer and a 1024-byte
variable-length value.  The tuple length will therefore be the tuple
header (24 bytes) + 4 bytes for the integer + a 4-byte header for the
variable-length data + 1024 bytes of actual data.  That gives 1056
bytes, which is already MAXALIGNed.  I took the new comments added in
your latest version of the patch and merged them into the previous
version.  Kindly check whether I have missed anything while merging
the patch versions.
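
As a quick cross-check of that arithmetic, pg_column_size() on a
whole-row reference should report the same length (just an
illustration, not part of the test):

SELECT pg_column_size(t.*) AS tuple_bytes FROM fsm_check_size t LIMIT 1;
-- expect 1056 = 24 (tuple header) + 4 (int) + 4 (varlena header) + 1024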

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Feb 21, 2019 at 7:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> So here you are inserting a 4-byte integer and a 1024-byte
> variable-length value.  The tuple length will therefore be the tuple
> header (24 bytes) + 4 bytes for the integer + a 4-byte header for the
> variable-length data + 1024 bytes of actual data.  That gives 1056
> bytes, which is already MAXALIGNed.  I took the new comments added in
> your latest version of the patch and merged them into the previous
> version.  Kindly check whether I have missed anything while merging
> the patch versions.

OK, that makes sense. Looks fine to me.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Feb 21, 2019 at 6:39 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2019-Feb-21, Amit Kapila wrote:
>
> > On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >
> > > Please remember to keep serial_schedule in sync.
> >
> > I don't understand what you mean by this?  It is already present in
> > serial_schedule.  In parallel_schedule, we are just moving this test
> > to one of the parallel groups.  Do we need to take care of something
> > else?
>
> Just to make sure it's in the same relative position.
>

Okay, thanks for the input.  Attached, find the updated patch, will
try to commit it tomorrow unless you or someone else has any other
comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
I think this test is going to break on nonstandard block sizes.  While
we don't promise that all tests work on such installs (particularly
planner ones), it seems fairly easy to cope with this one -- just use a
record size expressed as a fraction of current_setting('block_size').
So instead of "1024" you'd write current_setting('block_size') / 8.
And then display the relation size in terms of pages, not bytes, so
divide pg_relation_size by block size.
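
Roughly like this, in terms of the existing test table (a sketch only;
note that current_setting() returns text, so an int cast is needed):

insert into fsm_check_size
  select i, rpad('', current_setting('block_size')::int / 8, 'a')
  from generate_series(1, 3) i;

select pg_relation_size('fsm_check_size', 'main')
       / current_setting('block_size')::int as heap_pages;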

(I see that there was already a motion for this, but it was dismissed
for lack of interest.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I think this test is going to break on nonstandard block sizes.  While
> we don't promise that all tests work on such installs (particularly
> planner ones),
>

The reason for not pushing much on making the test pass for
nonstandard block sizes is that when I tried existing tests, there
were already some failures.  For example, see the failures in the
attached regression diff files (with block_size set to 16K and 32K,
respectively).  I saw those failures during the previous
investigation; the situation on HEAD might or might not be exactly the
same.  I do see the value in trying to make sure that tests pass for
nonstandard block sizes, but that doesn't seem to be the practice for
all the tests.

> it seems fairly easy to cope with this one -- just use a
> record size expressed as a fraction of current_setting('block_size').
> So instead of "1024" you'd write current_setting('block_size') / 8.
> And then display the relation size in terms of pages, not bytes, so
> divide pg_relation_size by block size.
>

The idea sounds good.  John, would you like to give it a try?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Robert Haas
Date:
On Thu, Feb 21, 2019 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason for not pushing much on making the test pass for
> nonstandard block sizes is that when I tried existing tests, there
> were already some failures.

Sure, but let's not make things worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
On 2019-Feb-22, Amit Kapila wrote:

> On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > I think this test is going to break on nonstandard block sizes.  While
> > we don't promise that all tests work on such installs (particularly
> > planner ones),
> 
> The reason for not pushing much on making the test pass for
> nonstandard block sizes is that when I tried existing tests, there
> were already some failures.  For example, see the failures in the
> attached regression diff files (with block_size set to 16K and 32K,
> respectively).  I saw those failures during the previous
> investigation; the situation on HEAD might or might not be exactly
> the same.  I do see the value in trying to make sure that tests pass
> for nonstandard block sizes, but that doesn't seem to be the practice
> for all the tests.

Wow, there are a lot fewer tests failing there than I thought there
would be.  That increases hope that we can someday have them all pass.
+1 on not making things worse.

I think the crash in the amcheck test should be studied, one way or
another; CCing Peter.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Peter Geoghegan
Date:
On Fri, Feb 22, 2019 at 8:04 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Wow, there's a lot less tests failing there than I thought there would
> be.  That increases hope that we can someday have them pass.  +1 on not
> making things worse.
>
> I think the crash in the amcheck test should be studied, one way or
> another; CCing Peter.

I built Postgres with "--with-blocksize=16" and "--with-blocksize=32",
and tested amcheck with both builds. All tests passed.

I have a hard time imagining what the problem could be here. If there
was a problem with amcheck relying on there being an 8KiB block size
specifically, then it would almost certainly have been there since the
initial commit from March 2017. Not much has changed since then, and
the crash that Amit reported occurs at the earliest possible point.

I find it suspicious that there is another crash in pageinspect's
brin_page_items(), since like amcheck, pageinspect is a contrib module
that relies on BLCKSZ when allocating a local temp buffer.

-- 
Peter Geoghegan


Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
On 2019-Feb-22, Peter Geoghegan wrote:

> I find it suspicious that there is another crash in pageinspect's
> brin_page_items(), since like amcheck, pageinspect is a contrib module
> that relies on BLCKSZ when allocating a local temp buffer.

Ah.  Maybe they just weren't rebuilt.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I think this test is going to break on nonstandard block sizes.  While
> we don't promise that all tests work on such installs (particularly
> planner ones), it seems fairly easy to cope with this one -- just use a
> record size expressed as a fraction of current_setting('block_size').
> So instead of "1024" you'd write current_setting('block_size') / 8.
> And then display the relation size in terms of pages, not bytes, so
> divide pg_relation_size by block size.

I've done this for v6, tested on 16k block size.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> The reason for not pushing much on making the test pass for
> nonstandard block sizes is that when I tried existing tests, there
> were already some failures.

FWIW, I currently see 8 failures (attached).

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Alvaro Herrera
Date:
On 2019-Feb-23, John Naylor wrote:

> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > The reason for not pushing much on making the test pass for
> > nonstandard block sizes is that when I tried existing tests, there
> > were already some failures.
> 
> FWIW, I currently see 8 failures (attached).

Hmm, not great -- even the strings test fails, which seems to try to handle
the case explicitly.  I did expect the plan shape ones to fail, but I'm
surprised about the tablesample one.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Petr Jelinek
Date:
On 26/02/2019 16:20, Alvaro Herrera wrote:
> On 2019-Feb-23, John Naylor wrote:
> 
>> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> The reason for not pushing much on making the test pass for
>>> nonstandard block sizes is that when I tried existing tests, there
>>> were already some failures.
>>
>> FWIW, I currently see 8 failures (attached).
> 
> Hmm, not great -- even the strings test fails, which seems to try to handle
> the case explicitly.  I did expect the plan shape ones to fail, but I'm
> surprised about the tablesample one.
> 

The SYSTEM table sampling is basically per-page sampling, so it
depends heavily on which rows end up on which page.
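
For example (some_table is just a placeholder):

select count(*) from some_table tablesample system (10);

SYSTEM picks roughly 10% of the table's pages and returns every row on
them, so the result depends on how rows are laid out across pages;
BERNOULLI samples individual rows instead and is less sensitive to
block size.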

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Once we agree on the code, we need to test below scenarios:
> (a) upgrade from all supported versions to the latest version
> (b) upgrade standby with and without using rsync.

Although the code hasn't been reviewed yet, I went ahead and tested
(a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4
instance with the regression database and restored it to all supported
versions. To make it work with pg_upgrade, I first had to drop tables
with oids, drop functions referring to C libraries, and drop the
later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all
versions to HEAD with the patch applied. pg_upgrade worked without
error, and the following query returned 0 bytes on all the new
clusters:

select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
from pg_class where relkind in ('r', 't')
and pg_relation_size(oid, 'main') <= 4 * 8192;

The complementary query (> 4 * 8192) returned the same number of bytes
as in the original 9.4 instance.
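
Spelled out, the complementary query is:

select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
from pg_class where relkind in ('r', 't')
and pg_relation_size(oid, 'main') > 4 * 8192;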

For easy reference, the latest regression test patch, which is
intended to be independent of block size, is in [2].

[1] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CACPNZCsWa%3Ddd0K%2BFiODwM%3DLEsepAHVJCoSx2Avew%3DxBEX3Ywjw%40mail.gmail.com

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> >
> > I think this test is going to break on nonstandard block sizes.  While
> > we don't promise that all tests work on such installs (particularly
> > planner ones), it seems fairly easy to cope with this one -- just use a
> > record size expressed as a fraction of current_setting('block_size').
> > So instead of "1024" you'd write current_setting('block_size') / 8.
> > And then display the relation size in terms of pages, not bytes, so
> > divide pg_relation_size by block size.
>
> I've done this for v6, tested on 16k block size.
>

Thanks, the patch looks good to me.  I have additionally tested it
with 32K and 1K block sizes, and the test passes.  I will commit this
early next week.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Mon, Jan 28, 2019 at 2:33 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I think there is some value in using the information from
> > this function to skip fsm files, but the code doesn't appear to fit
> > well, how about moving this check to new function
> > new_cluster_needs_fsm()?
>
> For v21, new_cluster_needs_fsm() has all responsibility for obtaining
> the info it needs. I think this is much cleaner,
>

Right, now the code looks much better.

> but there is a small
> bit of code duplication since it now has to form the file name. One
> thing we could do is form the the base old/new file names in
> transfer_single_new_db() and pass those to transfer_relfile(), which
> will only add suffixes and segment numbers. We could then pass the
> base old file name to new_cluster_needs_fsm() and use it as is. Not
> sure if that's worthwhile, though.
>

I don't think it is worth it.

Few minor comments:
1.
warning C4715: 'new_cluster_needs_fsm': not all control paths return a value

Getting this new warning in the patch.

2.
+
+ /* Transfer any VM files if we can trust their
contents. */
  if (vm_crashsafe_match)

3. Can we add a note about this in the Notes section of pg_upgrade
documentation [1]?

This comment line doesn't seem to be related to this patch.  If so, I
think we can avoid having any additional change which is not related
to the functionality of this patch.  Feel free to submit it
separately, if you think it is an improvement.

Have you done any performance testing of this patch?  I mean to say
now that we added a new stat call for each table, we should see if
that has any impact.  Ideally, that should be compensated by the fact
that we are now not transferring *fsm files for small relations.  How
about constructing a test where all relations are greater than 4 pages
and then try to upgrade them.  We can check for a cluster with a
different number of relations say 10K, 20K, 50K, 100K.
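
For instance, something like this would create 10K relations that are
all above the threshold (a sketch only; table names and row counts are
arbitrary, each table ends up at roughly 8 pages):

DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    EXECUTE format('CREATE TABLE perf_tbl_%s (id int, pad text)', i);
    -- ~50 kB of tuples per table, comfortably above the 4-page threshold
    EXECUTE format('INSERT INTO perf_tbl_%s '
                   'SELECT g, repeat(''x'', 1000) '
                   'FROM generate_series(1, 50) g', i);
  END LOOP;
END $$;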

In general, the patch looks okay to me.  I would like to know if
anybody else has an opinion on whether pg_upgrade should skip
transferring fsm files for small relations or not.  Both John and I
think that it is a good feature to have, and now that the patch has
turned out to be simpler, I feel we can go ahead with this
optimization in pg_upgrade.

[1] - https://www.postgresql.org/docs/devel/pgupgrade.html

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Mar 6, 2019 at 5:19 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Once we agree on the code, we need to test below scenarios:
> > (a) upgrade from all supported versions to the latest version
> > (b) upgrade standby with and without using rsync.
>
> Although the code hasn't been reviewed yet, I went ahead and tested
> (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4
> instance with the regression database and restored it to all supported
> versions. To make it work with pg_upgrade, I first had to drop tables
> with oids, drop functions referring to C libraries, and drop the
> later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all
> versions to HEAD with the patch applied. pg_upgrade worked without
> error, and the following query returned 0 bytes on all the new
> clusters:
>
> select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
> from pg_class where relkind in ('r', 't')
> and pg_relation_size(oid, 'main') <= 4 * 8192;
>
> The complementary query (> 4 * 8192) returned the same number of bytes
> as in the original 9.4 instance.
>

Thanks, the tests you ran are quite useful.  I have given a few
comments in response to your previous email.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Mar 8, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> Few minor comments:
..
>
> 2.
> +
> + /* Transfer any VM files if we can trust their
> contents. */
>   if (vm_crashsafe_match)
>
> 3. Can we add a note about this in the Notes section of pg_upgrade
> documentation [1]?
>
> This comment line doesn't seem to be related to this patch.  If so, I
> think we can avoid having any additional change which is not related
> to the functionality of this patch.  Feel free to submit it
> separately, if you think it is an improvement.
>

Oops, I have messed up the above comments.  The paragraph starting
with "This comment line doesn't ..." applies to comment number 2.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Few minor comments:
> 1.
> warning C4715: 'new_cluster_needs_fsm': not all control paths return a value
>
> Getting this new warning in the patch.

Hmm, I don't get that in a couple of versions of gcc.  Your compiler
must not know that pg_fatal() cannot return.  I blindly added a fix.

> 2.
>
> This comment line doesn't seem to be related to this patch.  If so, I
> think we can avoid having any additional change which is not related
> to the functionality of this patch.  Feel free to submit it
> separately, if you think it is an improvement.

> +
> + /* Transfer any VM files if we can trust their contents. */
>   if (vm_crashsafe_match)

Well, I guess the current comment is still okay, so I've reverted
that change.  If I were to do a separate cleanup patch, I would rather
remove the vm_must_add_frozenbit parameter -- there's no reason I can
see for calls that transfer the heap and FSM to know about it.

I also changed references to the 'first segment of the main fork'
where there will almost always only be one segment. This was a vestige
of the earlier algorithm I had.

> 3. Can we add a note about this in the Notes section of pg_upgrade
> documentation [1]?

Done.

> Have you done any performance testing of this patch?  I mean to say
> now that we added a new stat call for each table, we should see if
> that has any impact.  Ideally, that should be compensated by the fact
> that we are now not transferring *fsm files for small relations.  How
> about constructing a test where all relations are greater than 4 pages
> and then try to upgrade them.  We can check for a cluster with a
> different number of relations say 10K, 20K, 50K, 100K.

I have not, but I agree it should be done. I will try to do so soon.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Mar 7, 2019 at 7:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
> >
> > On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > >
> > > I think this test is going to break on nonstandard block sizes.  While
> > > we don't promise that all tests work on such installs (particularly
> > > planner ones), it seems fairly easy to cope with this one -- just use a
> > > record size expressed as a fraction of current_setting('block_size').
> > > So instead of "1024" you'd write current_setting('block_size') / 8.
> > > And then display the relation size in terms of pages, not bytes, so
> > > divide pg_relation_size by block size.
> >
> > I've done this for v6, tested on 16k block size.
> >
>
> Thanks, the patch looks good to me.  I have additionally tested it
> with 32K and 1K block sizes, and the test passes.  I will commit this
> early next week.
>

Pushed this patch.  Last time, we saw a few portability issues with
this test.  Both John and I, with the help of others, tried to ensure
that there are no more such issues, but there is always a chance that
we missed something.  Anyway, I will keep an eye on the buildfarm to
see if there is any problem related to this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Sun, Mar 10, 2019 at 7:47 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Have you done any performance testing of this patch?  I mean to say
> > now that we added a new stat call for each table, we should see if
> > that has any impact.  Ideally, that should be compensated by the fact
> > that we are now not transferring *fsm files for small relations.  How
> > about constructing a test where all relations are greater than 4 pages
> > and then try to upgrade them.  We can check for a cluster with a
> > different number of relations say 10K, 20K, 50K, 100K.
>
> I have not, but I agree it should be done. I will try to do so soon.
>

Thanks, I will wait for your test results.  I believe this is the last
patch in this project and we should try to get it done soon.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> Have you done any performance testing of this patch?  I mean to say
> now that we added a new stat call for each table, we should see if
> that has any impact.  Ideally, that should be compensated by the fact
> that we are now not transferring *fsm files for small relations.

To be precise, it will only call stat if pg_class.relpages is below
the threshold. I suppose I could hack a database where all the
relpages values are wrong, but that seems like a waste of time.

> How
> about constructing a test where all relations are greater than 4 pages
> and then try to upgrade them.  We can check for a cluster with a
> different number of relations say 10K, 20K, 50K, 100K.

I did both greater and less than 4 pages for 10k relations. Since
pg_upgrade is O(# relations), I don't see a point in going higher.

First, I had a problem: On MacOS with their "gcc" wrapper around
clang, I got a segfault 11 when compiled with no debugging symbols. I
added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a
debugging build, I'm not sure how to investigate this. IIRC, this
doesn't happen for me on Linux gcc.

Since it was at least running now, I measured by putting
gettimeofday() calls around transfer_all_new_tablespaces(). I did 10
runs each and took the average, except for patch/1-page case since it
was obviously faster after a couple runs.
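
Roughly like this, around the transfer_all_new_tablespaces() call (a
sketch from memory; pg_log()/PG_REPORT are pg_upgrade's own logging
helpers):

#include <sys/time.h>

struct timeval t0, t1;

gettimeofday(&t0, NULL);
transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
                             old_cluster.pgdata, new_cluster.pgdata);
gettimeofday(&t1, NULL);

/* elapsed wall-clock seconds for the transfer */
pg_log(PG_REPORT, "transfer: %.3f s\n",
       (t1.tv_sec - t0.tv_sec) +
       (t1.tv_usec - t0.tv_usec) / 1000000.0);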

5 pages:
master    patch
5.59s     5.64s

The variation within the builds is up to +/- 0.2s, so there is no
difference, as expected.

1 page:
master    patch
5.62s     4.25s

Clearly, linking is much slower than stat.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Mar 13, 2019 at 4:57 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > Have you done any performance testing of this patch?  I mean to say
> > now that we added a new stat call for each table, we should see if
> > that has any impact.  Ideally, that should be compensated by the fact
> > that we are now not transferring *fsm files for small relations.
>
> To be precise, it will only call stat if pg_class.relpages is below
> the threshold. I suppose I could hack a database where all the
> relpages values are wrong, but that seems like a waste of time.
>

Right.

> > How
> > about constructing a test where all relations are greater than 4 pages
> > and then try to upgrade them.  We can check for a cluster with a
> > different number of relations say 10K, 20K, 50K, 100K.
>
> I did both greater and less than 4 pages for 10k relations. Since
> pg_upgrade is O(# relations), I don't see a point in going higher.
>
> First, I had a problem: On MacOS with their "gcc" wrapper around
> clang, I got a segfault 11 when compiled with no debugging symbols.
>

Did you get this problem with the patch or both with and without the
patch?  If it is only with patch, then we definitely need to
investigate.

> I
> added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a
> debugging build, I'm not sure how to investigate this. IIRC, this
> doesn't happen for me on Linux gcc.
>
> Since it was at least running now, I measured by putting
> gettimeofday() calls around transfer_all_new_tablespaces(). I did 10
> runs each and took the average, except for patch/1-page case since it
> was obviously faster after a couple runs.
>
> 5 pages:
> master    patch
> 5.59s     5.64s
>
> The variation within the builds is up to +/- 0.2s, so there is no
> difference, as expected.
>
> 1 page:
> master    patch
> 5.62s     4.25s
>
> Clearly, linking is much slower than stat.
>

The results are fine.  Thanks for doing the tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > First, I had a problem: On MacOS with their "gcc" wrapper around
> > clang, I got a segfault 11 when compiled with no debugging symbols.
> >
>
> Did you get this problem with the patch or both with and without the
> patch?  If it is only with patch, then we definitely need to
> investigate.

Only with the patch.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Wed, Mar 13, 2019 at 7:42 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > First, I had a problem: On MacOS with their "gcc" wrapper around
> > > clang, I got a segfault 11 when compiled with no debugging symbols.
> > >
> >
> > Did you get this problem with the patch or both with and without the
> > patch?  If it is only with patch, then we definitely need to
> > investigate.
>
> Only with the patch.
>

If the problem is reproducible, then I think we can try out a few
things to narrow it down or get some clue about the problem:
(a) modify new_cluster_needs_fsm() so that it always returns true as
its first statement (see the sketch below).  This will tell us whether
the code in that function has some problem.
(b) run it under Valgrind
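
For (a), the change would be as simple as this (a sketch only; the
parameter list is a guess, so adjust it to match the patch):

static bool
new_cluster_needs_fsm(FileNameMap *map)
{
    return true;        /* temporarily bypass the real check */

    /* ... the original body is unreachable while debugging ... */
}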

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
> [segfault problems]

This now seems spurious. I ran make distclean, git pull, reapplied the
patch (leaving out the gettimeofday() calls), and now my upgrade perf
test works with default compiler settings. Not sure what happened, but
hopefully we can move forward.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Mar 14, 2019 at 7:08 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> > [segfault problems]
>
> This now seems spurious. I ran make distclean, git pull, reapplied the
> patch (leaving out the gettimeofday() calls), and now my upgrade perf
> test works with default compiler settings. Not sure what happened, but
> hopefully we can move forward.
>

Yeah, I took another pass through the patch and changed minor things,
as you can see in the attached patch.

1. Added an Assert in new_cluster_needs_fsm() that the old cluster
version should be >= 804, as that is where FSM support was added
(shown below).
2. Reverted the old cluster version check to <= 1100.  There was
nothing wrong with the way you had written the check, but most other
places in the code use that style.
3. At one place, changed the function header to be consistent with the
nearby code, ran pgindent on the code, and changed the commit message.
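
For reference, the assertion in item 1 is essentially the following
(modulo the surrounding context in new_cluster_needs_fsm()):

/* FSM support was added in PostgreSQL 8.4 */
Assert(GET_MAJOR_VERSION(old_cluster.major_version) >= 804);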

Let me know what you think.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 1. Added an Assert in new_cluster_needs_fsm() that the old cluster
> version should be >= 804, as that is where FSM support was added.

There is already an explicit check for 804 in the caller, and the HEAD
code is already resilient to FSMs not existing, so I think this is
superfluous.

> 2. Reverted the old cluster version check to <= 1100.  There was
> nothing wrong with the way you had written the check, but most other
> places in the code use that style.
> 3. At one place, changed the function header to be consistent with
> the nearby code, ran pgindent on the code, and changed the commit
> message.

Looks good to me, thanks.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Mar 14, 2019 at 12:37 PM John Naylor
<john.naylor@2ndquadrant.com> wrote:
>
> On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 1. Added an Assert in new_cluster_needs_fsm() that the old cluster
> > version should be >= 804, as that is where FSM support was added.
>
> There is already an explicit check for 804 in the caller,
>

Yeah, I know that, but I have added it to prevent this function being
used elsewhere.  OTOH, maybe you are right that as per current code it
is superfluous, so we shouldn't add this assert.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Thu, Mar 14, 2019 at 7:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 14, 2019 at 12:37 PM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 1. Added an Assert in new_cluster_needs_fsm() that the old
> > > cluster version should be >= 804, as that is where FSM support
> > > was added.
> >
> > There is already an explicit check for 804 in the caller,
> >
>
> Yeah, I know that, but I have added it to prevent this function being
> used elsewhere.  OTOH, maybe you are right that as per current code it
> is superfluous, so we shouldn't add this assert.
>

I have committed the latest version of this patch.  I think we can
wait for a day or two to see if there are any complaints from the
buildfarm or elsewhere, and then we can close this CF entry.  IIRC,
this was the last patch in the series, right?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
John Naylor
Date:
On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I have committed the latest version of this patch.  I think we can
> wait for a day or two to see if there are any complaints from the
> buildfarm or elsewhere, and then we can close this CF entry.  IIRC,
> this was the last patch in the series, right?

Great, thanks! I'll keep an eye on the buildfarm as well.

I just spotted two comments in freespace.c that were true during
earlier patch revisions, but are no longer true, so I've attached a
fix for those. There are no other patches in the series.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Mar 15, 2019 at 3:40 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I have committed the latest version of this patch.  I think we can
> > wait for a day or two to see if there are any complaints from the
> > buildfarm or elsewhere, and then we can close this CF entry.  IIRC,
> > this was the last patch in the series, right?
>
> Great, thanks! I'll keep an eye on the buildfarm as well.
>
> I just spotted two comments in freespace.c that were true during
> earlier patch revisions, but are no longer true, so I've attached a
> fix for those.
>

LGTM, so committed.

> There are no other patches in the series.
>

Okay.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: WIP: Avoid creation of the free space map for small tables

From
Amit Kapila
Date:
On Fri, Mar 15, 2019 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I have committed the latest version of this patch.  I think we can
> wait for a day or two to see if there are any complaints from the
> buildfarm or elsewhere, and then we can close this CF entry.
>

Closed this CF entry.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com