Thread: WIP: Avoid creation of the free space map for small tables
Hi all, A while back, Robert Haas noticed that the space taken up by very small tables is dominated by the FSM [1]. Tom suggested that we could prevent creation of the FSM until the heap has reached a certain threshold size [2]. Attached is a WIP patch to implement that. I've also attached a SQL script to demonstrate the change in behavior for various scenarios. The behavior that allows for the simplest implementation I could think of is as follows: -The FSM isn't created if the heap has fewer than 10 blocks (or whatever). If the last known good block has insufficient space, try every block before extending the heap. -If a heap with a FSM is truncated back to below the threshold, the FSM stays around and can be used as usual. -If the heap tuples are all deleted, the FSM stays but has no leaf blocks (same as on master). Although it exists, it won't be re-extended until the heap re-passes the threshold. -- Some notes: -For normal mode, I taught fsm_set_and_search() to switch to a non-extending buffer call, but the biggest missing piece is WAL replay. I couldn't find a non-extending equivalent of XLogReadBufferExtended(), so I might have to create one. -There'll need to be some performance testing to make sure there's no regression, and to choose a good value for the threshold. I'll look into that, but if anyone has any ideas for tests, that'll help this effort along. -A possible TODO item is to teach pg_upgrade not to link FSMs for small heaps. I haven't looked into the feasibility of that, however. -RelationGetBufferForTuple() now has two boolean variables that mean "don't use the FSM", but with different behaviors. To avoid confusion, I've renamed use_fsm to always_extend and revised the commentary accordingly. -I've only implemented this for heaps, because indexes (at least B-tree) don't seem to be as eager to create a FSM. I haven't looked at the code, however. -- [1] https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com [2] https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us -- I'll add this to the November commitfest. -John Naylor
Attachment
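To make the proposed fallback concrete, here is a minimal standalone C sketch, not code from the patch, of the behavior in the first bullet above: for a heap below the threshold with no FSM, start at the last known good block and try every block before extending. The array, constants, and function name are invented for illustration; the real logic lives in hio.c and freespace.c and works through the buffer manager.

#include <stdio.h>

#define INVALID_BLOCK (-1)

/* Hypothetical per-block free space, in bytes, for a toy 4-block heap. */
static int freespace[] = {16, 80, 32, 200};
static int nblocks = 4;

/*
 * Toy version of the proposed fallback: with no FSM, start at the last
 * known good block (or the last block of the heap) and try every block
 * before giving up.  Returns a block with enough room, or INVALID_BLOCK
 * meaning "extend the heap".
 */
static int
get_block_without_fsm(int target, int needed)
{
    int tried;

    if (target == INVALID_BLOCK && nblocks > 0)
        target = nblocks - 1;               /* start with the last block */

    for (tried = 0; tried < nblocks; tried++)
    {
        if (freespace[target] >= needed)
            return target;
        target = (target + 1) % nblocks;    /* wrap around, try them all */
    }
    return INVALID_BLOCK;                   /* caller must extend the heap */
}

int
main(void)
{
    printf("need 100 bytes -> block %d\n", get_block_without_fsm(INVALID_BLOCK, 100));
    printf("need 500 bytes -> block %d\n", get_block_without_fsm(INVALID_BLOCK, 500));
    return 0;
}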
On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: > A while back, Robert Haas noticed that the space taken up by very > small tables is dominated by the FSM [1]. Tom suggested that we could > prevent creation of the FSM until the heap has reached a certain > threshold size [2]. Attached is a WIP patch to implement that. I've > also attached a SQL script to demonstrate the change in behavior for > various scenarios. Hi John, You'll need to tweak the test in contrib/pageinspect/sql/page.sql, because it's currently asserting that there is an FSM on a small table so make check-world fails. -- Thomas Munro http://www.enterprisedb.com
On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >> A while back, Robert Haas noticed that the space taken up by very >> small tables is dominated by the FSM [1]. Tom suggested that we could >> prevent creation of the FSM until the heap has reached a certain >> threshold size [2]. Attached is a WIP patch to implement that. I've >> also attached a SQL script to demonstrate the change in behavior for >> various scenarios. > > Hi John, > > You'll need to tweak the test in contrib/pageinspect/sql/page.sql, > because it's currently asserting that there is an FSM on a small table > so make check-world fails. Whoops, sorry about that; the attached patch passes make check-world. While looking into that, I also found a regression: If the cached target block is the last block in the relation and there is no free space, that block will be tried twice. That's been fixed as well. Thanks, -John Naylor
Attachment
John Naylor <jcnaylor@gmail.com> writes: > On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >>> A while back, Robert Haas noticed that the space taken up by very >>> small tables is dominated by the FSM [1]. Tom suggested that we could >>> prevent creation of the FSM until the heap has reached a certain >>> threshold size [2]. Attached is a WIP patch to implement that. BTW, don't we need a similar hack for visibility maps? regards, tom lane
On 10/7/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > John Naylor <jcnaylor@gmail.com> writes: >> On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >>>> A while back, Robert Haas noticed that the space taken up by very >>>> small tables is dominated by the FSM [1]. Tom suggested that we could >>>> prevent creation of the FSM until the heap has reached a certain >>>> threshold size [2]. Attached is a WIP patch to implement that. > > BTW, don't we need a similar hack for visibility maps? The FSM is the bigger bang for the buck, and fairly simple to do, but it would be nice to do something about VMs as well. I'm not sure if simply lacking a VM would be as simple (or as free of downsides) as for the FSM. I haven't studied the VM code in detail, however. -John Naylor
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: > > -There'll need to be some performance testing to make sure there's no > regression, and to choose a good value for the threshold. I'll look > into that, but if anyone has any ideas for tests, that'll help this > effort along. > Can you try with a COPY command which copies just enough tuples to fill the pages equivalent to HEAP_FSM_EXTENSION_THRESHOLD? It seems to me that in such a case the patch will try each of the blocks multiple times. It looks quite lame that we have to try, again and again, the blocks which we have just filled ourselves, but maybe that doesn't matter much as the threshold value is small. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: > > Hi all, > A while back, Robert Haas noticed that the space taken up by very > small tables is dominated by the FSM [1]. Tom suggested that we could > prevent creation of the FSM until the heap has reached a certain > threshold size [2]. Attached is a WIP patch to implement that. I've > also attached a SQL script to demonstrate the change in behavior for > various scenarios. > > The behavior that allows the simplest implementation I thought of is as follows: > > -The FSM isn't created if the heap has fewer than 10 blocks (or > whatever). If the last known good block has insufficient space, try > every block before extending the heap. > > -If a heap with a FSM is truncated back to below the threshold, the > FSM stays around and can be used as usual. > > -If the heap tuples are all deleted, the FSM stays but has no leaf > blocks (same as on master). Although it exists, it won't be > re-extended until the heap re-passes the threshold. > > -- > Some notes: > > -For normal mode, I taught fsm_set_and_search() to switch to a > non-extending buffer call, but the biggest missing piece is WAL > replay. > fsm_set_and_search() { .. + /* + * For heaps we prevent extension of the FSM unless the number of pages + * exceeds HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already + * have a FSM, this will save an inode and a few kB of space. + * For sane threshold values, the FSM address will be zero, so we + * don't bother dealing with anything else. + */ + if (rel->rd_rel->relkind == RELKIND_RELATION + && addr.logpageno == 0) I am not sure if this is a solid way to avoid creating the FSM. What if fsm_set_and_search gets called for a level other than 0? Also, when the relation has more blocks than HEAP_FSM_EXTENSION_THRESHOLD, then the first time vacuum tries to record the free space in a page, won't it skip recording free space for the first HEAP_FSM_EXTENSION_THRESHOLD pages? I think you have found a good way to avoid creating the FSM, but can't we use some simpler technique: if the FSM fork for a relation doesn't exist, then check the heapblk number for which we try to update the FSM, and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid creating the FSM. > I couldn't find a non-extending equivalent of > XLogReadBufferExtended(), so I might have to create one. > I think it would be better if we can find a common way to avoid creating the FSM both during DO and REDO time. It might be possible if something like what I have said above is feasible. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: >> -For normal mode, I taught fsm_set_and_search() to switch to a >> non-extending buffer call, but the biggest missing piece is WAL >> replay. >> > > fsm_set_and_search() > { > .. > + /* > + * For heaps we prevent extension of the FSM unless the number of pages > + * exceeds > HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already > + * have a FSM, this will save an inode and a few kB > of space. > + * For sane threshold values, the FSM address will be zero, so we > + * don't bother dealing with > anything else. > + */ > + if (rel->rd_rel->relkind == RELKIND_RELATION > + && addr.logpageno == 0) > > I am not sure if this is a solid way to avoid creating FSM. What if > fsm_set_and_search gets called for the level other than 0? Thanks for taking a look. As for levels other than 0, I think that only happens when fsm_set_and_search() is called by fsm_search(), which will not cause extension. > Also, > when the relation has blocks more than HEAP_FSM_EXTENSION_THRESHOLD, > then first time when vacuum will try to record the free space in the > page, won't it skip recording free space for first > HEAP_FSM_EXTENSION_THRESHOLD pages? Hmm, that's a good point. > I think you have found a good way to avoid creating FSM, but can't we > use some simpler technique like if the FSM fork for a relation doesn't > exist, then check the heapblk number for which we try to update the > FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid > creating the FSM. I think I see what you mean, but to avoid the vacuum problem you just mentioned, we'd need to check the relation size, too. I've attached an unpolished revision to do this. It seems to work, but I haven't tested the vacuum issue yet. I'll do that and some COPY performance testing in the next day or so. There's a bit more repetition than I would like, so I'm not sure it's simpler - perhaps RecordPageWithFreeSpace() could be turned into a wrapper around RecordAndGetPageWithFreeSpace(). Also new in this version, some non-functional improvements to hio.c: -debugging calls that are #ifdef'd out. -move some code out into a function instead of adding another goto. >> I couldn't find a non-extending equivalent of >> XLogReadBufferExtended(), so I might have to create one. >> > > I think it would be better if we can find a common way to avoid > creating FSM both during DO and REDO time. It might be possible if > somethin like what I have said above is feasible. That would be ideal. -John Naylor
Attachment
> On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think you have found a good way to avoid creating FSM, but can't we >> use some simpler technique like if the FSM fork for a relation doesn't >> exist, then check the heapblk number for which we try to update the >> FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid >> creating the FSM. >> I think it would be better if we can find a common way to avoid >> creating FSM both during DO and REDO time. It might be possible if >> somethin like what I have said above is feasible. I've attached v4, which implements the REDO case, and as closely as possible to the DO case. I've created a new function to guard against creation of the FSM, which is called by RecordPageWithFreeSpace() and RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace() takes a relfilenode and not a relation, I had to reimplement that separately, but the logic is basically the same. It works under streaming replication. I've also attached a couple SQL scripts which, when the aforementioned DEBUG1 calls are enabled, show what the heap insert code is doing for different scenarios. Make check-world passes. -John Naylor
Attachment
On Sun, Oct 14, 2018 at 1:09 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think you have found a good way to avoid creating FSM, but can't we > > use some simpler technique like if the FSM fork for a relation doesn't > > exist, then check the heapblk number for which we try to update the > > FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid > > creating the FSM. > > I think I see what you mean, but to avoid the vacuum problem you just > mentioned, we'd need to check the relation size, too. > Sure, but vacuum already has the relation size. In general, I think we should try to avoid adding more system calls in the common code path. It can impact performance. A few comments on your latest patch: - +static bool +allow_write_to_fsm(Relation rel, BlockNumber heapBlk) +{ + BlockNumber heap_nblocks; + + if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD || + rel->rd_rel->relkind != RELKIND_RELATION) + return true; + + /* XXX is this value cached? */ + heap_nblocks = RelationGetNumberOfBlocks(rel); + + if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD) + return true; + else + { + RelationOpenSmgr(rel); + return smgrexists(rel->rd_smgr, FSM_FORKNUM); + } +} I think you can avoid calling RelationGetNumberOfBlocks if you call smgrexists first, and for the purpose of vacuum, we can get that as an input parameter. I think one can argue for not changing the interface functions like RecordPageWithFreeSpace to avoid calling RelationGetNumberOfBlocks, but to me, it appears worthwhile to save the additional system call. - targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); - - /* - * If the FSM knows nothing of the rel, try the last page before we - * give up and extend. This avoids one-tuple-per-page syndrome during - * bootstrapping or in a recently-started system. - */ if (targetBlock == InvalidBlockNumber) - { - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); - - if (nblocks > 0) - targetBlock = nblocks - 1; - } + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, + &try_every_page); Is it possible to hide the magic of trying each block within GetPageWithFreeSpace? It will simplify the code, and in the future, if another storage API has a different function for RelationGetBufferForTuple, it will work seamlessly, provided they are using the same FSM. One such user is zheap. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > Few comments on your latest patch: > - > +static bool > +allow_write_to_fsm(Relation rel, BlockNumber heapBlk) > +{ > + BlockNumber heap_nblocks; > + > + if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD || > + rel->rd_rel->relkind != RELKIND_RELATION) > + return true; > + > + /* XXX is this value cached? */ > + heap_nblocks = RelationGetNumberOfBlocks(rel); > + > + if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD) > + return true; > + else > + { > + RelationOpenSmgr(rel); > + return smgrexists(rel->rd_smgr, FSM_FORKNUM); > + } > +} > > I think you can avoid calling RelationGetNumberOfBlocks, if you call > smgrexists before Okay, I didn't know which was cheaper, but I'll check smgrexists first. Thanks for the info. > and for the purpose of vacuum, we can get that as an > input parameter. I think one can argue for not changing the interface > functions like RecordPageWithFreeSpace to avoid calling > RelationGetNumberOfBlocks, but to me, it appears worth to save the > additional system call. I agree, and that should be fairly straightforward. I'll include that in the next patch. > - > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > - > - /* > - * If the FSM knows nothing of the rel, try the last page before we > - * give up and extend. This avoids one-tuple-per-page syndrome during > - * bootstrapping or in a recently-started system. > - */ > if (targetBlock == InvalidBlockNumber) > - { > - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); > - > - if (nblocks > 0) > - targetBlock = nblocks - 1; > - } > + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, > + &try_every_page); > > > Is it possible to hide the magic of trying each block within > GetPageWithFreeSpace? It will simplify the code and in future, if > another storage API has a different function for > RelationGetBufferForTuple, it will work seamlessly, provided they are > using same FSM. One such user is zheap. Hmm, here I'm a bit more skeptical about the trade offs. That would mean, in effect, to put a function called get_page_no_fsm() in the FSM code. ;-) I'm willing to be convinced otherwise, of course, but here's my reasoning: For one, we'd have to pass prevBlockNumber and &try_every_block to GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by extension), which are irrelevant to some callers. In addition, in hio.c, there is a call where we don't want to try any blocks that we have already, much less all of them: /* * Check if some other backend has extended a block for us while * we were waiting on the lock. */ targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); By the time we get to this call, we likely wouldn't trigger the logic to try every block, but I don't think we can guarantee that. We could add a boolean parameter that means "consider trying every block", but I don't think the FSM code should have so much state passed to it. Thanks for reviewing, -John Naylor
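As a sketch of the reordering agreed on here, the toy function below consults the existence of the FSM fork before falling back to the block count. The stub functions merely stand in for smgrexists() and RelationGetNumberOfBlocks(), and the relkind test from the posted allow_write_to_fsm() is left out; this illustrates only the check order, it is not the patch's code.

#include <stdbool.h>
#include <stdio.h>

#define HEAP_FSM_EXTENSION_THRESHOLD 8  /* placeholder value */

/* Stand-ins for backend state; in the patch these come from smgr/relcache. */
static bool fsm_fork_exists = false;
static unsigned int heap_nblocks = 5;

static bool stub_smgrexists(void) { return fsm_fork_exists; }
static unsigned int stub_number_of_blocks(void) { return heap_nblocks; }

/*
 * Reordered test: look at the FSM fork's existence first, and only fall
 * back to asking for the relation size when that doesn't settle it.
 */
static bool
fsm_allow_writes_sketch(unsigned int heapBlk)
{
    if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD)
        return true;            /* past the threshold: FSM is wanted anyway */

    if (stub_smgrexists())
        return true;            /* FSM already exists: keep using it */

    /* Last resort: is the heap itself already over the threshold? */
    return stub_number_of_blocks() > HEAP_FSM_EXTENSION_THRESHOLD;
}

int
main(void)
{
    printf("block 2, no FSM, 5-block heap: %d\n", fsm_allow_writes_sketch(2));
    fsm_fork_exists = true;
    printf("block 2, FSM fork exists:      %d\n", fsm_allow_writes_sketch(2));
    return 0;
}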
On Mon, Oct 15, 2018 at 4:09 PM John Naylor <jcnaylor@gmail.com> wrote: > On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few comments on your latest patch: > > - > > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > - > > - /* > > - * If the FSM knows nothing of the rel, try the last page before we > > - * give up and extend. This avoids one-tuple-per-page syndrome during > > - * bootstrapping or in a recently-started system. > > - */ > > if (targetBlock == InvalidBlockNumber) > > - { > > - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); > > - > > - if (nblocks > 0) > > - targetBlock = nblocks - 1; > > - } > > + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, > > + &try_every_page); > > > > > > Is it possible to hide the magic of trying each block within > > GetPageWithFreeSpace? It will simplify the code and in future, if > > another storage API has a different function for > > RelationGetBufferForTuple, it will work seamlessly, provided they are > > using same FSM. One such user is zheap. > > Hmm, here I'm a bit more skeptical about the trade offs. That would > mean, in effect, to put a function called get_page_no_fsm() in the FSM > code. ;-) I'm willing to be convinced otherwise, of course, but > here's my reasoning: > > For one, we'd have to pass prevBlockNumber and &try_every_block to > GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by > extension), which are irrelevant to some callers. > I think we can avoid using prevBlockNumber and try_every_block, if we maintain a small cache which tells whether a particular block has been tried or not. What I am envisioning is that while finding the block with free space, if we come to know that the relation in question is small enough that it doesn't have a FSM, we can perform a local_search. In local_search, we can enquire the cache for any block that we can try, and if we find one, we can try inserting into that block; otherwise, we need to extend the relation. One simple way to imagine such a cache would be an array of structures, where each structure has blkno and status fields. After we get a usable block, we need to clear the cache, if it exists. > In addition, in > hio.c, there is a call where we don't want to try any blocks that we > have already, much less all of them: > > /* > * Check if some other backend has extended a block for us while > * we were waiting on the lock. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > By the time we get to this call, we likely wouldn't trigger the logic > to try every block, but I don't think we can guarantee that. > I think the current code as proposed has that limitation, but if we have a cache, then we can check whether the relation has actually been extended after taking the lock, and if so, we can try only the newly added block(s). I am not completely sure that the idea described above is better, but it seems that it will be clean and can handle some of the cases with ease. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
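A minimal standalone sketch of the cache Amit describes follows: an array of entries carrying a block number and a status, consulted during a local search when the relation has no FSM. All of the names and the cache size are invented for this illustration; it is not code from the thread.

#include <stdio.h>

#define LOCAL_CACHE_SIZE 8              /* covers a small, FSM-less heap */

typedef enum { CACHE_UNUSED, CACHE_UNTRIED, CACHE_TRIED } cache_status;

typedef struct
{
    unsigned int blkno;
    cache_status status;
} local_cache_entry;

static local_cache_entry cache[LOCAL_CACHE_SIZE];

/* Fill the cache with every block of the (small) relation, all untried. */
static void
cache_fill(unsigned int nblocks)
{
    for (unsigned int i = 0; i < LOCAL_CACHE_SIZE; i++)
    {
        cache[i].blkno = i;
        cache[i].status = (i < nblocks) ? CACHE_UNTRIED : CACHE_UNUSED;
    }
}

/* Return an untried block to attempt an insertion into, or -1 to extend. */
static int
cache_next_candidate(void)
{
    for (unsigned int i = 0; i < LOCAL_CACHE_SIZE; i++)
        if (cache[i].status == CACHE_UNTRIED)
            return (int) cache[i].blkno;
    return -1;
}

/* Remember that a block was tried and had no room. */
static void
cache_mark_tried(unsigned int blkno)
{
    cache[blkno].status = CACHE_TRIED;
}

int
main(void)
{
    cache_fill(3);                      /* a 3-block relation with no FSM */
    cache_mark_tried(0);
    cache_mark_tried(1);
    printf("next candidate: %d\n", cache_next_candidate());    /* 2 */
    cache_mark_tried(2);
    printf("next candidate: %d\n", cache_next_candidate());    /* -1: extend */
    return 0;
}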
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think you can avoid calling RelationGetNumberOfBlocks, if you call > smgrexists before This is done in the attached v5, 0001. > and for the purpose of vacuum, we can get that as an > input parameter. I think one can argue for not changing the interface > functions like RecordPageWithFreeSpace to avoid calling > RelationGetNumberOfBlocks, but to me, it appears worth to save the > additional system call. This is done in 0002. I also added a check for the cached value of pg_class.relpages, since it's cheap and may help non-VACUUM callers. > [proposal for a cache of blocks to try] That's interesting. I'll have to do some reading elsewhere in the codebase, and then I'll follow up. Thanks, -John Naylor
Attachment
On Tue, Oct 16, 2018 at 4:27 PM John Naylor <jcnaylor@gmail.com> wrote: > > [proposal for a cache of blocks to try] > > That's interesting. I'll have to do some reading elsewhere in the > codebase, and then I'll follow up. > Thanks, I have changed the status of this patch as "Waiting on Author". Feel free to change it once you have a new patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think we can avoid using prevBlockNumber and try_every_block, if we > maintain a small cache which tells whether the particular block is > tried or not. What I am envisioning is that while finding the block > with free space, if we came to know that the relation in question is > small enough that it doesn't have FSM, we can perform a local_search. > In local_seach, we can enquire the cache for any block that we can try > and if we find any block, we can try inserting in that block, > otherwise, we need to extend the relation. One simple way to imagine > such a cache would be an array of structure and structure has blkno > and status fields. After we get the usable block, we need to clear > the cache, if exists. Here is the design I've implemented in the attached v6. There is more code than v5, but there's a cleaner separation between freespace.c and hio.c, as you preferred. I also think it's more robust. I've expended some effort to avoid doing unnecessary system calls to get the number of blocks. -- For the local, in-memory map, maintain a static array of status markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block number. This is populated every time we call GetPageWithFreeSpace() on small tables with no FSM. The statuses are 'zero' (beyond the relation) 'available to try' 'tried already' Example for a 4-page heap: 01234567 AAAA0000 If we try block 3 and there is no space, we set it to 'tried' and next time through the loop we'll try 2, etc: 01234567 AAAT0000 If we try all available blocks, we will extend the relation. As in the master branch, first we call GetPageWithFreeSpace() again to see if another backend extended the relation to 5 blocks while we were waiting for the lock. If we find a new block, we will mark the new block available and leave the rest alone: 01234567 TTTTA000 On the odd chance we still can't insert into the new block, we'll skip checking any others and we'll redo the logic to extend the relation. If we're about to successfully return a buffer, whether from an existing block, or by extension, we clear the local map. Once this is in shape, I'll do some performance testing. -John Naylor
Attachment
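The following standalone C sketch models the local map just described and reproduces the example transitions (AAAA0000, AAAT0000, TTTTA000). The constant and status names echo the ones mentioned for the v6 patch, but this is a simplified model: the real map lives in freespace.c and is consulted and cleared around buffer access.

#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 8

/* Status flags, echoing the ones described for the v6 patch. */
#define FSM_LOCAL_ZERO  0x00    /* beyond the end of the relation */
#define FSM_LOCAL_AVAIL 0x01    /* available to try */
#define FSM_LOCAL_TRIED 0x02    /* already tried, not enough space */

static unsigned char local_map[HEAP_FSM_CREATION_THRESHOLD];

/* Mark every existing block available, leaving "tried" marks in place. */
static void
local_map_set(unsigned int nblocks)
{
    for (unsigned int blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
    {
        if (blkno < nblocks)
            local_map[blkno] |= FSM_LOCAL_AVAIL;    /* OR keeps TRIED bits */
        else
            local_map[blkno] = FSM_LOCAL_ZERO;
    }
}

/* Pick a block that is available and not yet tried, last block first. */
static int
local_map_search(void)
{
    for (int blkno = HEAP_FSM_CREATION_THRESHOLD - 1; blkno >= 0; blkno--)
        if ((local_map[blkno] & FSM_LOCAL_AVAIL) &&
            !(local_map[blkno] & FSM_LOCAL_TRIED))
            return blkno;
    return -1;                  /* nothing left: caller extends the heap */
}

static void
local_map_print(const char *label)
{
    printf("%-24s", label);
    for (int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
    {
        if (local_map[i] & FSM_LOCAL_TRIED)
            putchar('T');
        else if (local_map[i] & FSM_LOCAL_AVAIL)
            putchar('A');
        else
            putchar('0');
    }
    putchar('\n');
}

int
main(void)
{
    local_map_set(4);                   local_map_print("4-block heap:");
    local_map[3] |= FSM_LOCAL_TRIED;    local_map_print("block 3 had no room:");
    local_map[2] |= FSM_LOCAL_TRIED;
    local_map[1] |= FSM_LOCAL_TRIED;
    local_map[0] |= FSM_LOCAL_TRIED;    local_map_print("all four were full:");
    local_map_set(5);                   local_map_print("someone extended to 5:");
    printf("next block to try: %d\n", local_map_search());
    return 0;
}

Compiled on its own, this prints the same 8-character maps as the diagrams above and ends with block 4 as the next candidate.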
I wrote: > Once this is in shape, I'll do some performance testing. On second thought, there's no point in waiting, especially if a regression points to a design flaw. I compiled patched postgres with HEAP_FSM_CREATION_THRESHOLD set to 32, then ran the attached script which populates 100 tables with varying numbers of blocks. I wanted a test that created pages eagerly and wrote to disk as little as possible. Config was stock, except for fsync = off. I took the average of 10 runs after removing the slowest and fastest run:

# blocks   master    patch
       4   36.4ms    33.9ms
       8   50.6ms    48.9ms
      12   58.6ms    66.3ms
      16   65.5ms    81.4ms

It seems under these circumstances a threshold of up to 8 performs comparably to the master branch, with small block numbers possibly faster than with the FSM, provided they're in shared buffers already. I didn't bother testing higher values because it's clear there's a regression starting around 10 or so, beyond which it helps to have the FSM. A case could be made for setting the threshold to 4, since not having 3 blocks of FSM in shared buffers exactly makes up for the 3 other blocks of heap that are checked when free space runs out. I can run additional tests if there's interest. -John Naylor
Attachment
Upthread I wrote: > -A possible TODO item is to teach pg_upgrade not to link FSMs for > small heaps. I haven't looked into the feasibility of that, however. This turned out to be relatively lightweight (0002 attached). I had to add relkind to the RelInfo struct and save the size of each heap as it's transferred. The attached SQL script will set up a couple of test cases to demonstrate pg_upgrade. Installations with large numbers of small tables will be able to see space savings right away. For 0001, I adjusted the README and docs, and also made some cosmetic improvements to the code, mostly in the comments. I've set the commitfest entry back to 'needs review'. One thing I noticed is that one behavior on master hasn't changed: System catalogs created during bootstrap still have a FSM if they have any data. Possibly related, catalogs also have a VM even if they have no data at all. This isn't anything to get excited about, but it would be nice to investigate, at least so it can be documented. A cursory dig hasn't found the cause, but I'll keep doing that as time permits. -John Naylor
Attachment
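As a rough illustration of the pg_upgrade change, here is a toy decision function for whether to transfer a relation's FSM fork, based on relkind and the size of the main fork. It is not pg_upgrade code: the threshold value is a placeholder, the function name is invented, and the actual patch carries this information in the RelInfo struct during the file transfer.

#include <stdbool.h>
#include <stdio.h>

#define BLCKSZ 8192
#define HEAP_FSM_CREATION_THRESHOLD 4   /* placeholder threshold, in blocks */

/*
 * Toy decision: during an upgrade-style transfer, carry the FSM fork over
 * only when the relation is not a small heap.  'relkind' uses the catalog
 * letters ('r' = ordinary table), and main_fork_bytes is the size of the
 * relation's main fork in the old cluster.
 */
static bool
transfer_fsm_fork(char relkind, long long main_fork_bytes)
{
    long long nblocks = main_fork_bytes / BLCKSZ;

    if (relkind != 'r')
        return true;            /* only heaps are subject to the threshold */

    return nblocks > HEAP_FSM_CREATION_THRESHOLD;
}

int
main(void)
{
    printf("3-block table:   %s\n",
           transfer_fsm_fork('r', 3 * BLCKSZ) ? "link FSM" : "skip FSM");
    printf("100-block table: %s\n",
           transfer_fsm_fork('r', 100LL * BLCKSZ) ? "link FSM" : "skip FSM");
    return 0;
}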
On Wed, Oct 31, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > Upthread I wrote: > > > -A possible TODO item is to teach pg_upgrade not to link FSMs for > > small heaps. I haven't look into the feasibility of that, however. > > This turned out to be relatively light weight (0002 attached). I had > to add relkind to the RelInfo struct and save the size of each heap as > its transferred. The attached SQL script will setup a couple test > cases to demonstrate pg_upgrade. Installations with large numbers of > small tables will be able to see space savings right away. > > For 0001, I adjusted the README and docs, and also made some cosmetic > improvements to the code, mostly in the comments. I've set the > commitfest entry back to 'needs review' > > One thing I noticed is that one behavior on master hasn't changed: > System catalogs created during bootstrap still have a FSM if they have > any data. Possibly related, catalogs also have a VM even if they have > no data at all. This isn't anything to get excited about, but it would > be nice to investigate, at least so it can be documented. A cursory > dig hasn't found the cause, but I'll keep doing that as time permits. > Thanks for your work on this, I will try to review it during CF. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote: > A case could be made for setting the threshold to 4, since not having > 3 blocks of FSM in shared buffers exactly makes up for the 3 other > blocks of heap that are checked when free space runs out. That doesn't seem like an unreasonable argument. I'm not sure whether the right threshold is 4 or something a little bigger, but I bet it's not very large. It seems important to me that before anybody thinks about committing this, we construct some kind of destruction case where repeated scans of the whole table are triggered as frequently as possible, and then run that test with varying thresholds. I might be totally wrong, but I bet with a value as large as 32 you will be able to find cases where it regresses in a big way. We also need to think about what happens on the standby, where the FSM is updated in a fairly different way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 31, 2018 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote: > > A case could be made for setting the threshold to 4, since not having > > 3 blocks of FSM in shared buffers exactly makes up for the 3 other > > blocks of heap that are checked when free space runs out. > > That doesn't seem like an unreasonable argument. I'm not sure whether > the right threshold is 4 or something a little bigger, but I bet it's > not very large. It seems important to me that before anybody thinks > about committing this, we construct some kind of destruction case > where repeated scans of the whole table are triggered as frequently as > possible, and then run that test with varying thresholds. > Why do you think repeated scans will be a destruction case when there is no FSM for a small table? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > That doesn't seem like an unreasonable argument. I'm not sure whether > > the right threshold is 4 or something a little bigger, but I bet it's > > not very large. It seems important to me that before anybody thinks > > about committing this, we construct some kind of destruction case > > where repeated scans of the whole table are triggered as frequently as > > possible, and then run that test with varying thresholds. > > Why do you think repeated scans will be a destruction case when there > is no FSM for a small table? That's not what I'm saying. If we don't have the FSM, we have to check every page of the table. If there's a workload where that happens a lot on a table that is just under the size threshold for creating the FSM, then it's likely to be a worst case for this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > That's not what I'm saying. If we don't have the FSM, we have to > check every page of the table. If there's a workload where that > happens a lot on a table that is just under the size threshold for > creating the FSM, then it's likely to be a worst case for this patch. Hmm, you're assuming something not in evidence: why would that be the algorithm? On a FSM-less table, I'd be inclined to just check the last page and then grow the table if the tuple doesn't fit there. This would, in many cases, soon result in a FSM being created, but I think that's just fine. The point of the change is to optimize for cases where a table *never* gets more than a few inserts. Not, IMO, for cases where a table gets a lot of churn but never has a whole lot of live tuples. In the latter scenario we are far better off having a FSM. regards, tom lane
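For comparison with the try-every-block behavior in the patch, here is a tiny standalone sketch of the alternative Tom describes: with no FSM, look only at the last page, and otherwise extend. The toy array and function name are invented for illustration.

#include <stdio.h>

#define INVALID_BLOCK (-1)

/* Free space per block of a toy 3-block heap. */
static int freespace[] = {120, 40, 16};
static int nblocks = 3;

/*
 * Last-page-only policy: if the tuple doesn't fit on the last page, tell
 * the caller to extend the relation (which, for a table that keeps
 * growing, soon pushes it past the threshold so a FSM gets created).
 */
static int
last_page_only(int needed)
{
    if (nblocks > 0 && freespace[nblocks - 1] >= needed)
        return nblocks - 1;
    return INVALID_BLOCK;       /* extend the relation */
}

int
main(void)
{
    printf("need 10 bytes  -> block %d\n", last_page_only(10));
    printf("need 100 bytes -> block %d\n", last_page_only(100));
    return 0;
}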
On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > That's not what I'm saying. If we don't have the FSM, we have to > > check every page of the table. If there's a workload where that > > happens a lot on a table that is just under the size threshold for > > creating the FSM, then it's likely to be a worst case for this patch. > > Hmm, you're assuming something not in evidence: why would that be the > algorithm? I think it's in evidence, in the form of several messages mentioning a flag called try_every_block. Just checking the last page of the table doesn't sound like a good idea to me. I think that will just lead to a lot of stupid bloat. It seems likely that checking every page of the table is fine for npages <= 3, and that would still be a win in a very significant number of cases, since lots of instances have many empty or tiny tables. I was merely reacting to the suggestion that the approach should be used for npages <= 32; that threshold sounds way too high. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > That's not what I'm saying. If we don't have the FSM, we have to >> > check every page of the table. If there's a workload where that >> > happens a lot on a table that is just under the size threshold for >> > creating the FSM, then it's likely to be a worst case for this patch. >> >> Hmm, you're assuming something not in evidence: why would that be the >> algorithm? > > I think it's in evidence, in the form of several messages mentioning a > flag called try_every_block. Correct. > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be a win in a very significant number of > cases, since lots of instances have many empty or tiny tables. I was > merely reacting to the suggestion that the approach should be used for > npages <= 32; that threshold sounds way too high. To be clear, no one suggested that. The patch has always had 8 or 10 as a starting point, and I've mentioned 4 and 8 as good possibilities based on the COPY tests upthread. It was apparent that I didn't need to recompile a bunch of binaries with different thresholds. All I had to do was compile with a threshold much larger than required, and then test inserting into X number of pages, to simulate a threshold of X. I increased X until I saw a regression. That's where the 32 came from. Sorry if that was misleading; in my head it was obvious. I'd be happy to test other scenarios. I'm not sure how to test redo -- it seems more difficult to get meaningful results than in the normal case. -John Naylor
On Fri, Nov 2, 2018 at 7:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > That doesn't seem like an unreasonable argument. I'm not sure whether > > > the right threshold is 4 or something a little bigger, but I bet it's > > > not very large. It seems important to me that before anybody thinks > > > about committing this, we construct some kind of destruction case > > > where repeated scans of the whole table are triggered as frequently as > > > possible, and then run that test with varying thresholds. > > > > Why do you think repeated scans will be a destruction case when there > > is no FSM for a small table? > > That's not what I'm saying. If we don't have the FSM, we have to > check every page of the table. If there's a workload where that > happens a lot on a table that is just under the size threshold for > creating the FSM, then it's likely to be a worst case for this patch. > That makes sense and this is the first thing I was also worried about after looking at the initial patch and suggested a test [1] which can hit the worst case. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BhP-jGYWi25-1QMedxeM_0H01s%3D%3D4-t74oEgL2EDVicw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 2, 2018 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > That's not what I'm saying. If we don't have the FSM, we have to > > check every page of the table. If there's a workload where that > > happens a lot on a table that is just under the size threshold for > > creating the FSM, then it's likely to be a worst case for this patch. > > Hmm, you're assuming something not in evidence: why would that be the > algorithm? On a FSM-less table, I'd be inclined to just check the > last page and then grow the table if the tuple doesn't fit there. > This would, in many cases, soon result in a FSM being created, but > I think that's just fine. The point of the change is to optimize > for cases where a table *never* gets more than a few inserts. Not, IMO, > for cases where a table gets a lot of churn but never has a whole lot of > live tuples. In the latter scenario we are far better off having a FSM. > In the past, you seem to have suggested an approach to try each block [1] for small tables which don't have FSM. I think if we do what you are suggesting now, then we don't need to worry much about any regression and code will be somewhat simpler, but OTOH, I don't see much harm in trying every block if we keep the threshold as no more than 4. That would address more cases. [1] - https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote: > I think it's in evidence, in the form of several messages mentioning a > flag called try_every_block. > > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be a win in a very significant number of > cases, since lots of instances have many empty or tiny tables. I was > merely reacting to the suggestion that the approach should be used for > npages <= 32; that threshold sounds way too high. It seems to me that it would be costly for schemas which have one core table with a couple of records used in many joins with other queries. Imagine for example a core table like that: CREATE TABLE us_states (id serial, initials varchar(2)); INSERT INTO us_states VALUES (DEFAULT, 'CA'); If there is a workload where those initials need to be fetched a lot, this patch could cause a loss. It looks hard to me to put a straight number on when not having the FSM is better than having it, because that could be environment-dependent, so there is an argument for making the default very low but still configurable? -- Michael
Attachment
On Sun, Nov 4, 2018 at 5:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote: > > I think it's in evidence, in the form of several messages mentioning a > > flag called try_every_block. > > > > Just checking the last page of the table doesn't sound like a good > > idea to me. I think that will just lead to a lot of stupid bloat. It > > seems likely that checking every page of the table is fine for npages > > <= 3, and that would still be a win in a very significant number of > > cases, since lots of instances have many empty or tiny tables. I was > > merely reacting to the suggestion that the approach should be used for > > npages <= 32; that threshold sounds way too high. > > It seems to me that it would be costly for schemas which have one core > table with a couple of records used in many joins with other queries. > Imagine for example a core table like that: > CREATE TABLE us_states (id serial, initials varchar(2)); > INSERT INTO us_states VALUES (DEFAULT, 'CA'); > > If there is a workload where those initials need to be fetched a lot, > this patch could cause a loss. > How would fetching alone cause any loss? If it gets updated, then there is a chance that we might have some performance impact. > It looks hard to me to put a straight > number on when not having the FSM is better than having it because that > could be environment-dependent, so there is an argument for making the > default very low, still configurable? > I think 3 or 4 as a threshold should work fine (though we need to thoroughly test that) as we will anyway avoid having three additional pages of FSM for such tables. I am not sure how easy it would be for users to set this value if we make it configurable, or on what basis they would configure it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/31/18, Robert Haas <robertmhaas@gmail.com> wrote: > It seems important to me that before anybody thinks > about committing this, we construct some kind of destruction case > where repeated scans of the whole table are triggered as frequently as > possible, and then run that test with varying thresholds. I might be > totally wrong, but I bet with a value as large as 32 you will be able > to find cases where it regresses in a big way. Here's an attempt at a destruction case: Lobotomize the heap insert logic such that it never checks the cached target block and has to call the free space logic for every single insertion, like this:

index ff13c03083..5d5b36af29 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -377,7 +377,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	else if (bistate && bistate->current_buf != InvalidBuffer)
 		targetBlock = BufferGetBlockNumber(bistate->current_buf);
 	else
-		targetBlock = RelationGetTargetBlock(relation);
+		targetBlock = InvalidBlockNumber;
 
 	if (targetBlock == InvalidBlockNumber && use_fsm)
 	{

(with the threshold patch I had to do additional work) With the small tuples used in the attached v2 test, this means the free space logic is called ~225 times per block. The test tables are pre-filled with one tuple and vacuumed so that the FSMs are already created when testing the master branch. The patch branch is compiled with a threshold of 8, but testing inserts of 4 pages will effectively simulate a threshold of 4, etc. As before, trimmed average of 10 runs, loading to 100 tables each:

# blocks   master    patch
 2         25.1ms    30.3ms
 4         40.7ms    48.1ms
 6         56.6ms    64.7ms
 8         73.1ms    82.0ms

Without this artificial penalty, the 8 block case was about 50ms for both branches. So if I calculated right, of that 50 ms, master is spending ~0.10ms looking for free space, and the patch is spending about ~0.15ms. So, from that perspective, the difference is trivial. Of course, this is a single client, so not entirely realistic. I think that shared buffer considerations are most important for deciding the threshold. > We also need to think about what happens on the standby, where the FSM > is updated in a fairly different way. Were you referring to performance or just functionality? Because the threshold works on the standby, but I don't know about the performance there. -John Naylor
Attachment
On 11/2/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > On a FSM-less table, I'd be inclined to just check the > last page and then grow the table if the tuple doesn't fit there. > This would, in many cases, soon result in a FSM being created, but > I think that's just fine. The point of the change is to optimize > for cases where a table *never* gets more than a few inserts. Not, IMO, > for cases where a table gets a lot of churn but never has a whole lot of > live tuples. In the latter scenario we are far better off having a FSM. and... On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote: > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be win in a very significant number of > cases, I see the merit of both of these arguments, and it occurred to me that there is middle ground between checking only the last page and checking every page: Check the last 3 pages and set the threshold to 6. That way, with npages <= 3, every page will be checked. In the unlikely case that npages = 6 and the first 3 pages are all wasted space, that's the amount of space that would have gone to the FSM anyway, and the relation will likely grow beyond the threshold soon, at which point the free space will become visible again. -John Naylor
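A standalone sketch of this middle ground, assuming the rule is "scan backwards over at most the last 3 pages": every page of a heap with 3 or fewer pages is still checked, while older pages of a larger FSM-less heap are skipped. The constants and data here are illustrative only.

#include <stdio.h>

#define INVALID_BLOCK  (-1)
#define PAGES_TO_CHECK 3        /* "check the last 3 pages" */

/* Free space per block of a toy 5-block heap. */
static int freespace[] = {300, 10, 20, 15, 25};
static int nblocks = 5;

/*
 * With no FSM, look only at the last PAGES_TO_CHECK pages (which is
 * every page for a heap of 3 or fewer pages) before asking the caller
 * to extend.  With a creation threshold of 6, at most three older pages
 * can be temporarily invisible to inserts.
 */
static int
check_recent_pages(int needed)
{
    int limit = (nblocks > PAGES_TO_CHECK) ? nblocks - PAGES_TO_CHECK : 0;

    for (int blkno = nblocks - 1; blkno >= limit; blkno--)
        if (freespace[blkno] >= needed)
            return blkno;
    return INVALID_BLOCK;       /* extend the relation */
}

int
main(void)
{
    printf("need 20 bytes  -> block %d\n", check_recent_pages(20));
    printf("need 200 bytes -> block %d\n", check_recent_pages(200));  /* block 0 is never looked at */
    return 0;
}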
On Mon, Oct 22, 2018 at 12:14 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think we can avoid using prevBlockNumber and try_every_block, if we > > maintain a small cache which tells whether the particular block is > > tried or not. What I am envisioning is that while finding the block > > with free space, if we came to know that the relation in question is > > small enough that it doesn't have FSM, we can perform a local_search. > > In local_seach, we can enquire the cache for any block that we can try > > and if we find any block, we can try inserting in that block, > > otherwise, we need to extend the relation. One simple way to imagine > > such a cache would be an array of structure and structure has blkno > > and status fields. After we get the usable block, we need to clear > > the cache, if exists. > > Here is the design I've implemented in the attached v6. There is more > code than v5, but there's a cleaner separation between freespace.c and > hio.c, as you preferred. > This approach seems better. > I also think it's more robust. I've expended > some effort to avoid doing unnecessary system calls to get the number > of blocks. > -- > > For the local, in-memory map, maintain a static array of status > markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block > number. This is populated every time we call GetPageWithFreeSpace() on > small tables with no FSM. The statuses are > > 'zero' (beyond the relation) > 'available to try' > 'tried already' > +/* Status codes for the local map. */ +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ Instead of maintaining three states, can't we do with two states (Available and Not Available), basically combine 0 and 2 in your case. I think it will save some cycles in fsm_local_set, where each time you need to initialize all the entries in the map. I think we can argue that it is not much overhead, but I think it is better code-wise also if we can make it happen with fewer states. Some assorted comments: 1. <para> -Each heap and index relation, except for hash indexes, has a Free Space Map +Each heap relation, unless it is very small, and each index relation, +except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation. It's stored It appears that line has ended abruptly. 2. page = BufferGetPage(buffer); + targetBlock = BufferGetBlockNumber(buffer); if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", - BufferGetBlockNumber(buffer), + targetBlock, RelationGetRelationName(relation)); PageInit(page, BufferGetPageSize(buffer), 0); @@ -623,7 +641,18 @@ loop: * current backend to make more insertions or not, which is probably a * good bet most of the time. So for now, don't add it to FSM yet. */ - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); + RelationSetTargetBlock(relation, targetBlock); Is this related to this patch? If not, I suggest let's do it separately if required. 3. static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, - uint8 newValue, uint8 minValue); + uint8 newValue, uint8 minValue); This appears to be a spurious change. 4. @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size len, * target. 
*/ targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * In case we used an in-memory map of available blocks, reset + * it for next use. + */ + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) + ClearLocalMap(); How will you clear the local map during error? I think you need to clear it in abort path and you can name the function as FSMClearLocalMap or something like that. 5. +/*#define TRACE_TARGETBLOCK */ Debugging leftover, do you want to retain this and related stuff during the development of patch? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > +/* Status codes for the local map. */ > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ > > Instead of maintaining three states, can't we do with two states > (Available and Not Available), basically combine 0 and 2 in your case. > I think it will save some cycles in > fsm_local_set, where each time you need to initialize all the entries > in the map. I think we can argue that it is not much overhead, but I > think it is better code-wise also if we can make it happen with fewer > states. That'd work too, but let's consider this scenario: We have a 2-block table that has no free space. After trying each block, the local cache looks like 0123 TT00 Let's say we have to wait to acquire a relation extension lock, because another backend had already started extending the heap by 1 block. We call GetPageWithFreeSpace() and now the local map looks like 0123 TTA0 By using bitwise OR to set availability, the already-tried blocks remain as they are. With only 2 states, the map would look like this instead: 0123 AAAN If we assume that an insert into the newly-created block 2 will almost always succeed, we don't have to worry about wasting time re-checking the first 2 full blocks. Does that sound right to you? > Some assorted comments: > 1. > <para> > -Each heap and index relation, except for hash indexes, has a Free Space > Map > +Each heap relation, unless it is very small, and each index relation, > +except for hash indexes, has a Free Space Map > (FSM) to keep track of available space in the relation. It's stored > > It appears that line has ended abruptly. Not sure what you're referring to here. > 2. > page = BufferGetPage(buffer); > + targetBlock = BufferGetBlockNumber(buffer); > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > - BufferGetBlockNumber(buffer), > + targetBlock, > RelationGetRelationName(relation)); > > PageInit(page, BufferGetPageSize(buffer), 0); > @@ -623,7 +641,18 @@ loop: > * current backend to make more insertions or not, which is probably a > * good bet most of the time. So for now, don't add it to FSM yet. > */ > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > + RelationSetTargetBlock(relation, targetBlock); > > Is this related to this patch? If not, I suggest let's do it > separately if required. I will separate this out. > 3. > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > - uint8 newValue, uint8 minValue); > + uint8 newValue, uint8 minValue); > > This appears to be a spurious change. It was intentional, but I will include it separately as above. > 4. > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > len, > * target. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > + > + /* > + * In case we used an in-memory map of available blocks, reset > + * it for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + ClearLocalMap(); > > How will you clear the local map during error? I think you need to > clear it in abort path and you can name the function as > FSMClearLocalMap or something like that. That sounds right, and I will rename the function that way. For the abort path, were you referring to this or somewhere else? 
if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", targetBlock, RelationGetRelationName(relation)); > 5. > +/*#define TRACE_TARGETBLOCK */ > > Debugging leftover, do you want to retain this and related stuff > during the development of patch? I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's useful for development, but I don't particularly care whether it's in the final version. Also, I found an off-by-one error that caused an unnecessary smgrexists() call in tables with threshold + 1 pages. This will be fixed in the next version. -John Naylor
On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > +/* Status codes for the local map. */ > > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ > > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ > > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ > > > > Instead of maintaining three states, can't we do with two states > > (Available and Not Available), basically combine 0 and 2 in your case. > > I think it will save some cycles in > > fsm_local_set, where each time you need to initialize all the entries > > in the map. I think we can argue that it is not much overhead, but I > > think it is better code-wise also if we can make it happen with fewer > > states. > > That'd work too, but let's consider this scenario: We have a 2-block > table that has no free space. After trying each block, the local cache > looks like > > 0123 > TT00 > > Let's say we have to wait to acquire a relation extension lock, > because another backend had already started extending the heap by 1 > block. We call GetPageWithFreeSpace() and now the local map looks like > > 0123 > TTA0 > > By using bitwise OR to set availability, the already-tried blocks > remain as they are. With only 2 states, the map would look like this > instead: > > 0123 > AAAN > I expect below part of code to go-away. +fsm_local_set(Relation rel, BlockNumber nblocks) { .. + /* + * If the blkno is beyond the end of the relation, the status should + * be zero already, but make sure it is. If the blkno is within the + * relation, mark it available unless it's already been tried. + */ + for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++) + { + if (blkno < nblocks) + FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL; + else + FSMLocalMap[blkno] = FSM_LOCAL_ZERO; + } .. } In my mind for such a case it should look like below: 0123 NNAN > If we assume that an insert into the newly-created block 2 will almost > always succeed, we don't have to worry about wasting time re-checking > the first 2 full blocks. Does that sound right to you? > As explained above, such a situation won't exist. > > > Some assorted comments: > > 1. > > <para> > > -Each heap and index relation, except for hash indexes, has a Free Space > > Map > > +Each heap relation, unless it is very small, and each index relation, > > +except for hash indexes, has a Free Space Map > > (FSM) to keep track of available space in the relation. It's stored > > > > It appears that line has ended abruptly. > > Not sure what you're referring to here. > There is a space after "has a Free Space Map " so you can combine next line. > > 2. > > page = BufferGetPage(buffer); > > + targetBlock = BufferGetBlockNumber(buffer); > > > > if (!PageIsNew(page)) > > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > > - BufferGetBlockNumber(buffer), > > + targetBlock, > > RelationGetRelationName(relation)); > > > > PageInit(page, BufferGetPageSize(buffer), 0); > > @@ -623,7 +641,18 @@ loop: > > * current backend to make more insertions or not, which is probably a > > * good bet most of the time. So for now, don't add it to FSM yet. > > */ > > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > > + RelationSetTargetBlock(relation, targetBlock); > > > > Is this related to this patch? If not, I suggest let's do it > > separately if required. > > I will separate this out. > > > 3. 
> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > > - uint8 newValue, uint8 minValue); > > + uint8 newValue, uint8 minValue); > > > > This appears to be a spurious change. > > It was intentional, but I will include it separately as above. > > > 4. > > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > > len, > > * target. > > */ > > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > + > > + /* > > + * In case we used an in-memory map of available blocks, reset > > + * it for next use. > > + */ > > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > > + ClearLocalMap(); > > > > How will you clear the local map during error? I think you need to > > clear it in abort path and you can name the function as > > FSMClearLocalMap or something like that. > > That sounds right, and I will rename the function that way. For the > abort path, were you referring to this or somewhere else? > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > targetBlock, > RelationGetRelationName(relation)); > I think it might come from any other place between when you set it and before it got cleared (like any intermediate buffer and pin related API's). > > 5. > > +/*#define TRACE_TARGETBLOCK */ > > > > Debugging leftover, do you want to retain this and related stuff > > during the development of patch? > > I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's > useful for development, but I don't particularly care whether it's in > the final verision. > Okay, so if you want to retain it for the period of development, then I am fine with it. We can see at the end if it makes sense to retain it. > Also, I found an off-by-one error that caused an unnecessary > smgrexists() call in tables with threshold + 1 pages. This will be > fixed in the next version. > Thanks. One other thing that slightly bothers me is the call to RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call will happen quite frequently in this code-path and can have some performance impact. As of now, I don't have any idea to avoid it or reduce it more than what you already have in the patch, but I think we should try some more to avoid it. Let me know if you have any ideas around that? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
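One way a two-state map could still produce the NNAN picture Amit sketches above is to remember how many blocks the map was last built for, so that a rebuild for a larger relation marks only the newly added blocks available (the older ones are known to have been full). The standalone C sketch below is a guess at that scheme for illustration; the struct layout and names are not taken from any posted patch.

#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 8

#define LOCAL_NOT_AVAIL 0x00
#define LOCAL_AVAIL     0x01

/* Two-state map plus a remembered block count from the previous fill. */
typedef struct
{
    unsigned int  nblocks;      /* blocks covered by the previous fill */
    unsigned char map[HEAP_FSM_CREATION_THRESHOLD];
} local_map_t;

static local_map_t fsm_local_map;

static void
local_map_set(unsigned int nblocks)
{
    for (unsigned int blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
    {
        if (blkno < fsm_local_map.nblocks)
            fsm_local_map.map[blkno] = LOCAL_NOT_AVAIL;  /* already tried */
        else if (blkno < nblocks)
            fsm_local_map.map[blkno] = LOCAL_AVAIL;      /* new, worth a try */
        else
            fsm_local_map.map[blkno] = LOCAL_NOT_AVAIL;  /* beyond the heap */
    }
    fsm_local_map.nblocks = nblocks;
}

static void
local_map_print(const char *label)
{
    printf("%-30s", label);
    for (unsigned int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
        putchar(fsm_local_map.map[i] ? 'A' : 'N');
    putchar('\n');
}

int
main(void)
{
    local_map_set(2);       /* fresh 2-block, FSM-less heap */
    local_map_print("2-block heap, nothing tried:");
    local_map_set(3);       /* both blocks were full; block 2 was just added */
    local_map_print("after extension to 3 blocks:");
    return 0;
}

Run on its own, the second line starts with NNA, matching the NNAN diagram above for a 2-block heap that another backend extends to 3 blocks.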
On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: >> Let's say we have to wait to acquire a relation extension lock, >> because another backend had already started extending the heap by 1 >> block. We call GetPageWithFreeSpace() and now the local map looks like >> >> 0123 >> TTA0 >> >> By using bitwise OR to set availability, the already-tried blocks >> remain as they are. With only 2 states, the map would look like this >> instead: >> >> 0123 >> AAAN >> > In my mind for such a case it should look like below: > 0123 > NNAN Okay, to retain that behavior with only 2 status codes, I have implemented the map as a struct with 2 members: the cached number of blocks, plus the same array I had before. This also allows a more efficient implementation at the micro level. I just need to do some more testing on it. [ abortive states ] > I think it might come from any other place between when you set it and > before it got cleared (like any intermediate buffer and pin related > API's). Okay, I will look into that. > One other thing that slightly bothers me is the call to > RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call > will happen quite frequently in this code-path and can have some > performance impact. As of now, I don't have any idea to avoid it or > reduce it more than what you already have in the patch, but I think we > should try some more to avoid it. Let me know if you have any ideas > around that? FWIW, I believe that the callers of RecordPageWithFreeSpace() will almost always avoid that call. Otherwise, there is at least one detail that could use attention: If rel->rd_rel->relpages shows fewer pages than the threshold, than the code doesn't trust it to be true. Might be worth revisiting. Aside from that, I will have to think about it. More generally, I have a couple ideas about performance: 1. Only mark available every other block such that visible blocks are interleaved as the relation extends. To explain, this diagram shows a relation extending, with 1 meaning marked available and 0 meaning marked not-available. A NA ANA NANA So for a 3-block table, we never check block 1. Any free space it has acquired will become visible when it extends to 4 blocks. For a 4-block threshold, we only check 2 blocks or less. This reduces the number of lock/pin events but still controls bloat. We could also check both blocks of a 2-block table. 2. During manual testing I seem to remember times that the FSM code was invoked even though I expected the smgr entry to have a cached target block. Perhaps VACUUM or something is clearing that away unnecessarily. It seems worthwhile to verify and investigate, but that seems like a separate project. -John Naylor
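(A rough sketch of the two-member structure John describes above, where the set function only flags blocks beyond the cached count. Names and the threshold are placeholders and this is not the patch itself; it is only meant to show how the two-state map can still avoid retrying blocks after waiting on the extension lock.)

    #include "postgres.h"
    #include <string.h>
    #include "storage/block.h"

    #define HEAP_FSM_CREATION_THRESHOLD 4   /* example value */

    #define FSM_LOCAL_NOT_AVAIL 0x00    /* not worth trying */
    #define FSM_LOCAL_AVAIL     0x01    /* available to try */

    typedef struct
    {
        BlockNumber nblocks;                           /* blocks considered so far */
        uint8       map[HEAP_FSM_CREATION_THRESHOLD];  /* one byte per block */
    } FSMLocalMapData;

    static FSMLocalMapData fsm_local_map;

    /*
     * Mark blocks available starting after the last block number we have
     * cached.  On first use this flags every block in the relation; after
     * waiting for a relation extension lock it only flags blocks newly added
     * by other backends, so blocks we already tried stay not-available.
     */
    static void
    local_map_set_sketch(BlockNumber new_nblocks)
    {
        BlockNumber blkno;

        for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++)
            fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;

        fsm_local_map.nblocks = new_nblocks;
    }

    /* Reset everything once a usable block is found or the map is abandoned. */
    static void
    local_map_clear_sketch(void)
    {
        memset(&fsm_local_map, 0, sizeof(fsm_local_map));
    }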
On Mon, Nov 19, 2018 at 4:40 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: > >> Let's say we have to wait to acquire a relation extension lock, > >> because another backend had already started extending the heap by 1 > >> block. We call GetPageWithFreeSpace() and now the local map looks like > >> > >> 0123 > >> TTA0 > >> > >> By using bitwise OR to set availability, the already-tried blocks > >> remain as they are. With only 2 states, the map would look like this > >> instead: > >> > >> 0123 > >> AAAN > >> > > > In my mind for such a case it should look like below: > > 0123 > > NNAN > > Okay, to retain that behavior with only 2 status codes, I have > implemented the map as a struct with 2 members: the cached number of > blocks, plus the same array I had before. This also allows a more > efficient implementation at the micro level. I just need to do some > more testing on it. > Okay. > [ abortive states ] > > I think it might come from any other place between when you set it and > > before it got cleared (like any intermediate buffer and pin related > > API's). > > Okay, I will look into that. > > > One other thing that slightly bothers me is the call to > > RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call > > will happen quite frequently in this code-path and can have some > > performance impact. As of now, I don't have any idea to avoid it or > > reduce it more than what you already have in the patch, but I think we > > should try some more to avoid it. Let me know if you have any ideas > > around that? > > FWIW, I believe that the callers of RecordPageWithFreeSpace() will > almost always avoid that call. Otherwise, there is at least one detail > that could use attention: If rel->rd_rel->relpages shows fewer pages > than the threshold, than the code doesn't trust it to be true. Might > be worth revisiting. > I think it is less of a concern when called from vacuum code path. > Aside from that, I will have to think about it. > > More generally, I have a couple ideas about performance: > > 1. Only mark available every other block such that visible blocks are > interleaved as the relation extends. To explain, this diagram shows a > relation extending, with 1 meaning marked available and 0 meaning > marked not-available. > > A > NA > ANA > NANA > > So for a 3-block table, we never check block 1. Any free space it has > acquired will become visible when it extends to 4 blocks. For a > 4-block threshold, we only check 2 blocks or less. This reduces the > number of lock/pin events but still controls bloat. We could also > check both blocks of a 2-block table. > We can try something like this if we see there is any visible performance hit in some scenario. > 2. During manual testing I seem to remember times that the FSM code > was invoked even though I expected the smgr entry to have a cached > target block. Perhaps VACUUM or something is clearing that away > unnecessarily. It seems worthwhile to verify and investigate, but that > seems like a separate project. > makes sense, let's not get distracted by stuff that is not related to this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
I wrote: > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > [ abortive states ] >> I think it might come from any other place between when you set it and >> before it got cleared (like any intermediate buffer and pin related >> API's). > > Okay, I will look into that. LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't raise errors at this level. I don't immediately see any additional good places from which to clear the local map. -John Naylor
On Tue, Nov 20, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > I wrote: > > > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ abortive states ] > >> I think it might come from any other place between when you set it and > >> before it got cleared (like any intermediate buffer and pin related > >> API's). > > > > Okay, I will look into that. > > LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't > raise errors at this level. I don't immediately see any additional good > places from which to clear the local map. > LockBuffer()->LWLockAcquire() can error out. Similarly, ReadBuffer()->ReadBufferExtended() and calls below it can error out. To handle them, you need to add a call to clear the local map in the AbortTransaction code path. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: I've attached v8, which includes the 2-state map and addresses the points below: > Some assorted comments: > 1. > <para> > -Each heap and index relation, except for hash indexes, has a Free Space > Map > +Each heap relation, unless it is very small, and each index relation, > +except for hash indexes, has a Free Space Map > (FSM) to keep track of available space in the relation. It's stored > > It appears that line has ended abruptly. Revised. > 2. > page = BufferGetPage(buffer); > + targetBlock = BufferGetBlockNumber(buffer); > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > - BufferGetBlockNumber(buffer), > + targetBlock, > RelationGetRelationName(relation)); > > PageInit(page, BufferGetPageSize(buffer), 0); > @@ -623,7 +641,18 @@ loop: > * current backend to make more insertions or not, which is probably a > * good bet most of the time. So for now, don't add it to FSM yet. > */ > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > + RelationSetTargetBlock(relation, targetBlock); > > Is this related to this patch? If not, I suggest let's do it > separately if required. > > 3. > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > - uint8 newValue, uint8 minValue); > + uint8 newValue, uint8 minValue); > > This appears to be a spurious change. 2 and 3 are separated into 0001. > 4. > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > len, > * target. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > + > + /* > + * In case we used an in-memory map of available blocks, reset > + * it for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + ClearLocalMap(); > > How will you clear the local map during error? I think you need to > clear it in abort path and you can name the function as > FSMClearLocalMap or something like that. Done. I've put this call last before abort processing. > 5. > +/*#define TRACE_TARGETBLOCK */ > > Debugging leftover, do you want to retain this and related stuff > during the development of patch? I have kept it aside as a separate patch but not attached it for now. Also, we don't quite have a consensus on the threshold value, but I have set it to 4 pages for v8. If this is still considered too expensive (and basic tests show it shouldn't be), I suspect it'd be better to interleave the available block numbers as described a couple days ago than lower the threshold further. I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset. -John Naylor
Attachment
On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 3. > > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > > - uint8 newValue, uint8 minValue); > > + uint8 newValue, uint8 minValue); > > > > This appears to be a spurious change. > > 2 and 3 are separated into 0001. > Is the point 3 change related to pgindent? I think even if you want these, then don't prepare other patches on top of this, keep it entirely separate. > > Also, we don't quite have a consensus on the threshold value, but I > have set it to 4 pages for v8. If this is still considered too > expensive (and basic tests show it shouldn't be), I suspect it'd be > better to interleave the available block numbers as described a couple > days ago than lower the threshold further. > Can you please repeat the copy test you have done above with fillfactor as 20 and 30? > I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset. > Cool, I also think so. Few more comments: ------------------------------- 1. I think we can add some test(s) to test the new functionality, may be something on the lines of what Robert has originally provided as an example of this behavior [1]. 2. @@ -2554,6 +2555,12 @@ AbortTransaction(void) s->parallelModeLevel = 0; } + /* + * In case we aborted during RelationGetBufferForTuple(), + * clear the local map of heap pages. + */ + FSMClearLocalMap(); + The similar call is required in AbortSubTransaction function as well. I suggest to add it after pgstat_progress_end_command in both functions. 3. GetPageWithFreeSpace(Relation rel, Size spaceNeeded) { .. + if (target_block == InvalidBlockNumber && + rel->rd_rel->relkind == RELKIND_RELATION) + { + nblocks = RelationGetNumberOfBlocks(rel); + + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) + { + /* + * If the FSM knows nothing of the rel, try the last page before + * we give up and extend. This avoids one-tuple-per-page syndrome + * during bootstrapping or in a recently-started system. + */ + target_block = nblocks - 1; + } .. } Moving this check inside GetPageWithFreeSpace has one disadvantage, we will always consider last block which can have some inadvertent effects. Consider when this function gets called from RelationGetBufferForTuple just before extension, it can consider to check for the last block even though that is already being done in the begining when GetPageWithFreeSpace was called. I am not completely sure how much this is a case to worry because it will help to check last block when the same is concurrently added and FSM is not updated for same. I am slightly worried because the unpatched code doesn't care for such case and we have no intention to change this behaviour. What do you think? 4. You have mentioned above that "system catalogs created during bootstrap still have a FSM if they have any data." and I can also see this behavior, have you investigated this point further? 5. Your logic to update FSM on standby seems okay, but can you show some tests which proves its sanity? [1] - https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: >> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> > >> > 3. >> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 >> > slot, >> > - uint8 newValue, uint8 minValue); >> > + uint8 newValue, uint8 minValue); >> > >> > This appears to be a spurious change. >> >> 2 and 3 are separated into 0001. >> > > Is the point 3 change related to pgindent? I think even if you want > these, then don't prepare other patches on top of this, keep it > entirely separate. There are some places in the codebase that have spaces after tabs for no apparent reason (not to line up function parameters or pointer declarations). pgindent hasn't been able to fix this. If you wish, you are of course free to apply 0001 separately at any time. >> Also, we don't quite have a consensus on the threshold value, but I >> have set it to 4 pages for v8. If this is still considered too >> expensive (and basic tests show it shouldn't be), I suspect it'd be >> better to interleave the available block numbers as described a couple >> days ago than lower the threshold further. >> > > Can you please repeat the copy test you have done above with > fillfactor as 20 and 30? I did two kinds of tests. The first had a fill-factor of 10 [1], the second had the default storage, but I prevented the backend from caching the target block [2], to fully exercise the free space code. Would you like me to repeat the first one with 20 and 30? And do you think it is useful enough to test the copying of 4 blocks and not smaller numbers? > Few more comments: > ------------------------------- > 1. I think we can add some test(s) to test the new functionality, may > be something on the lines of what Robert has originally provided as an > example of this behavior [1]. Maybe the SQL script attached to [3] (which I probably based on Robert's report) can be cleaned up into a regression test. > 3. > GetPageWithFreeSpace(Relation rel, Size spaceNeeded) > { > .. > + if (target_block == InvalidBlockNumber && > + rel->rd_rel->relkind == RELKIND_RELATION) > + { > + nblocks = RelationGetNumberOfBlocks(rel); > + > + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) > + { > + /* > + * If the FSM knows nothing of the rel, try the last page before > + * we give up and extend. This avoids one-tuple-per-page syndrome > + * during bootstrapping or in a recently-started system. > + */ > + target_block = nblocks - 1; > + } > .. > } > > Moving this check inside GetPageWithFreeSpace has one disadvantage, we > will always consider last block which can have some inadvertent > effects. Consider when this function gets called from > RelationGetBufferForTuple just before extension, it can consider to > check for the last block even though that is already being done in the > begining when GetPageWithFreeSpace was called. I am not completely > sure how much this is a case to worry because it will help to check > last block when the same is concurrently added and FSM is not updated > for same. I am slightly worried because the unpatched code doesn't > care for such case and we have no intention to change this behaviour. > What do you think? I see what you mean. If the other backend extended by 1 block, the intention is to keep it out of the FSM at first, and by extension, not visible in other ways. The comment implies that's debatable, but I agree we shouldn't change that without a reason to. 
One simple idea is add a 3rd boolean parameter to GetPageWithFreeSpace() to control whether it gives up if the FSM fork doesn't indicate free space, like if (target_block == InvalidBlockNumber && rel->rd_rel->relkind == RELKIND_RELATION && !check_fsm_only) { nblocks = RelationGetNumberOfBlocks(rel); > 4. You have mentioned above that "system catalogs created during > bootstrap still have a FSM if they have any data." and I can also see > this behavior, have you investigated this point further? Code reading didn't uncover the cause. I might have to step through with a debugger or something similar. I should find time for that next month. > 5. Your logic to update FSM on standby seems okay, but can you show > some tests which proves its sanity? I believe to convince myself it was working, I used the individual commands in the sql file in [3], then used the size function on the secondary. I'll redo that to verify. -- [1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAJVSVGWMXzsqYpPhO3Snz4n5y8Tq-QiviuSCKyB5czCTnq9rzA%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ%40mail.gmail.com -John Naylor
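(To illustrate the proposed flag, here is a sketch of how it could gate the last-block fallback. The helper name is hypothetical and the details are only a reading of the snippets above, not the patch's actual code; a caller that has already tried the last block, such as the re-check just before extension, would pass check_fsm_only = true.)

    #include "postgres.h"
    #include "catalog/pg_class.h"
    #include "storage/block.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    #define HEAP_FSM_CREATION_THRESHOLD 4   /* example value */

    /*
     * Given the block suggested by the FSM fork (or InvalidBlockNumber if it
     * had nothing), decide whether to fall back to the relation's last block.
     * With check_fsm_only = true, the fallback is skipped entirely.
     */
    static BlockNumber
    maybe_try_last_block(Relation rel, BlockNumber fsm_suggestion, bool check_fsm_only)
    {
        if (fsm_suggestion == InvalidBlockNumber &&
            rel->rd_rel->relkind == RELKIND_RELATION &&
            !check_fsm_only)
        {
            BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

            /*
             * If the FSM knows nothing of the rel, try the last page before
             * giving up and extending, to avoid one-tuple-per-page syndrome.
             */
            if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
                return nblocks - 1;
        }

        return fsm_suggestion;
    }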
On Mon, Nov 26, 2018 at 3:46 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: > >> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> > > >> > 3. > >> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 > >> > slot, > >> > - uint8 newValue, uint8 minValue); > >> > + uint8 newValue, uint8 minValue); > >> > > >> > This appears to be a spurious change. > >> > >> 2 and 3 are separated into 0001. > >> > > > > Is the point 3 change related to pgindent? I think even if you want > > these, then don't prepare other patches on top of this, keep it > > entirely separate. > > There are some places in the codebase that have spaces after tabs for > no apparent reason (not to line up function parameters or pointer > declarations). pgindent hasn't been able to fix this. If you wish, you > are of course free to apply 0001 separately at any time. > I am not sure that I am interested in generally changing the parts of code that are not directly related to this patch, feel free to post them separately. > >> Also, we don't quite have a consensus on the threshold value, but I > >> have set it to 4 pages for v8. If this is still considered too > >> expensive (and basic tests show it shouldn't be), I suspect it'd be > >> better to interleave the available block numbers as described a couple > >> days ago than lower the threshold further. > >> > > > > Can you please repeat the copy test you have done above with > > fillfactor as 20 and 30? > > I did two kinds of tests. The first had a fill-factor of 10 [1], the > second had the default storage, but I prevented the backend from > caching the target block [2], to fully exercise the free space code. > Would you like me to repeat the first one with 20 and 30? > Yes. > And do you > think it is useful enough to test the copying of 4 blocks and not > smaller numbers? > You can try that, but I am mainly interested with 4 as threshold or may be higher to see where we start loosing. I think 4 should be a reasonable default for this patch, if later anyone wants to extend, we might want to provide a table level knob. > > Few more comments: > > ------------------------------- > > 1. I think we can add some test(s) to test the new functionality, may > > be something on the lines of what Robert has originally provided as an > > example of this behavior [1]. > > Maybe the SQL script attached to [3] (which I probably based on > Robert's report) can be cleaned up into a regression test. > Yeah, something on those lines works for me. > > 3. > > GetPageWithFreeSpace(Relation rel, Size spaceNeeded) > > { > > .. > > + if (target_block == InvalidBlockNumber && > > + rel->rd_rel->relkind == RELKIND_RELATION) > > + { > > + nblocks = RelationGetNumberOfBlocks(rel); > > + > > + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) > > + { > > + /* > > + * If the FSM knows nothing of the rel, try the last page before > > + * we give up and extend. This avoids one-tuple-per-page syndrome > > + * during bootstrapping or in a recently-started system. > > + */ > > + target_block = nblocks - 1; > > + } > > .. > > } > > > > Moving this check inside GetPageWithFreeSpace has one disadvantage, we > > will always consider last block which can have some inadvertent > > effects. 
Consider when this function gets called from > > RelationGetBufferForTuple just before extension, it can consider to > > check for the last block even though that is already being done in the > > begining when GetPageWithFreeSpace was called. I am not completely > > sure how much this is a case to worry because it will help to check > > last block when the same is concurrently added and FSM is not updated > > for same. I am slightly worried because the unpatched code doesn't > > care for such case and we have no intention to change this behaviour. > > What do you think? > > I see what you mean. If the other backend extended by 1 block, the > intention is to keep it out of the FSM at first, and by extension, not > visible in other ways. The comment implies that's debatable, but I > agree we shouldn't change that without a reason to. One simple idea is > add a 3rd boolean parameter to GetPageWithFreeSpace() to control > whether it gives up if the FSM fork doesn't indicate free space, like > > if (target_block == InvalidBlockNumber && > rel->rd_rel->relkind == RELKIND_RELATION && > !check_fsm_only) > { > nblocks = RelationGetNumberOfBlocks(rel); > I have the exact fix in my mind, so let's do it that way. > > 4. You have mentioned above that "system catalogs created during > > bootstrap still have a FSM if they have any data." and I can also see > > this behavior, have you investigated this point further? > > Code reading didn't uncover the cause. I might have to step through > with a debugger or something similar. I should find time for that next > month. > Okay. > > 5. Your logic to update FSM on standby seems okay, but can you show > > some tests which proves its sanity? > > I believe to convince myself it was working, I used the individual > commands in the sql file in [3], then used the size function on the > secondary. I'll redo that to verify. > Okay. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> Is the point 3 change related to pgindent? I think even if you want > these, then don't prepare other patches on top of this, keep it > entirely separate. Both removed. >> Also, we don't quite have a consensus on the threshold value, but I >> have set it to 4 pages for v8. If this is still considered too >> expensive (and basic tests show it shouldn't be), I suspect it'd be >> better to interleave the available block numbers as described a couple >> days ago than lower the threshold further. >> > > Can you please repeat the copy test you have done above with > fillfactor as 20 and 30? I will send the results in a separate email soon. > Few more comments: > ------------------------------- > 1. I think we can add some test(s) to test the new functionality, may > be something on the lines of what Robert has originally provided as an > example of this behavior [1]. Done. I tried adding it to several schedules, but for some reason vacuuming an empty table failed to truncate the heap to 0 blocks. Putting the test in its own group fixed the problem, but that doesn't seem ideal. > 2. > The similar call is required in AbortSubTransaction function as well. > I suggest to add it after pgstat_progress_end_command in both > functions. Done. > 3. >> agree we shouldn't change that without a reason to. One simple idea is >> add a 3rd boolean parameter to GetPageWithFreeSpace() to control >> whether it gives up if the FSM fork doesn't indicate free space, like > I have the exact fix in my mind, so let's do it that way. Done. This also reverts comments and variable names that referred to updating the local map after relation extension. While at it, I changed a couple conditionals to check the locally cached nblocks rather than the threshold. No functional change, but looks more precise. Might save a few cycles as well. >> > 5. Your logic to update FSM on standby seems okay, but can you show >> > some tests which proves its sanity? >> >> I believe to convince myself it was working, I used the individual >> commands in the sql file in [3], then used the size function on the >> secondary. I'll redo that to verify. I've verified the standby behaves precisely as the primary, as far as the aforementioned script goes. -John Naylor
Attachment
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > > Few more comments: > > ------------------------------- > > 1. I think we can add some test(s) to test the new functionality, may > > be something on the lines of what Robert has originally provided as an > > example of this behavior [1]. > > Done. I tried adding it to several schedules, but for some reason > vacuuming an empty table failed to truncate the heap to 0 blocks. > Putting the test in its own group fixed the problem, but that doesn't > seem ideal. > It might be because it fails the should_attempt_truncation() check. See below code: if (should_attempt_truncation(vacrelstats)) lazy_truncate_heap(onerel, vacrelstats, vac_strategy); -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: >> Done. I tried adding it to several schedules, but for some reason >> vacuuming an empty table failed to truncate the heap to 0 blocks. >> Putting the test in its own group fixed the problem, but that doesn't >> seem ideal. >> > > It might be because it fails the should_attempt_truncation() check. > See below code: > > if (should_attempt_truncation(vacrelstats)) > lazy_truncate_heap(onerel, vacrelstats, vac_strategy); I see. I think truncating the FSM is not essential to show either the old or new behavior -- I could skip that portion to enable running the test in a parallel group. >> Can you please repeat the copy test you have done above with >> fillfactor as 20 and 30? > > I will send the results in a separate email soon. I ran the attached scripts, which populate 100 tables with either 4 or 8 blocks. The test tables were pre-filled with one tuple and vacuumed so that the FSMs were already created when testing the master branch. The patch branch was compiled with a threshold of 8, but testing inserts of 4 pages effectively simulates a threshold of 4. Config was stock, except for fsync = off. I took the average of 40 runs (2 complete tests of 20 runs each) after removing the 10% highest and lowest:

fillfactor=20
# blocks   master    patch
4          19.1ms    17.5ms
8          33.4ms    30.9ms

fillfactor=30
# blocks   master    patch
4          20.1ms    19.7ms
8          34.7ms    34.9ms

It seems the patch might be a bit faster with fillfactor=20, but I'm at a loss as to why that would be. Previous testing with a higher threshold showed a significant performance penalty starting around 10 blocks [1], but that used truncation rather than deletion, and had a fill-factor of 10. -- [1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com -John Naylor
Attachment
On Sat, Dec 1, 2018 at 12:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > >> Done. I tried adding it to several schedules, but for some reason > >> vacuuming an empty table failed to truncate the heap to 0 blocks. > >> Putting the test in its own group fixed the problem, but that doesn't > >> seem ideal. > >> > > > > It might be because it fails the should_attempt_truncation() check. > > See below code: > > > > if (should_attempt_truncation(vacrelstats)) > > lazy_truncate_heap(onerel, vacrelstats, vac_strategy); > > I see. I think truncating the FSM is not essential to show either the > old or new behavior -- I could skip that portion to enable running the > test in a parallel group. > > >> Can you please repeat the copy test you have done above with > >> fillfactor as 20 and 30? > > > > I will send the results in a separate email soon. > > I ran the attached scripts which populates 100 tables with either 4 or > 8 blocks. The test tables were pre-filled with one tuple and vacuumed > so that the FSMs were already created when testing the master branch. > The patch branch was compiled with a threshold of 8, but testing > inserts of 4 pages effectively simulates a threshold of 4. Config was > stock, except for fsync = off. I took the average of 40 runs (2 > complete tests of 20 runs each) after removing the 10% highest and > lowest: > > fillfactor=20 > # blocks master patch > 4 19.1ms 17.5ms > 8 33.4ms 30.9ms > > fillfactor=30 > # blocks master patch > 4 20.1ms 19.7ms > 8 34.7ms 34.9ms > > It seems the patch might be a bit faster with fillfactor=20, but I'm > at a loss as to why that would be. > I see that in your previous tests also with patch, the performance was slightly better. One probable reason could be that for small tables the total number of pages accessed via shared buffers is more without the patch (probably 3 FSM pages + 4 relation). With the patch, you need to only access 4 relation pages. The other overhead of patch (retrying each page) seems to be compensated with FSM search. I think you can once check perf profiles to confirm the same. > Previous testing with a higher > threshold showed a significant performance penalty starting around 10 > blocks [1], but that used truncation rather than deletion, and had a > fill-factor of 10. > Can you check whether the number of pages after test are the same with and without a patch in this setup? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/1/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > Can you check whether the number of pages after test are the same with > and without a patch in this setup? I did verify that the number of pages was as intended. -John Naylor
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > - * Copy/link any fsm and vm files, if they exist + * Copy/link any fsm and vm files, if they exist and if they would + * be created in the new cluster. */ - transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); + if (maps[mapnum].relkind != RELKIND_RELATION || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) + (void) transfer_relfile (&maps[mapnum], "_fsm", vm_must_add_frozenbit); > During pg_upgrade, skip transfer of FSMs if they wouldn't have been created on the new cluster. I think in some cases, it won't be consistent with HEAD's behavior. After truncate, we leave the FSM as it is, so the case where before upgrade the relation was truncated, we won't create the FSM in new cluster and that will be inconsistent with the behavior of HEAD. I think similar anomaly will be there when we delete rows from the table such that after deletion size of relation becomes smaller than HEAP_FSM_CREATION_THRESHOLD. I am not sure if it is a good idea to *not* transfer FSM files during upgrade unless we ensure that we remove FSM whenever the relation size falls below HEAP_FSM_CREATION_THRESHOLD. What do you think? BTW, what is your reasoning for not removing FSM on truncate? Anybody else has an opinion on this matter? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > > > v8 code: +fsm_local_set(Relation rel, BlockNumber new_nblocks) +{ + BlockNumber blkno, + cached_target_block; + + /* + * Mark blocks available starting after the last block number we have + * cached, and ending at the current last block in the relation. + * When we first set the map, this will flag all blocks as available + * to try. If we reset the map while waiting for a relation + * extension lock, this will only flag new blocks as available, + * if any were created by another backend. + */ + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; v9 code: +static void +fsm_local_set(Relation rel, BlockNumber nblocks) +{ + BlockNumber blkno, + cached_target_block; + + for (blkno = 0; blkno < nblocks; blkno++) + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; What is the reason for the above code change in the latest patch version? It would be good if you add few comments atop functions GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and RecordPageWithFreeSpace about their interaction with local map. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> During pg_upgrade, skip transfer of FSMs if they wouldn't have been >> created on the new cluster. > > I think in some cases, it won't be consistent with HEAD's behavior. > After truncate, we leave the FSM as it is, so the case where before > upgrade the relation was truncated, we won't create the FSM in new > cluster and that will be inconsistent with the behavior of HEAD. To be precise, with the TRUNCATE statement, the FSM (everything but the main relation fork, I think) is deleted, but using DELETE to remove all rows from the table will preserve the forks. In the latter case, when VACUUM truncates the FSM, it removes all leaf pages leaving behind the root page and one mid-level page. I haven't changed this behavior. > I think similar anomaly will be there when we delete rows from the table > such that after deletion size of relation becomes smaller than > HEAP_FSM_CREATION_THRESHOLD. Yes, in that case there will be inconsistency, but I'm comfortable with it. Others may not be. > I am not sure if it is a good idea to *not* transfer FSM files during > upgrade unless we ensure that we remove FSM whenever the relation size > falls below HEAP_FSM_CREATION_THRESHOLD. What do you think? BTW, > what is your reasoning for not removing FSM on truncate? My reasoning is that if we ever went past the threshold, it's likely we'll do so again, so I didn't feel it was worth the extra code complexity to remove the FSM. In the pg_upgrade case, however, it is simple to not copy the FSM. -John Naylor
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: >> > >> > v8 code: > +fsm_local_set(Relation rel, BlockNumber new_nblocks) > +{ > + BlockNumber blkno, > + cached_target_block; > + > + /* > + * Mark blocks available starting after the last block number we have > + * cached, and ending at the current last block in the relation. > + * When we first set the map, this will flag all blocks as available > + * to try. If we reset the map while waiting for a relation > + * extension lock, this will only flag new blocks as available, > + * if any were created by another backend. > + */ > + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > v9 code: > +static void > +fsm_local_set(Relation rel, BlockNumber nblocks) > +{ > + BlockNumber blkno, > + cached_target_block; > + > + for (blkno = 0; blkno < nblocks; blkno++) > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > What is the reason for the above code change in the latest patch version? Per your recent comment, we no longer check relation size if we waited on a relation extension lock, so this is essentially a reversion to an earlier version. Keeping v8 would have the advantage that it'd be simple to change our minds about this. Do you have an opinion about that? > It would be good if you add few comments atop functions > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > RecordPageWithFreeSpace about their interaction with local map. Good idea, will do. -John Naylor
On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > >> > > >> > > v8 code: > > +fsm_local_set(Relation rel, BlockNumber new_nblocks) > > +{ > > + BlockNumber blkno, > > + cached_target_block; > > + > > + /* > > + * Mark blocks available starting after the last block number we have > > + * cached, and ending at the current last block in the relation. > > + * When we first set the map, this will flag all blocks as available > > + * to try. If we reset the map while waiting for a relation > > + * extension lock, this will only flag new blocks as available, > > + * if any were created by another backend. > > + */ > > + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) > > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > > > v9 code: > > +static void > > +fsm_local_set(Relation rel, BlockNumber nblocks) > > +{ > > + BlockNumber blkno, > > + cached_target_block; > > + > > + for (blkno = 0; blkno < nblocks; blkno++) > > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > > > What is the reason for the above code change in the latest patch version? > > Per your recent comment, we no longer check relation size if we waited > on a relation extension lock, so this is essentially a reversion to an > earlier version. > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and GetPageWithFreeSpace whereas the change we have discussed was specific to GetPageWithFreeSpace, so not sure if we need any change in fsm_local_set. > Keeping v8 would have the advantage that it'd be > simple to change our minds about this. Do you have an opinion about > that? > > > It would be good if you add few comments atop functions > > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > > RecordPageWithFreeSpace about their interaction with local map. > > Good idea, will do. > Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote: >> Per your recent comment, we no longer check relation size if we waited >> on a relation extension lock, so this is essentially a reversion to an >> earlier version. >> > > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and > GetPageWithFreeSpace whereas the change we have discussed was specific > to GetPageWithFreeSpace, so not sure if we need any change in > fsm_local_set. Not needed, but I assumed wrongly you'd think it unclear otherwise. I've now restored the generality and updated the comments to be closer to v8. > It would be good if you add few comments atop functions > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > RecordPageWithFreeSpace about their interaction with local map. Done. Also additional minor comment editing. I've added an additional regression test for finding the right block and removed a test I thought was redundant. I've kept the test file in its own schedule. -John Naylor
Attachment
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > > I've added an additional regression test for finding the right block > and removed a test I thought was redundant. I've kept the test file in > its own schedule. > +# ---------- +# fsm does a vacuum, and running it in parallel seems to prevent heap truncation. +# ---------- +test: fsm + It is not clear to me from the comment why running it in parallel prevents heap truncation, can you explain what behavior are you seeing and what makes you think that running it in parallel caused it? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: >> >> I've added an additional regression test for finding the right block >> and removed a test I thought was redundant. I've kept the test file in >> its own schedule. >> > > +# ---------- > +# fsm does a vacuum, and running it in parallel seems to prevent heap > truncation. > +# ---------- > +test: fsm > + > > It is not clear to me from the comment why running it in parallel > prevents heap truncation, can you explain what behavior are you seeing > and what makes you think that running it in parallel caused it? One of the tests deletes all records from the relation and vacuums. In the serial schedule, the heap and FSM are truncated; in the parallel schedule they are not. Make check fails, since the tests measure relation size. Taking a closer look, I'm even more alarmed to discover that vacuum doesn't even seem to remove deleted rows in the parallel schedule (that was in the last test I added), which makes no sense and causes that test to fail. I looked in vacuum.sql for possible clues, but didn't see any. I'll have to investigate further. -John Naylor
On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > >> > >> I've added an additional regression test for finding the right block > >> and removed a test I thought was redundant. I've kept the test file in > >> its own schedule. > >> > > > > +# ---------- > > +# fsm does a vacuum, and running it in parallel seems to prevent heap > > truncation. > > +# ---------- > > +test: fsm > > + > > > > It is not clear to me from the comment why running it in parallel > > prevents heap truncation, can you explain what behavior are you seeing > > and what makes you think that running it in parallel caused it? > > One of the tests deletes all records from the relation and vacuums. In > serial schedule, the heap and FSM are truncated; in parallel they are > not. Make check fails, since since the tests measure relation size. > Taking a closer look, I'm even more alarmed to discover that vacuum > doesn't even seem to remove deleted rows in parallel schedule (that > was in the last test I added), which makes no sense and causes that > test to fail. I looked in vacuum.sql for possible clues, but didn't > see any. > I couldn't resist the temptation to figure out what's going on here. The newly added tests have deletes followed by vacuum and then you check whether the vacuum has removed the data by checking heap and or FSM size. Now, when you run such a test in parallel, the vacuum can sometimes skip removing the rows because there are parallel transactions open which can see the deleted rows. You can easily verify this phenomenon by running the newly added tests in one session in psql when there is another parallel session which has an open transaction. For example: Session-1 Begin; Insert into foo values(1); Session-2 \i fsm.sql Now, you should see the results similar to what you are seeing when you ran the fsm test by adding it to one of the parallel group. Can you test this at your end and confirm whether my analysis is correct or not. So, you can keep the test as you have in parallel_schedule, but comment needs to be changed. Also, you need to add the new test in serial_schedule. I have done both the changes in the attached patch, kindly confirm if this looks correct to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On 12/8/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote: > I couldn't resist the temptation to figure out what's going on here. > The newly added tests have deletes followed by vacuum and then you > check whether the vacuum has removed the data by checking heap and or > FSM size. Now, when you run such a test in parallel, the vacuum can > sometimes skip removing the rows because there are parallel > transactions open which can see the deleted rows. Ah yes, of course. > You can easily > verify this phenomenon by running the newly added tests in one session > in psql when there is another parallel session which has an open > transaction. For example: > > Session-1 > Begin; > Insert into foo values(1); > > Session-2 > \i fsm.sql > > Now, you should see the results similar to what you are seeing when > you ran the fsm test by adding it to one of the parallel group. Can > you test this at your end and confirm whether my analysis is correct > or not. Yes, I see the same behavior. > So, you can keep the test as you have in parallel_schedule, but > comment needs to be changed. Also, you need to add the new test in > serial_schedule. I have done both the changes in the attached patch, > kindly confirm if this looks correct to you. Looks good to me. I'll just note that the new line in the serial schedule has an extra space at the end. Thanks for looking into this. -John Naylor
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > 4. You have mentioned above that "system catalogs created during > bootstrap still have a FSM if they have any data." and I can also see > this behavior, have you investigated this point further? I found the cause of this. There is some special-case code in md.c to create any file if it's opened in bootstrap mode. I removed this and a similar special case (attached), and make check still passes. After digging through the history, I'm guessing this has been useless code since about 2001, when certain special catalogs were removed. -John Naylor
Attachment
On Thu, Dec 13, 2018 at 3:18 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > 4. You have mentioned above that "system catalogs created during > > bootstrap still have a FSM if they have any data." and I can also see > > this behavior, have you investigated this point further? > > I found the cause of this. There is some special-case code in md.c to > create any file if it's opened in bootstrap mode. I removed this and a > similar special case (attached), and make check still passes. After > digging through the history, I'm guessing this has been useless code > since about 2001, when certain special catalogs were removed. > Good finding, but I think it is better to discuss this part separately. I have started a new thread for this issue [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KsET6sotf%2BrzOTQfb83pzVEzVhbQi1nxGFYVstVWXUGw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 8, 2018 at 6:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote:
> >
> > On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
> > >>
> > >> I've added an additional regression test for finding the right block
I ran some performance tests on the latest patch v11, and I see a small regression in the execution time of the COPY statement. The tests I used are the same as provided in [1], except that I ran them for fill factors 20 and 70. Here are my results.
Machine: cthulhu (Intel-based 8-NUMA machine)
Server settings are default, configured with HEAP_FSM_CREATION_THRESHOLD = 4.
The entire data directory was on HDD.
Results are the execution time (in ms) of the COPY statement when the number of records is exactly the number that fits in HEAP_FSM_CREATION_THRESHOLD = 4 pages. For fill factor 20 that is up to tid (3, 43), and for fill factor 70 up to tid (3, 157). Each result is the median of 10 runs.
Fill factor 20
Tables   Base (ms)   Patch (ms)   % increase in execution time
500      121.97      125.315      2.7424776584
1000     246.592     253.789      2.9185861666

Fill factor 70
Tables   Base (ms)   Patch (ms)   % increase in execution time
500      211.502     217.128      2.6600221275
1000     420.309     432.606      2.9257046601

So there is a consistent 2-3% regression, and on every run the execution time with patch v11 is slightly higher than base. I also tried inserting more records, up to 8 pages, and the same regression is observed. So I guess even HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect.
[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Results are execution time(unit ms) taken by copy statement when number of > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > till tid (3, 157). Result is taken as a median of 10 runs. > So 2-3% consistent regression, And on every run I can see for patch v11 > execution time is slightly more than base. Thanks for testing! > I also tried to insert more > records till 8 pages and same regression is observed! So I guess even > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! That's curious, because once the table exceeds the threshold, it would be allowed to update the FSM, and in the process write 3 pages that it didn't have to in the 4 page test. The master branch has the FSM already, so I would expect the 8 page case to regress more. What I can do later is provide a supplementary patch to go on top of mine that only checks the last block. If that improves performance, I'll alter my patch to only check every other page. -John Naylor
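(A sketch of the "check every other page" idea, reusing the hypothetical fsm_local_map structure from the earlier two-state sketch: only blocks sharing the parity of the last block are flagged, which yields the A / NA / ANA / NANA pattern from the earlier diagram, so a block skipped at one size becomes a candidate after the next extension. Illustrative only, not a proposed patch.)

    /*
     * Flag only every other block, anchored on the last block, so that the
     * number of candidate pages to read and pin is roughly halved while
     * bloat in skipped blocks still becomes visible as the relation grows.
     */
    static void
    local_map_set_interleaved(BlockNumber nblocks)
    {
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            if (blkno % 2 == (nblocks - 1) % 2)
                fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
            else
                fsm_local_map.map[blkno] = FSM_LOCAL_NOT_AVAIL;
        }

        fsm_local_map.nblocks = nblocks;
    }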
On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > Results are execution time(unit ms) taken by copy statement when number of > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > till tid (3, 157). Result is taken as a median of 10 runs. > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > execution time is slightly more than base. > Have you by any chance checked at scale factor 80 or 100? > Thanks for testing! > > > I also tried to insert more > > records till 8 pages and same regression is observed! So I guess even > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > That's curious, because once the table exceeds the threshold, it would > be allowed to update the FSM, and in the process write 3 pages that it > didn't have to in the 4 page test. The master branch has the FSM > already, so I would expect the 8 page case to regress more. > It is not clear to me why you think there should be regression at 8 pages when HEAP_FSM_CREATION_THRESHOLD is 4. Basically, once FSM starts getting updated, we should be same as HEAD as it won't take any special path? > What I can do later is provide a supplementary patch to go on top of > mine that only checks the last block. If that improves performance, > I'll alter my patch to only check every other page. > Sure, but I guess first we should try to see what is exactly slowing down via perf report. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and > > GetPageWithFreeSpace whereas the change we have discussed was specific > > to GetPageWithFreeSpace, so not sure if we need any change in > > fsm_local_set. I have some minor comments for pg_upgrade patch 1. Now we call stat main fork file in transfer_relfile() + sret = stat(old_file, &statbuf); + /* Save the size of the first segment of the main fork. */ + if (type_suffix[0] == '\0' && segno == 0) + first_seg_size = statbuf.st_size; But we do not handle the case if stat has returned any error! 2. src/bin/pg_upgrade/pg_upgrade.h char *relname; + + char relkind; /* relation relkind -- see pg_class.h */ I think we can remove the added empty line. -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
Thanks, On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > Results are execution time(unit ms) taken by copy statement when number of > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > till tid (3, 157). Result is taken as a median of 10 runs. > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > execution time is slightly more than base. > > Thanks for testing! > > > I also tried to insert more > > records till 8 pages and same regression is observed! So I guess even > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > That's curious, because once the table exceeds the threshold, it would > be allowed to update the FSM, and in the process write 3 pages that it > didn't have to in the 4 page test. The master branch has the FSM > already, so I would expect the 8 page case to regress more. I tested with configuration HEAP_FSM_CREATION_THRESHOLD = 4 and just tried to insert up to 8 blocks to see if the regression carries over to further inserts. > What I can do later is provide a supplementary patch to go on top of > mine that only checks the last block. If that improves performance, > I'll alter my patch to only check every other page. Running callgrind for the same test shows the stats below:

Before patch
==========
Number of calls   function_name
2000              heap_multi_insert
2000              RelationGetBufferForTuple
3500              ReadBufferBI

After patch
=========
Number of calls   function_name
2000              heap_multi_insert
2000              RelationGetBufferForTuple
5000              ReadBufferBI

I guess the increase in ReadBufferBI() calls might be what is causing the regression. Sorry, I have not investigated it. I will check the same with your next patch! -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 4, 2019 at 8:23 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Thanks, > > On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > > Results are execution time(unit ms) taken by copy statement when number of > > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > > till tid (3, 157). Result is taken as a median of 10 runs. > > > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > > execution time is slightly more than base. > > > > Thanks for testing! > > > > > I also tried to insert more > > > records till 8 pages and same regression is observed! So I guess even > > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > > > That's curious, because once the table exceeds the threshold, it would > > be allowed to update the FSM, and in the process write 3 pages that it > > didn't have to in the 4 page test. The master branch has the FSM > > already, so I would expect the 8 page case to regress more. > > I tested with configuration HEAP_FSM_CREATION_THRESHOLD = 4 and just > tried to insert till 8 blocks to see if regression is carried on with > further inserts. > > > What I can do later is provide a supplementary patch to go on top of > > mine that only checks the last block. If that improves performance, > > I'll alter my patch to only check every other page. > > Running callgrind for same test shows below stats > Before patch > ========== > Number of calls function_name > 2000 heap_multi_insert > 2000 RelationGetBufferForTuple > 3500 ReadBufferBI > > After Patch > ========= > Number of calls function_name > 2000 heap_multi_insert > 2000 RelationGetBufferForTuple > 5000 ReadBufferBI > > I guess Increase in ReadBufferBI() calls might be the reason which is > causing regression. Sorry I have not investigated it. > I think the reason is that we are checking each block when blocks are less than HEAP_FSM_CREATION_THRESHOLD. Even though all the blocks are in memory, there is some cost to check them all. OTOH, without the patch, even if it accesses FSM, it won't have to make so many in-memory reads for blocks. BTW, have you check for scale_factor 80 or 100 as suggested last time? > I will check > same with your next patch! > Yeah, that makes sense, John, can you provide a patch on top of the current patch where we check either the last block or every other block. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
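To make the cost comparison concrete, here is a small standalone C model (illustrative only, not PostgreSQL source; the helper names and free-space numbers are invented, and only the threshold value comes from this thread). Without an FSM, every candidate block has to be fetched and inspected, so the number of page reads grows with the number of heap blocks below the threshold, which is consistent with the extra ReadBufferBI() calls in the callgrind output; an FSM lookup answers the same question from a small, fixed number of FSM pages.

/*
 * Toy model: count the page reads needed to find a block with enough
 * free space when there is no FSM and every block is probed in turn.
 * read_block_free_space() stands in for a buffer access such as
 * ReadBufferBI(); the free-space values are made up.
 */
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4	/* threshold value from this thread */

static int	page_reads = 0;

static int
read_block_free_space(const int *free_space, int blkno)
{
	page_reads++;				/* each probe costs one page read */
	return free_space[blkno];
}

static int
find_block_without_fsm(const int *free_space, int nblocks, int needed)
{
	int			blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		if (read_block_free_space(free_space, blkno) >= needed)
			return blkno;
	}
	return -1;					/* caller would extend the relation */
}

int
main(void)
{
	/* pretend only the last of four blocks still has room */
	int			free_space[HEAP_FSM_CREATION_THRESHOLD] = {0, 0, 0, 64};
	int			blkno = find_block_without_fsm(free_space,
											   HEAP_FSM_CREATION_THRESHOLD,
											   32);

	printf("found block %d after %d page reads\n", blkno, page_reads);
	return 0;
}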
On 1/3/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, that makes sense, John, can you provide a patch on top of the > current patch where we check either the last block or every other > block. I've attached two patches for testing. Each one applies on top of the current patch. Mithun, I'll respond to your other review comments later this week. -John Naylor
Attachment
On Sun, Dec 30, 2018 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > > I also tried to insert more > > > records till 8 pages and same regression is observed! So I guess even > > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > > > That's curious, because once the table exceeds the threshold, it would > > be allowed to update the FSM, and in the process write 3 pages that it > > didn't have to in the 4 page test. The master branch has the FSM > > already, so I would expect the 8 page case to regress more. > > > > It is not clear to me why you think there should be regression at 8 > pages when HEAP_FSM_CREATION_THRESHOLD is 4. Basically, once FSM > starts getting updated, we should be same as HEAD as it won't take any > special path? In this particular test, the FSM is already created ahead of time for the master branch, so we can compare accessing FSM versus checking every page. My reasoning is that passing the threshold would take some time to create 3 FSM pages with the patch, leading to a larger regression. It seems we don't observe this, however. On Sun, Dec 30, 2018 at 10:59 PM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > I have some minor comments for pg_upgrade patch > 1. Now we call stat main fork file in transfer_relfile() > + sret = stat(old_file, &statbuf); > > + /* Save the size of the first segment of the main fork. */ > + if (type_suffix[0] == '\0' && segno == 0) > + first_seg_size = statbuf.st_size; > > But we do not handle the case if stat has returned any error! How about this: /* Did file open fail? */ if (stat(old_file, &statbuf) != 0) { /* Extent, fsm, or vm does not exist? That's OK, just return */ if (errno == ENOENT && (type_suffix[0] != '\0' || segno != 0)) return first_seg_size; else pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\" to \"%s\"): %s\n", map->nspname, map->relname, old_file, new_file, strerror(errno)); } /* Save the size of the first segment of the main fork. */ else if (type_suffix[0] == '\0' && segno == 0) first_seg_size = statbuf.st_size; /* If extent, fsm, or vm is empty, just return */ else if (statbuf.st_size == 0) return first_seg_size; > 2. src/bin/pg_upgrade/pg_upgrade.h > > char *relname; > + > + char relkind; /* relation relkind -- see pg_class.h */ > > I think we can remove the added empty line. In the full context: - /* the rest are used only for logging and error reporting */ + + /* These are used only for logging and error reporting. */ char *nspname; /* namespaces */ char *relname; + + char relkind; /* relation relkind -- see pg_class.h */ Relkind is not used for logging or error reporting, so the space sets it apart from the previous members. I could instead put relkind before those other two... -John Naylor
Hi John Naylor, On Tue, Jan 8, 2019 at 2:27 AM John Naylor <jcnaylor@gmail.com> wrote: > I've attached two patches for testing. Each one applies on top of the > current patch. Thanks for the patches. I did a quick test of both, using the same tests as in [1], now for fill factors 20, 70, and 100. (Note: with HEAP_FSM_CREATION_THRESHOLD = 4, the highest tid inserted was (3,43) for fill factor 20, (3,157) for fill factor 70, and (3,225) for fill factor 100, so exactly 4 pages are used.) Machine: cthulhu, same as before [2], with default server settings. Test: the COPY command as in [1], for 500 tables.
Fill factor 20 -- execution time in ms, % increase in execution time
Base 119.238
v11-all-pages 121.974 2.2945705228
v11-Every-other-page 114.455 -4.0113051209
v11-last-page 113.573 -4.7510021973
Fill factor 70 -- execution time in ms, % increase in execution time
Base 209.991
v11-all-pages 211.076 0.5166888105
v11-Every-other-page 206.476 -1.6738812616
v11-last-page 203.591 -3.0477496655
Fill factor 100 -- execution time in ms, % increase in execution time
Base 269.691
v11-all-pages 270.078 0.1434975583
v11-Every-other-page 262.691 -2.5955630703
v11-last-page 260.293 -3.4847288193
Observations:
1. Execution time for both base and the v11-all-pages patch has improved over my earlier results [2], but v11-all-pages is still slightly behind base.
2. The v11-Every-other-page and v11-last-page patches improve performance over base.
3. IMHO v11-Every-other-page would be the ideal choice: it improves performance and, to an extent, also avoids relation extension when space is already available.
[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD__Ouj%3Dat4hy2wYidK90v92qSRLjU%2BQe4y-PwfjLLeGkhc6ZA%40mail.gmail.com
-- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > 2. v11-Every-other-page and v11-last-page patches improve the > performance from base. > 3. IMHO v11-Every-other-page would be ideal to consider it improves > the performance and also to an extent avoid expansion if space is > already available. Good to hear. I'll clean up the every-other-page patch and include it in my next version. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 9, 2019 at 9:00 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > 2. v11-Every-other-page and v11-last-page patches improve the > > performance from base. > > 3. IMHO v11-Every-other-page would be ideal to consider it improves > > the performance and also to an extent avoid expansion if space is > > already available. > Thanks, Mithun for performance testing, it really helps us to choose the right strategy here. Once John provides next version, it would be good to see the results of regular pgbench (read-write) runs (say at 50 and 300 scale factor) and the results of large copy. I don't think there will be any problem, but we should just double check that. > Good to hear. I'll clean up the every-other-page patch and include it > in my next version. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks, Mithun for performance testing, it really helps us to choose > the right strategy here. Once John provides next version, it would be > good to see the results of regular pgbench (read-write) runs (say at > 50 and 300 scale factor) and the results of large copy. I don't think > there will be any problem, but we should just double check that. Attached is v12 using the alternating-page strategy. I've updated the comments and README as needed. In addition, I've -handled a possible stat() call failure during pg_upgrade -added one more assertion -moved the new README material into a separate paragraph -added a comment to FSMClearLocalMap() about transaction abort -corrected an outdated comment that erroneously referred to extension rather than creation -fleshed out the draft commit messages
Attachment
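As a companion to the description above, here is a minimal standalone sketch of the alternating-page idea (a toy model, not the patch itself; the map array and flag names merely echo names that appear later in this thread). Starting from the last block and stepping back two blocks at a time, every other page is flagged as a candidate to try before the heap is extended.

/*
 * Toy model of an "every other page" local map for a heap at or below
 * HEAP_FSM_CREATION_THRESHOLD blocks.  Illustrative only.
 */
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define FSM_LOCAL_NOT_AVAIL 0
#define FSM_LOCAL_AVAIL		1

static unsigned char local_map[HEAP_FSM_CREATION_THRESHOLD];

/* Flag every other block, starting from the last one, as available to try. */
static void
local_map_set(int cur_nblocks)
{
	int			blkno = cur_nblocks - 1;

	for (;;)
	{
		local_map[blkno] = FSM_LOCAL_AVAIL;
		if (blkno >= 2)
			blkno -= 2;
		else
			break;
	}
}

int
main(void)
{
	int			nblocks;

	for (nblocks = 1; nblocks <= HEAP_FSM_CREATION_THRESHOLD; nblocks++)
	{
		int			blkno;

		for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
			local_map[blkno] = FSM_LOCAL_NOT_AVAIL;
		local_map_set(nblocks);

		printf("%d block(s): ", nblocks);
		for (blkno = 0; blkno < nblocks; blkno++)
			putchar(local_map[blkno] == FSM_LOCAL_AVAIL ? 'A' : 'N');
		putchar('\n');
	}
	return 0;
}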
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks, Mithun for performance testing, it really helps us to choose
> > the right strategy here. Once John provides next version, it would be
> > good to see the results of regular pgbench (read-write) runs (say at
> > 50 and 300 scale factor) and the results of large copy. I don't think
> > there will be any problem, but we should just double check that.
>
> Attached is v12 using the alternating-page strategy. I've updated the
> comments and README as needed. In addition, I've
Below are my performance tests and numbers
Machine : cthulhu
Tests and setups
Server settings:
max_connections = 200
shared_buffers=8GB
checkpoint_timeout =15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
min_wal_size=15GB and max_wal_size=20GB.
pgbench settings:
-----------------------
read-write settings (TPCB like tests)
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres
scale factor 50 -- median of 3 TPS
clients v12-patch base patch % diff
1 826.081588 834.328238 -0.9884179421
16 10805.807081 10800.662805 0.0476292621
32 19722.277019 19641.546628 0.4110185034
64 30232.681889 30263.616073 -0.1022157561
scale factor 300 -- median of 3 TPS
clients v12-patch base patch % diff
1 813.646062 822.18648 -1.038744641
16 11379.028702 11277.05586 0.9042505709
32 21688.084093 21613.044463 0.3471960192
64 36288.85711 36348.6178 -0.1644098005
Copy command
Test: setup
./psql -d postgres -c "COPY pgbench_accounts TO '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv"
./psql -d postgres -c "CREATE UNLOGGED TABLE pgbench_accounts_ulg (LIKE pgbench_accounts) WITH (fillfactor = 100);"
Test run:
TRUNCATE TABLE pgbench_accounts_ulg;
\timing
COPY pgbench_accounts_ulg FROM '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv;
\timing
execution time in ms. (scale factor indicates size of pgbench_accounts)
scale factor v12-patch base patch % diff
300 77166.407 77862.041 -0.8934186557
50 13329.233 13284.583 0.3361038882
So the large-table tests do not show any considerable performance variance from the base code!
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've > Few comments: --------------------------- 1. Commit message: > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. I think the free space will be available after the next pass of vacuum, no? How can relation extension make it available? 2. +2. For very small heap relations, the FSM would be relatively large and +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space +and to improve performance. To locate free space in this case, we simply +iterate over the heap, trying alternating pages in turn. There may be some +wasted free space in this case, but it becomes visible again upon next +relation extension. a. Again, how space becomes available at next relation extension. b. I think there is no use of mentioning the version number in the above comment, this code will be present from PG-12, so one can find out from which version this optimization is added. 3. BlockNumber RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage, Size oldSpaceAvail, Size spaceNeeded) { .. + /* First try the local map, if it exists. */ + if (oldPage < fsm_local_map.nblocks) + { .. } The comment doesn't appear to be completely in sync with the code. Can't we just check whether "fsm_local_map.nblocks > 0", if so, we can use a macro for the same? I have changed this in the attached patch, see what you think about it. I have used it at a few other places as well. 4. + * When we initialize the map, the whole heap is potentially available to + * try. If a caller wanted to reset the map after another backend extends + * the relation, this will only flag new blocks as available. No callers + * do this currently, however. + */ +static void +fsm_local_set(Relation rel, BlockNumber curr_nblocks) { .. + if (blkno >= fsm_local_map.nblocks + 2) .. } The way you have tried to support the case as quoted in the comment "If a caller wanted to reset the map after another backend extends .." doesn't appear to be solid and I am not sure if it is correct either. We don't have any way to test the same, so I suggest let's try to simplify the case w.r.t current requirement of this API. I think we should some simple logic to try every other block like: + blkno = cur_nblocks - 1; + while (true) + { + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; + if (blkno >= 2) + blkno -= 2; + else + break; + } I have changed this in the attached patch. 5. +/* + * Search the local map for an available block to try, in descending order. + * + * For use when there is no FSM. + */ +static BlockNumber +fsm_local_search(void) We should give a brief explanation as to why we try in descending order. I have added some explanation in the attached patch, see what you think about it? Apart from the above, I have modified a few comments. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 16, 2019 at 9:25 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Thanks, Mithun for performance testing, it really helps us to choose > > > the right strategy here. Once John provides next version, it would be > > > good to see the results of regular pgbench (read-write) runs (say at > > > 50 and 300 scale factor) and the results of large copy. I don't think > > > there will be any problem, but we should just double check that. > > > > Attached is v12 using the alternating-page strategy. I've updated the > > comments and README as needed. In addition, I've > > > execution time in ms. (scale factor indicates size of pgbench_accounts) > scale factor v12-patch base patch % diff > 300 77166.407 77862.041 -0.8934186557 > 50 13329.233 13284.583 0.3361038882 > > So for large table tests do not show any considerable performance variance from base code! > I think with these results, we can conclude this patch doesn't seem to have any noticeable regression for all the tests we have done, right? Thanks a lot for doing various performance tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > 1. > Commit message: > > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. > > I think the free space will be available after the next pass of > vacuum, no? How can relation extension make it available? To explain, this diagram shows the map as it looks for different small table sizes: 0123 A NA ANA NANA So for a 3-block table, the alternating strategy never checks block 1. Any free space block 1 has acquired via delete-and-vacuum will become visible if it extends to 4 blocks. We are accepting a small amount of bloat for improved performance, as discussed. Would it help to include this diagram in the README? > 2. > +2. For very small heap relations, the FSM would be relatively large and > +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for > +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space > +and to improve performance. To locate free space in this case, we simply > +iterate over the heap, trying alternating pages in turn. There may be some > +wasted free space in this case, but it becomes visible again upon next > +relation extension. > > a. Again, how space becomes available at next relation extension. > b. I think there is no use of mentioning the version number in the > above comment, this code will be present from PG-12, so one can find > out from which version this optimization is added. It fits with the reference to PG 8.4 earlier in the document. I chose to be consistent, but to be honest, I'm not much in favor of a lot of version references in code/READMEs. > 3. > BlockNumber > RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage, > Size oldSpaceAvail, Size spaceNeeded) > { > .. > + /* First try the local map, if it exists. */ > + if (oldPage < fsm_local_map.nblocks) > + { > .. > } > > The comment doesn't appear to be completely in sync with the code. > Can't we just check whether "fsm_local_map.nblocks > 0", if so, we > can use a macro for the same? I have changed this in the attached > patch, see what you think about it. I have used it at a few other > places as well. The macro adds clarity, so I'm in favor of using it. > 4. > + * When we initialize the map, the whole heap is potentially available to > + * try. If a caller wanted to reset the map after another backend extends > + * the relation, this will only flag new blocks as available. No callers > + * do this currently, however. > + */ > +static void > +fsm_local_set(Relation rel, BlockNumber curr_nblocks) > { > .. > + if (blkno >= fsm_local_map.nblocks + 2) > .. > } > > > The way you have tried to support the case as quoted in the comment > "If a caller wanted to reset the map after another backend extends .." > doesn't appear to be solid and I am not sure if it is correct either. I removed this case in v9 and you objected to that as unnecessary, so I reverted it for v10. > We don't have any way to test the same, so I suggest let's try to > simplify the case w.r.t current requirement of this API. I think we > should > some simple logic to try every other block like: > > + blkno = cur_nblocks - 1; > + while (true) > + { > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > + if (blkno >= 2) > + blkno -= 2; > + else > + break; > + } > > I have changed this in the attached patch. Fine by me. > 5. 
> +/* > + * Search the local map for an available block to try, in descending order. > + * > + * For use when there is no FSM. > + */ > +static BlockNumber > +fsm_local_search(void) > > We should give a brief explanation as to why we try in descending > order. I have added some explanation in the attached patch, see what > you think about it? > > Apart from the above, I have modified a few comments. I'll include these with some grammar corrections in the next version. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
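A minimal standalone model of the descending search over such a map (illustrative only, not the patch's fsm_local_search(); names and values below are made up for this example). For the 3-block "ANA" case shown above, block 1 is never probed, matching the diagram; the caller tries the returned block and, if it turns out not to have enough room, clears its flag so the next call moves on to a lower block.

/*
 * Toy model of searching a local map in descending block-number order.
 */
#include <stdio.h>

#define FSM_LOCAL_NOT_AVAIL 0
#define FSM_LOCAL_AVAIL		1
#define InvalidBlockNumber	(-1)

static int	local_map_nblocks = 3;	/* a 3-block heap: the map is "ANA" */
static unsigned char local_map[3] = {
	FSM_LOCAL_AVAIL, FSM_LOCAL_NOT_AVAIL, FSM_LOCAL_AVAIL
};

static int
local_map_search(void)
{
	int			blkno;

	for (blkno = local_map_nblocks - 1; blkno >= 0; blkno--)
	{
		if (local_map[blkno] == FSM_LOCAL_AVAIL)
			return blkno;
	}
	return InvalidBlockNumber;	/* nothing left to try; extend the heap */
}

int
main(void)
{
	int			blkno;

	while ((blkno = local_map_search()) != InvalidBlockNumber)
	{
		printf("trying block %d\n", blkno);
		local_map[blkno] = FSM_LOCAL_NOT_AVAIL;	/* pretend it had no room */
	}
	printf("map exhausted, extend the relation\n");
	return 0;
}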
On Wed, Jan 16, 2019 at 11:40 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > can use a macro for the same? I have changed this in the attached > > patch, see what you think about it. I have used it at a few other > > places as well. > > The macro adds clarity, so I'm in favor of using it. It just occurred to me that the style FSM_LOCAL_MAP_EXISTS seems more common for macros that refer to constants, and FSMLocalMapExists for expressions, but I've only seen a small amount of the code base. Do we have a style preference here, or is it more a matter of matching the surrounding code? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <john.naylor@2ndquadrant.com> writes: > It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more > common for macros that refer to constants, and FSMLocalMapExists for > expressions, but I've only seen a small amount of the code base. Do we > have a style preference here, or is it more a matter of matching the > surrounding code? I believe there's a pretty longstanding tradition in C coding to use all-caps names for macros representing constants. Some people think that goes for all macros period, but I'm not on board with that for function-like macros. Different parts of the PG code base make different choices between camel-case and underscore-separation for multiword function names. For that, I'd say match the style of nearby code. regards, tom lane
On Wed, Jan 16, 2019 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > John Naylor <john.naylor@2ndquadrant.com> writes: > > It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more > > common for macros that refer to constants, and FSMLocalMapExists for > > expressions, but I've only seen a small amount of the code base. Do we > > have a style preference here, or is it more a matter of matching the > > surrounding code? > I am fine with the style (FSMLocalMapExists) you are suggesting, but see the similar macros in nearby code like: #define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4) I think the above is not an exact match. So, I have looked around and found few other macros which serve a somewhat similar purpose, see below: #define ATT_IS_PACKABLE(att) \ ((att)->attlen == -1 && (att)->attstorage != 'p') #define VARLENA_ATT_IS_PACKABLE(att) \ ((att)->attstorage != 'p') #define CHECK_REL_PROCEDURE(pname) #define SPTEST(f, x, y) \ DatumGetBool(DirectFunctionCall2(f, PointPGetDatum(x), PointPGetDatum(y))) > I believe there's a pretty longstanding tradition in C coding to use > all-caps names for macros representing constants. Some people think > that goes for all macros period, but I'm not on board with that for > function-like macros. > > Different parts of the PG code base make different choices between > camel-case and underscore-separation for multiword function names. > For that, I'd say match the style of nearby code. > Yes, that is what we normally do. However, in some cases, we might need to refer to other places as well which I think is the case here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
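For readers skimming the style question, the two spellings under discussion would look roughly like this (the macro body follows Amit's earlier suggestion to test fsm_local_map.nblocks > 0; the struct and main() are scaffolding added only so the fragment compiles, not code from the patch).

#include <stdio.h>

/* stand-in for the backend-local map state discussed upthread */
struct
{
	int			nblocks;
}			fsm_local_map = {3};

/* constant-style spelling: all caps with underscores */
#define FSM_LOCAL_MAP_EXISTS	(fsm_local_map.nblocks > 0)

/* function-like spelling: camel case */
#define FSMLocalMapExists()		(fsm_local_map.nblocks > 0)

int
main(void)
{
	printf("%d %d\n", FSM_LOCAL_MAP_EXISTS, FSMLocalMapExists());
	return 0;
}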
On Wed, Jan 16, 2019 at 10:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > 1. > > Commit message: > > > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. > > > > I think the free space will be available after the next pass of > > vacuum, no? How can relation extension make it available? > > To explain, this diagram shows the map as it looks for different small > table sizes: > > 0123 > A > NA > ANA > NANA > > So for a 3-block table, the alternating strategy never checks block 1. > Any free space block 1 has acquired via delete-and-vacuum will become > visible if it extends to 4 blocks. We are accepting a small amount of > bloat for improved performance, as discussed. Would it help to include > this diagram in the README? > Yes, I think it would be good if you can explain the concept of local-map with the help of this example. > > 2. > > +2. For very small heap relations, the FSM would be relatively large and > > +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for > > +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space > > +and to improve performance. To locate free space in this case, we simply > > +iterate over the heap, trying alternating pages in turn. There may be some > > +wasted free space in this case, but it becomes visible again upon next > > +relation extension. > > > > a. Again, how space becomes available at next relation extension. > > b. I think there is no use of mentioning the version number in the > > above comment, this code will be present from PG-12, so one can find > > out from which version this optimization is added. > > It fits with the reference to PG 8.4 earlier in the document. I chose > to be consistent, but to be honest, I'm not much in favor of a lot of > version references in code/READMEs. > Then let's not add a reference to the version number in this case. I also don't see much advantage of adding version number at least in this case. > > I'll include these with some grammar corrections in the next version. > Okay, thanks! -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yes, I think it would be good if you can explain the concept of > local-map with the help of this example. > Then let's not add a reference to the version number in this case. I Okay, done in v14. I kept your spelling of the new macro. One minor detail added: use uint8 rather than char for the local map array. This seems to be preferred, especially in this file. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 17, 2019 at 11:13 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yes, I think it would be good if you can explain the concept of > > local-map with the help of this example. > > > Then let's not add a reference to the version number in this case. I > > Okay, done in v14. I kept your spelling of the new macro. One minor > detail added: use uint8 rather than char for the local map array. This > seems to be preferred, especially in this file. > I am fine with your change. Few more comments: 1. I think we should not allow to create FSM for toast tables as well till there size reaches HEAP_FSM_CREATION_THRESHOLD. If you try below test, you can see that FSM will be created for the toast table even if the size of toast relation is 1 page. CREATE OR REPLACE FUNCTION random_text(length INTEGER) RETURNS TEXT LANGUAGE SQL AS $$ select string_agg(chr (32+(random()*96)::int), '') from generate_series(1,length); $$; create table tt(c1 int, c2 text); insert into tt values(1, random_text(2500)); Vacuum tt; I have fixed this in the attached patch, kindly verify it once and see if you can add the test for same as well. 2. -CREATE TABLE test1 (a int, b int); -INSERT INTO test1 VALUES (16777217, 131584); +CREATE TABLE test_rel_forks (a int); +-- Make sure there are enough blocks in the heap for the FSM to be created. +INSERT INTO test_rel_forks SELECT g from generate_series(1,10000) g; -VACUUM test1; -- set up FSM +-- set up FSM and VM +VACUUM test_rel_forks; This test will create 45 pages instead of 1. I know that to create FSM, we now need more than 4 pages, but 45 seems to be on the higher side. I think we should not unnecessarily populate more data if there is no particular need for it, let's restrict the number of pages to 5 if possible. 3. -SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1; - fsm_1 -------- - 8192 -(1 row) - -SELECT octet_length (get_raw_page('test1', 'vm', 0)) AS vm_0; +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; +ERROR: block number 10 is out of range for relation "test_rel_forks" Why have you changed the test definition here? Previously test checks the existing FSM page, but now it tries to access out of range page. Apart from the above, I have changed one sentence in README. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
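A tiny standalone model of the creation rule being discussed (not the server code; the function and its signature are invented for illustration, while the threshold and the relkind letters follow this thread and pg_class conventions): only ordinary heaps and toast tables have their FSM suppressed, and only while they are at or below HEAP_FSM_CREATION_THRESHOLD pages; other relkinds keep their existing behavior.

/*
 * Toy model of the "is the FSM suppressed for this relation?" rule.
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4

#define RELKIND_RELATION	'r'
#define RELKIND_TOASTVALUE	't'
#define RELKIND_INDEX		'i'

static bool
fsm_suppressed_by_threshold(char relkind, int nblocks)
{
	/* only heaps and toast tables are affected by the threshold */
	if (relkind != RELKIND_RELATION && relkind != RELKIND_TOASTVALUE)
		return false;
	return nblocks <= HEAP_FSM_CREATION_THRESHOLD;
}

int
main(void)
{
	printf("1-page toast table: FSM suppressed? %s\n",
		   fsm_suppressed_by_threshold(RELKIND_TOASTVALUE, 1) ? "yes" : "no");
	printf("5-page heap: FSM suppressed?        %s\n",
		   fsm_suppressed_by_threshold(RELKIND_RELATION, 5) ? "yes" : "no");
	printf("1-page index: FSM suppressed?       %s\n",
		   fsm_suppressed_by_threshold(RELKIND_INDEX, 1) ? "yes" : "no");
	return 0;
}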
On Sat, Jan 19, 2019 at 8:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 17, 2019 at 11:13 PM John Naylor > Few more comments: > 1. > I think we should not allow to create FSM for toast tables as well > till there size reaches HEAP_FSM_CREATION_THRESHOLD. If you try below > test, you can see that FSM will be created for the toast table even if > the size of toast relation is 1 page. ... > I have fixed this in the attached patch, kindly verify it once and see > if you can add the test for same as well. Works for me. For v16, I've added and tested similar logic to pg_upgrade and verified that toast tables work the same as normal tables in recovery. I used a slightly different method to generate the long random string to avoid creating a function. Also, some cosmetic adjustments -- I changed the regression test to use 'i' instead of 'g' to match the use of generate_series in most other tests, and made capitalization more consistent. > 2. > -CREATE TABLE test1 (a int, b int); > -INSERT INTO test1 VALUES (16777217, 131584); > +CREATE TABLE test_rel_forks (a > int); > +-- Make sure there are enough blocks in the heap for the FSM to be created. > +INSERT INTO test_rel_forks SELECT g > from generate_series(1,10000) g; > > -VACUUM test1; -- set up FSM > +-- set up FSM and VM > +VACUUM test_rel_forks; > > This test will create 45 pages instead of 1. I know that to create > FSM, we now need more than 4 pages, but 45 seems to be on the higher > side. I think we should not unnecessarily populate more data if there > is no particular need for it, let's restrict the number of pages to 5 > if possible. Good idea, done here and in the fsm regression test. > 3. > -SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1; > - fsm_1 > -------- > - 8192 > -(1 row) > - > -SELECT octet_length > (get_raw_page('test1', 'vm', 0)) AS vm_0; > +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > +ERROR: block number 10 is out of range for relation "test_rel_forks" > > Why have you changed the test definition here? Previously test checks > the existing FSM page, but now it tries to access out of range page. The patch is hard to read here, but I still have a test for the existing FSM page: -SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0; +SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; +ERROR: block number 100 is out of range for relation "test_rel_forks" +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; fsm_0 ------- 8192 (1 row) I have a test for in-range and out-of-range for each relation fork. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > Review of v16-0002-During-pg_upgrade-conditionally-skip-transfer-of: - * Copy/link any fsm and vm files, if they exist + * Copy/link any fsm and vm files, if they exist and if they would + * be created in the new cluster. */ - transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); + if ((maps[mapnum].relkind != RELKIND_RELATION && + maps[mapnum].relkind != RELKIND_TOASTVALUE) || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION (new_cluster.major_version) <= 1100) + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); So we won't allow transfer of FSM files if their size is below HEAP_FSM_CREATION_THRESHOLD. What will be its behavior in link mode? It seems that the old files will remain there. Will it create any problem when we try to create the files via the new server, can you once test this case? Also, another case to think in this regard is the upgrade for standby servers, if you read below paragraph from the user manual [1], you will see what I am worried about? "What this does is to record the links created by pg_upgrade's link mode that connect files in the old and new clusters on the primary server. It then finds matching files in the standby's old cluster and creates links for them in the standby's new cluster. Files that were not linked on the primary are copied from the primary to the standby. (They are usually small.)" [1] - https://www.postgresql.org/docs/devel/pgupgrade.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So we won't allow transfer of FSM files if their size is below > HEAP_FSM_CREATION_THRESHOLD. What will be its behavior in link mode? > It seems that the old files will remain there. Will it create any > problem when we try to create the files via the new server, can you > once test this case? I tried upgrading in --link mode, and on the new cluster, enlarging the table past the threshold causes a new FSM to be created as expected. > Also, another case to think in this regard is the upgrade for standby > servers, if you read below paragraph from the user manual [1], you > will see what I am worried about? > > "What this does is to record the links created by pg_upgrade's link > mode that connect files in the old and new clusters on the primary > server. It then finds matching files in the standby's old cluster and > creates links for them in the standby's new cluster. Files that were > not linked on the primary are copied from the primary to the standby. > (They are usually small.)" > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html Trying this, I ran into a couple problems. I'm probably doing something wrong, but I can't help but think there's a pg_upgrade bug/feature I'm unaware of: I set up my test to have primary directory data1 and for the secondary standby/data1. I instructed pg_upgrade to upgrade data1 into data1u, and I tried the rsync recipe in the docs quoted above, and the upgraded standby wouldn't go into recovery. While debugging that, I found surprisingly that pg_upgrade also went further and upgraded standby/data1 into standby/data1u. I tried deleting standby/data1u before running the rsync command and still nothing. Because the upgraded secondary is non-functional, I can't really answer your question. Not sure if this is normal, but the pg_upgraded new cluster no longer had the replication slot. Re-adding it didn't allow my upgraded secondary to go into recovery, either. (I made sure to copy the recovery settings, so that can't be the problem) -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > I have a test for in-range and out-of-range for each relation fork. > I think the first two patches (a) removal of dead code in bootstrap and (b) the core patch to avoid creation of FSM file for the small table are good now. I have prepared the patches along with commit message. There is no change except for some changes in README and commit message of the second patch. Kindly let me know what you think about them? I think these two patches can go even without the upgrade patch (during pg_upgrade, conditionally skip transfer of FSMs.) which is still under discussion. However, I am not in a hurry if you or other thinks that upgrade patch must be committed along with the second patch. I think the upgrade patch is generally going on track but might need some more review. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think the first two patches (a) removal of dead code in bootstrap > and (b) the core patch to avoid creation of FSM file for the small > table are good now. I have prepared the patches along with commit > message. There is no change except for some changes in README and > commit message of the second patch. Kindly let me know what you think > about them? Good to hear! The additional language is fine. In "Once the FSM is created for heap", I would just change that to "...for a heap". > I think these two patches can go even without the upgrade patch > (during pg_upgrade, conditionally skip transfer of FSMs.) which is > still under discussion. However, I am not in a hurry if you or other > thinks that upgrade patch must be committed along with the second > patch. I think the upgrade patch is generally going on track but > might need some more review. The pg_upgrade piece is a nice-to-have feature and not essential, so can go in later. Additional review is also welcome. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Also, another case to think in this regard is the upgrade for standby > servers, if you read below paragraph from the user manual [1], you > will see what I am worried about? > > "What this does is to record the links created by pg_upgrade's link > mode that connect files in the old and new clusters on the primary > server. It then finds matching files in the standby's old cluster and > creates links for them in the standby's new cluster. Files that were > not linked on the primary are copied from the primary to the standby. > (They are usually small.)" > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html I am still not able to get the upgraded standby to go into recovery without resorting to pg_basebackup, but in another attempt to investigate your question I tried the following (data1 = old cluster, data2 = new cluster): mkdir -p data1 data2 standby echo 'heap' > data1/foo echo 'fsm' > data1/foo_fsm # simulate streaming replication rsync --archive data1 standby # simulate pg_upgrade, skipping FSM ln data1/foo -t data2/ rsync --archive --delete --hard-links --size-only --no-inc-recursive data1 data2 standby # result ls standby/data1 ls standby/data2 The result is that foo_fsm is not copied to standby/data2, contrary to what the docs above imply for other unlinked files. Can anyone shed light on this? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 23, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think the first two patches (a) removal of dead code in bootstrap > > and (b) the core patch to avoid creation of FSM file for the small > > table are good now. I have prepared the patches along with commit > > message. There is no change except for some changes in README and > > commit message of the second patch. Kindly let me know what you think > > about them? > > Good to hear! The additional language is fine. In "Once the FSM is > created for heap", I would just change that to "...for a heap". > Sure, apart from this I have run pgindent on the patches and make some changes accordingly. Latest patches attached (only second patch has some changes). I will take one more pass on Monday morning (28th Jan) and will commit unless you or others see any problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Also, another case to think in this regard is the upgrade for standby > > servers, if you read below paragraph from the user manual [1], you > > will see what I am worried about? > > > > "What this does is to record the links created by pg_upgrade's link > > mode that connect files in the old and new clusters on the primary > > server. It then finds matching files in the standby's old cluster and > > creates links for them in the standby's new cluster. Files that were > > not linked on the primary are copied from the primary to the standby. > > (They are usually small.)" > > > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html > > I am still not able to get the upgraded standby to go into recovery > without resorting to pg_basebackup, but in another attempt to > investigate your question I tried the following (data1 = old cluster, > data2 = new cluster): > > > mkdir -p data1 data2 standby > > echo 'heap' > data1/foo > echo 'fsm' > data1/foo_fsm > > # simulate streaming replication > rsync --archive data1 standby > > # simulate pg_upgrade, skipping FSM > ln data1/foo -t data2/ > > rsync --archive --delete --hard-links --size-only --no-inc-recursive > data1 data2 standby > > # result > ls standby/data1 > ls standby/data2 > > > The result is that foo_fsm is not copied to standby/data2, contrary to > what the docs above imply for other unlinked files. Can anyone shed > light on this? > Is foo_fsm present in standby/data1? I think what doc means to say is that it copies any unlinked files present in primary's new cluster (which in your case will be data2). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 24, 2019 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > Few comments related to pg_upgrade patch: 1. + if ((maps[mapnum].relkind != RELKIND_RELATION && + maps[mapnum].relkind != RELKIND_TOASTVALUE) || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); I think this check will needlessly be performed for future versions as well, say when wants to upgrade from PG12 to PG13. That might not create any problem, but let's try to be more precise. Can you try to rewrite this check? You might want to encapsulate it inside a function. I have thought of doing something similar to what we do for vm, see checks relate to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I guess for this patch it is not important to check catalog version as even if someone tries to upgrade to the same version. 2. transfer_relfile() { .. - /* Is it an extent, fsm, or vm file? */ - if (type_suffix[0] != '\0' || segno != 0) + /* Did file open fail? */ + if (stat(old_file, &statbuf) != 0) .. } So from now onwards, we will call stat for even 0th segment which means there is one additional system call for each relation, not sure if that matters, but I think there is no harm in once testing with a large number of relations say 10K to 50K relations which have FSM. The other alternative is we can fetch pg_class.relpages and rely on that to take this decision, but again if that is not updated, we might take the wrong decision. Anyone else has any thoughts on this point? 3. -static void +static Size transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit) If we decide to go with the approach proposed by you, we should add some comments atop this function for return value change? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > mkdir -p data1 data2 standby > > > > echo 'heap' > data1/foo > > echo 'fsm' > data1/foo_fsm > > > > # simulate streaming replication > > rsync --archive data1 standby > > > > # simulate pg_upgrade, skipping FSM > > ln data1/foo -t data2/ > > > > rsync --archive --delete --hard-links --size-only --no-inc-recursive > > data1 data2 standby > > > > # result > > ls standby/data1 > > ls standby/data2 > > > > > > The result is that foo_fsm is not copied to standby/data2, contrary to > > what the docs above imply for other unlinked files. Can anyone shed > > light on this? > > > > Is foo_fsm present in standby/data1? Yes it is. > I think what doc means to say is > that it copies any unlinked files present in primary's new cluster > (which in your case will be data2). In that case, I'm still confused why that doc says, "Unfortunately, rsync needlessly copies files associated with temporary and unlogged tables because these files don't normally exist on standby servers." I fail to see why the primary's new cluster would have these if they weren't linked. And in the case we're discussing here, the skipped FSMs won't be on data2, so won't end up in standby/data2. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think what doc means to say is > > that it copies any unlinked files present in primary's new cluster > > (which in your case will be data2). > > In that case, I'm still confused why that doc says, "Unfortunately, > rsync needlessly copies files associated with temporary and unlogged > tables because these files don't normally exist on standby servers." > I fail to see why the primary's new cluster would have these if they > weren't linked. > Why unlogged files won't be in primary's new cluster? After the upgrade, they should be present in a new cluster if they were present in the old cluster. > And in the case we're discussing here, the skipped > FSMs won't be on data2, so won't end up in standby/data2. > Right. I think we are safe with respect to rsync because I have seen that we do rewrite the vm files in link mode and rsync will copy them from primary's new cluster. I think you can try to address my other comments on your pg_upgrade patch. Once we agree on the code, we need to test below scenarios: (a) upgrade from all supported versions to the latest version (b) upgrade standby with and without using rsync. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 1. > + if ((maps[mapnum].relkind != RELKIND_RELATION && > + maps[mapnum].relkind != RELKIND_TOASTVALUE) || > + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || > + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) > + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); > > I think this check will needlessly be performed for future versions as > well, say when wants to upgrade from PG12 to PG13. That might not > create any problem, but let's try to be more precise. Can you try to > rewrite this check? You might want to encapsulate it inside a > function. I have thought of doing something similar to what we do for > vm, see checks relate to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I > guess for this patch it is not important to check catalog version as > even if someone tries to upgrade to the same version. Agreed, done for v19 (I've only attached the pg_upgrade patch). > 2. > transfer_relfile() > { > .. > - /* Is it an extent, fsm, or vm file? */ > - if (type_suffix[0] != '\0' || segno != 0) > + /* Did file open fail? */ > + if (stat(old_file, &statbuf) != 0) > .. > } > > So from now onwards, we will call stat for even 0th segment which > means there is one additional system call for each relation, not sure > if that matters, but I think there is no harm in once testing with a > large number of relations say 10K to 50K relations which have FSM. Performance testing is probably a good idea anyway, but I went ahead and implemented your next idea: > The other alternative is we can fetch pg_class.relpages and rely on > that to take this decision, but again if that is not updated, we might > take the wrong decision. We can think of it this way: Which is worse, 1. Transferring a FSM we don't need, or 2. Skipping a FSM we need I'd say #2 is worse. So, in v19 we check pg_class.relpages and if it's a heap and less than or equal the threshold we call stat on the 0th segment to verify. In the common case, the cost of the stat call is offset by not linking the FSM. Despite needing another pg_class field, I think this code is actually easier to read than my earlier versions. > 3. > -static void > +static Size > transfer_relfile(FileNameMap *map, const char *type_suffix, bool > vm_must_add_frozenbit) > > If we decide to go with the approach proposed by you, we should add > some comments atop this function for return value change? Done, as well as other comment edits. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 24, 2019 at 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think what doc means to say is > > > that it copies any unlinked files present in primary's new cluster > > > (which in your case will be data2). > > > > In that case, I'm still confused why that doc says, "Unfortunately, > > rsync needlessly copies files associated with temporary and unlogged > > tables because these files don't normally exist on standby servers." > > I fail to see why the primary's new cluster would have these if they > > weren't linked. > > > > Why unlogged files won't be in primary's new cluster? After the > upgrade, they should be present in a new cluster if they were present > in the old cluster. I assume they would be linked, however (I haven't checked this). I did think rewritten VM files would fall under this, but I was confused about unlogged files. > > And in the case we're discussing here, the skipped > > FSMs won't be on data2, so won't end up in standby/data2. > > > > Right. I think we are safe with respect to rsync because I have seen > that we do rewrite the vm files in link mode and rsync will copy them > from primary's new cluster. Okay. > I think you can try to address my other comments on your pg_upgrade > patch. Once we agree on the code, we need to test below scenarios: > (a) upgrade from all supported versions to the latest version > (b) upgrade standby with and without using rsync. Sounds good. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Performance testing is probably a good idea anyway, but I went ahead > and implemented your next idea: > > > The other alternative is we can fetch pg_class.relpages and rely on > > that to take this decision, but again if that is not updated, we might > > take the wrong decision. > > We can think of it this way: Which is worse, > 1. Transferring a FSM we don't need, or > 2. Skipping a FSM we need > > I'd say #2 is worse. > Agreed. > So, in v19 we check pg_class.relpages and if it's > a heap and less than or equal the threshold we call stat on the 0th > segment to verify. > Okay, but the way logic is implemented appears clumsy to me. @@ -234,16 +243,40 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro { /* File does not exist? That's OK, just return */ if (errno == ENOENT) - return; + return first_seg_size; else - pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\" to \"%s\"): %s\n", - map->nspname, map->relname, old_file, new_file, - strerror(errno)); + goto fatal; } /* If file is empty, just return */ if (statbuf.st_size == 0) - return; + return first_seg_size; + } + + /* Save size of the first segment of the main fork. */ + + else if (map->relpages <= HEAP_FSM_CREATION_THRESHOLD && + (map->relkind == RELKIND_RELATION || + map->relkind == RELKIND_TOASTVALUE)) + { + /* + * In this case, if pg_class.relpages is wrong, it's possible + * that a FSM will be skipped when we actually need it. To guard + * against this, we verify the size of the first segment. + */ + if (stat(old_file, &statbuf) != 0) + goto fatal; + else + first_seg_size = statbuf.st_size; + } + else + { + /* + * For indexes etc., we don't care if pg_class.relpages is wrong, + * since we always transfer their FSMs. For heaps, we might + * transfer a FSM when we don't need to, but this is harmless. + */ + first_seg_size = Min(map->relpages, RELSEG_SIZE) * BLCKSZ; } The function transfer_relfile has no clue about skipping of FSM stuff, but it contains comments about it. The check "if (map->relpages <= HEAP_FSM_CREATION_THRESHOLD ..." will needlessly be executed for each segment. I think there is some value in using the information from this function to skip fsm files, but the code doesn't appear to fit well, how about moving this check to new function new_cluster_needs_fsm()? > In the common case, the cost of the stat call is > offset by not linking the FSM. > Agreed. > Despite needing another pg_class field, > I think this code is actually easier to read than my earlier versions. > Yeah, the code appears cleaner from the last version, but I think we can do more in that regards. 
One more minor comment: snprintf(query + strlen(query), sizeof(query) - strlen(query), "SELECT all_rels.*, n.nspname, c.relname, " - " c.relfilenode, c.reltablespace, %s " + " c.relfilenode, c.reltablespace, c.relpages, c.relkind, %s " "FROM (SELECT * FROM regular_heap " " UNION ALL " " SELECT * FROM toast_heap " @@ -525,6 +530,8 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo) i_relname = PQfnumber(res, "relname"); i_relfilenode = PQfnumber(res, "relfilenode"); i_reltablespace = PQfnumber(res, "reltablespace"); + i_relpages = PQfnumber(res, "relpages"); + i_relkind = PQfnumber(res, "relkind"); i_spclocation = PQfnumber(res, "spclocation"); The order in which relkind and relpages is used in the above code is different from the order in which it is mentioned in the query, it won't matter, but keeping in order will make look code consistent. I have made this and some more minor code adjustments in the attached patch. If you like those, you can include them in the next version of your patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > So, in v19 we check pg_class.relpages and if it's > > a heap and less than or equal the threshold we call stat on the 0th > > segment to verify. > > > > Okay, but the way logic is implemented appears clumsy to me. > The function transfer_relfile has no clue about skipping of FSM stuff, > but it contains comments about it. Yeah, I wasn't entirely happy with how that turned out. > I think there is some value in using the information from > this function to skip fsm files, but the code doesn't appear to fit > well, how about moving this check to new function > new_cluster_needs_fsm()? For v21, new_cluster_needs_fsm() has all responsibility for obtaining the info it needs. I think this is much cleaner, but there is a small bit of code duplication since it now has to form the file name. One thing we could do is form the base old/new file names in transfer_single_new_db() and pass those to transfer_relfile(), which will only add suffixes and segment numbers. We could then pass the base old file name to new_cluster_needs_fsm() and use it as is. Not sure if that's worthwhile, though. > The order in which relkind and relpages is used in the above code is > different from the order in which it is mentioned in the query, it > won't matter, but keeping in order will make look code consistent. I > have made this and some more minor code adjustments in the attached > patch. If you like those, you can include them in the next version of > your patch. Okay, done. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
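To make the shape of that refactoring concrete, here is a rough standalone sketch of the check new_cluster_needs_fsm() is being asked to perform; it is not the v21 code. The helper name needs_fsm_transfer(), the example path, and the hard-coded threshold of 4 blocks are illustrative assumptions only. The logic itself, trust relpages when it says the heap is large and stat() the first main-fork segment when it says the heap is small, follows the patch hunks quoted earlier in the thread.

/*
 * Rough standalone sketch of the decision being moved into
 * new_cluster_needs_fsm() -- NOT the actual v21 code.  The helper name,
 * the example path, and the hard-coded threshold of 4 blocks are
 * assumptions for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

#define BLCKSZ 8192
#define HEAP_FSM_CREATION_THRESHOLD 4	/* blocks (assumed value) */

typedef struct
{
	char		relkind;		/* 'r' = heap, 't' = TOAST, 'i' = index, ... */
	int			relpages;		/* possibly-stale pg_class.relpages */
	const char *main_fork_path; /* segment 0 of the main fork */
} RelFileInfo;

/* Does the new cluster need a FSM for this relation? */
static bool
needs_fsm_transfer(const RelFileInfo *rel)
{
	struct stat st;

	/* Indexes etc. always get their FSM transferred. */
	if (rel->relkind != 'r' && rel->relkind != 't')
		return true;

	/* The catalog already says the heap is above the threshold. */
	if (rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
		return true;

	/*
	 * relpages claims the heap is small, but it may be stale, so verify
	 * with the real size of the first segment.  (The real patch treats a
	 * stat() failure as fatal; here we simply err on the side of
	 * transferring the FSM.)
	 */
	if (stat(rel->main_fork_path, &st) != 0)
		return true;

	return st.st_size > (off_t) HEAP_FSM_CREATION_THRESHOLD * BLCKSZ;
}

int
main(void)
{
	RelFileInfo small_heap = {'r', 2, "/tmp/pgdata/base/1/16384"};

	printf("transfer FSM? %s\n", needs_fsm_transfer(&small_heap) ? "yes" : "no");
	return 0;
}

Deciding this once per relation, before the per-segment loop, also addresses the earlier complaint that the threshold check was being repeated for every segment.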
On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Sure, apart from this I have run pgindent on the patches and make some > changes accordingly. Latest patches attached (only second patch has > some changes). I will take one more pass on Monday morning (28th Jan) > and will commit unless you or others see any problem. > Pushed these two patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Sure, apart from this I have run pgindent on the patches and make some > > changes accordingly. Latest patches attached (only second patch has > > some changes). I will take one more pass on Monday morning (28th Jan) > > and will commit unless you or others see any problem. > > Pushed these two patches. Thank you for your input and detailed review! Thank you Mithun for testing! -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 28, 2019 at 9:16 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Sure, apart from this I have run pgindent on the patches and make some > > > changes accordingly. Latest patches attached (only second patch has > > > some changes). I will take one more pass on Monday morning (28th Jan) > > > and will commit unless you or others see any problem. > > > > Pushed these two patches. > > Thank you for your input and detailed review! Thank you Mithun for testing! > There are a few buildfarm failures due to this commit, see my email on pgsql-committers. If you have time, you can also once look into those. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > There are a few buildfarm failures due to this commit, see my email on > pgsql-committers. If you have time, you can also once look into > those. I didn't see anything in common with the configs of the failed members. None have a non-default BLCKSZ that I can see. Looking at this typical example from woodlouse: ================== pgsql.build/src/test/regress/regression.diffs ================== --- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out 2019-01-28 04:43:09.031456700 +0100 +++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out 2019-01-28 05:06:20.351100400 +0100 @@ -26,7 +26,7 @@ pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; heap_size | fsm_size -----------+---------- - 24576 | 0 + 32768 | 0 (1 row) ***It seems like the relation extended when the new records should have gone into block 0. -- Extend table with enough blocks to exceed the FSM threshold @@ -56,7 +56,7 @@ SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; fsm_size ---------- - 16384 + 24576 (1 row) ***And here it seems vacuum didn't truncate the FSM. I wonder if the heap didn't get truncated either. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 28, 2019 at 10:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are a few buildfarm failures due to this commit, see my email on > > pgsql-committers. If you have time, you can also once look into > > those. > > I didn't see anything in common with the configs of the failed > members. None have a non-default BLCKSZ that I can see. > > Looking at this typical example from woodlouse: > > ================== pgsql.build/src/test/regress/regression.diffs > ================== > --- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out > 2019-01-28 04:43:09.031456700 +0100 > +++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out > 2019-01-28 05:06:20.351100400 +0100 > @@ -26,7 +26,7 @@ > pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > heap_size | fsm_size > -----------+---------- > - 24576 | 0 > + 32768 | 0 > (1 row) > > ***It seems like the relation extended when the new records should > have gone into block 0. > > -- Extend table with enough blocks to exceed the FSM threshold > @@ -56,7 +56,7 @@ > SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > fsm_size > ---------- > - 16384 > + 24576 > (1 row) > > ***And here it seems vacuum didn't truncate the FSM. I wonder if the > heap didn't get truncated either. > Yeah, it seems to me that vacuum is not able to truncate the relation, see my latest reply on another thread [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1JntHd7X6dLJVPGYV917HejjhbMKXn9m_RnnCE162LbLA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 10:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are a few buildfarm failures due to this commit, see my email on > > pgsql-committers. If you have time, you can also once look into > > those. > > I didn't see anything in common with the configs of the failed > members. None have a non-default BLCKSZ that I can see. > I have done an analysis of the different failures on buildfarm. 1. @@ -26,7 +26,7 @@ pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; heap_size | fsm_size -----------+---------- - 24576 | 0 + 32768 | 0 (1 row) -- Extend table with enough blocks to exceed the FSM threshold @@ -56,7 +56,7 @@ SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; fsm_size ---------- - 16384 + 24576 (1 row) As discussed on another thread, this seems to be due to the reason that a parallel auto-analyze doesn't allow vacuum to remove dead-row versions. To fix this, I think we should avoid having a dependency on vacuum to remove dead rows. 2. @@ -15,13 +15,9 @@ SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; ERROR: block number 100 is out of range for relation "test_rel_forks" SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; - fsm_0 -------- - 8192 -(1 row) - +ERROR: could not open file "base/50769/50798_fsm": No such file or directory SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; -ERROR: block number 10 is out of range for relation "test_rel_forks" +ERROR: could not open file "base/50769/50798_fsm": No such file or directory This indicates that even though the Vacuum is executed, but the FSM doesn't get created. This could be due to different BLCKSZ, but the failed machines don't seem to have a non-default value of it. I am not sure why this could happen, maybe we need to check once in the failed regression database to see the size of relation? 3. Failure on 'mantid' 2019-01-28 00:13:55.191 EST [123979] 001_pgbench_with_server.pl LOG: statement: CREATE UNLOGGED TABLE insert_tbl (id serial primary key); 2019-01-28 00:13:55.218 EST [123982] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); 2019-01-28 00:13:55.219 EST [123983] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); 2019-01-28 00:13:55.220 EST [123984] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); .. .. TRAP: FailedAssertion("!((rel->rd_rel->relkind == 'r' || rel->rd_rel->relkind == 't') && fsm_local_map.map[oldPage] == 0x01)", File: "freespace.c", Line: 223) I think this can happen if we forget to clear the local map after we get the block with space in function RelationGetBufferForTuple(). I see the race condition in the code where that can happen. Say, we tried all the blocks in the local map and then tried to extend the relation and we didn't get ConditionalLockRelationForExtension, in the meantime, another backend has extended the relation and updated the FSM (via RelationAddExtraBlocks). Now, when the backend that didn't get the extension lock will get the target block from FSM which will be greater than HEAP_FSM_CREATION_THRESHOLD. 
Next, it will find that the block can be used to insert a new row and return the buffer, but won't clear the local map due to below condition in code: @@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len, + + /* + * In case we used an in-memory map of available blocks, reset it + * for next use. + */ + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) + FSMClearLocalMap(); + I think here you need to clear the map if it exists or clear it unconditionally, the earlier one would be better. This test gets executed concurrently by 5 clients, so it can hit the above race condition. 4. Failure on jacana: --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out 2018-09-26 17:53:33 -0400 +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out 2019-01-27 23:14:35 -0500 @@ -252,332 +252,7 @@ ('(0,100)(0,infinity)'), ('(-infinity,0)(0,infinity)'), ('(-infinity,-infinity)(infinity,infinity)'); -SET enable_seqscan = false; -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; .. .. TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", Line: 1118) .. 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process (PID 14388) exited with exit code 3 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process was running: INSERT INTO box_temp VALUES (NULL), I think the reason for this failure is same as previous (as mentioned in point-3), but this can happen in a different way. Say, we have searched the local map and then try to extend a relation 'X' and in the meantime, another backend has extended such that it creates FSM. Now, we will reuse that page and won't clear local map. Now, say we try to insert in relation 'Y' which doesn't have FSM. It will try to set the local map and will find that it already exists, so will fail. Now, the question is how it can happen in this box.sql test. I guess that is happening for some system table which is being populated by Create Index statement executed just before the failing Insert. I think both 3 and 4 are timing issues, so we didn't got in our local regression runs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
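Since the two scenarios above are easy to lose track of, here is a toy single-process model of the backend-local block map; it is not PostgreSQL code, and every name and size in it is invented for illustration. It shows how returning a buffer obtained from a concurrently created FSM without clearing the local map leaves stale state behind, so that the next attempt to build a map for another small relation trips the same kind of assertion seen on mantid and jacana.

/*
 * Toy model of the backend-local block map -- NOT PostgreSQL code.
 * All names and sizes here are invented for illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define LOCAL_MAP_BLOCKS 4		/* stand-in for HEAP_FSM_CREATION_THRESHOLD */

typedef struct
{
	int			nblocks;		/* 0 means "no map in use" */
	bool		avail[LOCAL_MAP_BLOCKS];
} LocalMap;

static LocalMap local_map;		/* one per backend in the real code */

static void
set_local_map(int rel_nblocks)
{
	/* Mirrors the failed assertion: any previous map must have been cleared. */
	assert(local_map.nblocks == 0);
	local_map.nblocks = rel_nblocks;
	for (int i = 0; i < rel_nblocks; i++)
		local_map.avail[i] = true;
}

static void
clear_local_map(void)
{
	memset(&local_map, 0, sizeof(local_map));
}

/*
 * Insert path for a relation that is below the threshold when we look at
 * it; fsm_appeared simulates another backend extending the relation and
 * creating its FSM in the meantime.
 */
static void
insert_into_small_rel(const char *relname, bool fsm_appeared)
{
	set_local_map(LOCAL_MAP_BLOCKS);

	/* ... suppose every mapped block was tried and none had space ... */
	for (int i = 0; i < LOCAL_MAP_BLOCKS; i++)
		local_map.avail[i] = false;

	if (fsm_appeared)
	{
		/*
		 * The target block now comes from the brand-new FSM and is beyond
		 * the threshold, so the buggy code returns without clearing the map.
		 */
		printf("%s: used a block from the new FSM, local map left stale\n",
			   relname);
		return;
	}

	clear_local_map();
	printf("%s: extended the relation, local map cleared\n", relname);
}

int
main(void)
{
	insert_into_small_rel("X", true);	/* leaves a stale map behind */
	insert_into_small_rel("Y", false);	/* aborts in set_local_map() */
	return 0;
}

Built with assertions enabled, the second insert aborts, which is essentially the failure mode described above.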
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > There are a few buildfarm failures due to this commit, see my email on
> > > pgsql-committers. If you have time, you can also once look into
> > > those.
> >
> > I didn't see anything in common with the configs of the failed
> > members. None have a non-default BLCKSZ that I can see.
> >
>
> I have done an analysis of the different failures on buildfarm.
>
>
> 2.
> @@ -15,13 +15,9 @@
> SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
> ERROR: block number 100 is out of range for relation "test_rel_forks"
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> - fsm_0
> --------
> - 8192
> -(1 row)
> -
> +ERROR: could not open file "base/50769/50798_fsm": No such file or directory
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> -ERROR: block number 10 is out of range for relation "test_rel_forks"
> +ERROR: could not open file "base/50769/50798_fsm": No such file or directory
>
> This indicates that even though the Vacuum is executed, but the FSM
> doesn't get created. This could be due to different BLCKSZ, but the
> failed machines don't seem to have a non-default value of it. I am
> not sure why this could happen, maybe we need to check once in the
> failed regression database to see the size of relation?
>
This symptom is shown by the following buildfarm critters:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-01-28%2005%3A05%3A22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-01-28%2003%3A20%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-01-28%2003%3A13%3A47
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-01-28%2003%3A07%3A39
All of these seem to run with fsync=off. Is it possible that vacuum has updated the FSM, but the same is not synced to disk, and when we try to read it, we don't get the required page? This is just a guess.
I have checked all the buildfarm failures and I see only 4 symptoms, for which I have sent some initial analysis. I think you can also cross-verify the same.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> All of these seems to run with fsync=off. Is it possible that Amit> vacuum has updated FSM, but the same is not synced to disk and Amit> when we try to read it, we didn't get the required page? No. fsync never affects what programs see while the system is running, only what happens after an OS crash. -- Andrew (irc:RhodiumToad)
On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > 1. > @@ -26,7 +26,7 @@ > pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > heap_size | fsm_size > -----------+---------- > - 24576 | 0 > + 32768 | 0 > (1 row) > > -- Extend table with enough blocks to exceed the FSM threshold > @@ -56,7 +56,7 @@ > SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > fsm_size > ---------- > - 16384 > + 24576 > (1 row) > > > As discussed on another thread, this seems to be due to the reason > that a parallel auto-analyze doesn't allow vacuum to remove dead-row > versions. To fix this, I think we should avoid having a dependency on > vacuum to remove dead rows. Ok, to make the first test here more reliable I will try Andrew's idea to use fillfactor to save free space. As I said earlier, I think that second test isn't helpful and can be dropped. > 2. > @@ -15,13 +15,9 @@ > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > ERROR: block number 100 is out of range for relation "test_rel_forks" > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > - fsm_0 > -------- > - 8192 > -(1 row) > - > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > -ERROR: block number 10 is out of range for relation "test_rel_forks" > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > This indicates that even though the Vacuum is executed, but the FSM > doesn't get created. This could be due to different BLCKSZ, but the > failed machines don't seem to have a non-default value of it. I am > not sure why this could happen, maybe we need to check once in the > failed regression database to see the size of relation? I'm also having a hard time imagining why this failed. Just in case, we could return ctid in a plpgsql loop and stop as soon as we see the 5th block. I've done that for some tests during development and is a safer method anyway. > <timing failures in 3 and 4> > > @@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len, > + > + /* > + * In case we used an in-memory map of available blocks, reset it > + * for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + FSMClearLocalMap(); > + > > I think here you need to clear the map if it exists or clear it > unconditionally, the earlier one would be better. Ok, maybe all callers should call it unconditonally, but within the function, check "if (FSM_LOCAL_MAP_EXISTS)"? Thanks for investigating the failures -- I'm a bit pressed for time this week. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
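A minimal sketch of that idea (assumed definitions, not the committed code): the clear function checks for itself whether a local map is in use, so every caller can invoke it unconditionally without worrying about which path produced the buffer.

/*
 * Sketch only: FSM_LOCAL_MAP_EXISTS is modeled here as "nblocks > 0",
 * and the struct layout is assumed for illustration.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
	int			nblocks;		/* 0 means no local map in use */
	char		map[4];
} LocalMap;

static LocalMap fsm_local_map;

#define FSM_LOCAL_MAP_EXISTS	(fsm_local_map.nblocks > 0)

static void
FSMClearLocalMap(void)
{
	if (FSM_LOCAL_MAP_EXISTS)
		memset(&fsm_local_map, 0, sizeof(fsm_local_map));
}

int
main(void)
{
	/*
	 * Callers (e.g. the tail of RelationGetBufferForTuple) now call this
	 * unconditionally; it is a cheap no-op when no map was built.
	 */
	FSMClearLocalMap();			/* no map: harmless no-op */
	fsm_local_map.nblocks = 4;	/* pretend a map was built and abandoned */
	FSMClearLocalMap();			/* stale map is wiped */
	printf("nblocks after clear: %d\n", fsm_local_map.nblocks);
	return 0;
}

With the check inside the function, callers no longer need to re-derive the threshold condition, which is what went wrong in the race described earlier.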
On Tue, Jan 29, 2019 at 12:37 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 2. > > @@ -15,13 +15,9 @@ > > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > ERROR: block number 100 is out of range for relation "test_rel_forks" > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > > - fsm_0 > > -------- > > - 8192 > > -(1 row) > > - > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > -ERROR: block number 10 is out of range for relation "test_rel_forks" > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > > This indicates that even though the Vacuum is executed, but the FSM > > doesn't get created. This could be due to different BLCKSZ, but the > > failed machines don't seem to have a non-default value of it. I am > > not sure why this could happen, maybe we need to check once in the > > failed regression database to see the size of relation? > > I'm also having a hard time imagining why this failed. Just in case, > we could return ctid in a plpgsql loop and stop as soon as we see the > 5th block. I've done that for some tests during development and is a > safer method anyway. > I think we can devise some concrete way, but it is better first we try to understand why it failed, otherwise there is always a chance that we will repeat the mistake in some other case. I think we have no other choice, but to request the buildfarm owners to either give us the access to see what happens or help us in investigating the problem. The four buildfarms where it failed were lapwing, locust, dromedary, prairiedog. Among these, the owner of last two is Tom Lane and others I don't recognize. Tom, Andrew, can you help us in getting the access of one of those four? Yet another alternative is the owner can apply the patch attached (this is same what got committed) or reset to commit ac88d2962a and execute below statements and share the results: CREATE EXTENSION pageinspect; CREATE TABLE test_rel_forks (a int); INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i; VACUUM test_rel_forks; SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0; SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0; SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1; If the above statements give error: "ERROR: could not open file ...", then run: Analyze test_rel_forks; Select oid, relname, relpages, reltuples from pg_class where relname like 'test%'; The result of the above tests will tell us whether there are 5 pages in the table or not. If the table contains 5 pages and throws an error, then there is some bug in our code, otherwise, there is something specific to those systems where the above insert doesn't result in 5 pages. > > I think here you need to clear the map if it exists or clear it > > unconditionally, the earlier one would be better. > > Ok, maybe all callers should call it unconditonally, but within the > function, check "if (FSM_LOCAL_MAP_EXISTS)"? > Sounds sensible. I think we should try to reproduce these failures, for ex. 
for pgbench failure, we can try the same test with more clients. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 29, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 12:37 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > I think here you need to clear the map if it exists or clear it > > > unconditionally, the earlier one would be better. > > > > Ok, maybe all callers should call it unconditonally, but within the > > function, check "if (FSM_LOCAL_MAP_EXISTS)"? > > > > Sounds sensible. I think we should try to reproduce these failures, > for ex. for pgbench failure, we can try the same test with more > clients. > I am able to reproduce this by changing pgbench test as below: --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -56,9 +56,9 @@ $node->safe_psql('postgres', 'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); '); pgbench( - '--no-vacuum --client=5 --protocol=prepared --transactions=25', + '--no-vacuum --client=10 --protocol=prepared --transactions=25', 0, - [qr{processed: 125/125}], + [qr{processed: 250/250}], You can find this change in attached patch. Then, I ran the make check in src/bin/pgbench multiple times using test_conc_insert.sh. You can vary the number of times the test should run, if you are not able to reproduce it with this. The attached patch (clear_local_map_if_exists_1.patch) atop the main patch fixes the issue for me. Kindly verify the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 12:37 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 2. > > > @@ -15,13 +15,9 @@ > > > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > > ERROR: block number 100 is out of range for relation "test_rel_forks" > > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > > > - fsm_0 > > > -------- > > > - 8192 > > > -(1 row) > > > - > > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > > -ERROR: block number 10 is out of range for relation "test_rel_forks" > > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > > > > This indicates that even though the Vacuum is executed, but the FSM > > > doesn't get created. This could be due to different BLCKSZ, but the > > > failed machines don't seem to have a non-default value of it. I am > > > not sure why this could happen, maybe we need to check once in the > > > failed regression database to see the size of relation? > > > > I'm also having a hard time imagining why this failed. Just in case, > > we could return ctid in a plpgsql loop and stop as soon as we see the > > 5th block. I've done that for some tests during development and is a > > safer method anyway. > > > > I think we can devise some concrete way, but it is better first we try > to understand why it failed, otherwise there is always a chance that > we will repeat the mistake in some other case. I think we have no > other choice, but to request the buildfarm owners to either give us > the access to see what happens or help us in investigating the > problem. The four buildfarms where it failed were lapwing, locust, > dromedary, prairiedog. Among these, the owner of last two is Tom > Lane and others I don't recognize. Tom, Andrew, can you help us in > getting the access of one of those four? Yet another alternative is > the owner can apply the patch attached (this is same what got > committed) or reset to commit ac88d2962a and execute below statements > and share the results: > > CREATE EXTENSION pageinspect; > > CREATE TABLE test_rel_forks (a int); > INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i; > VACUUM test_rel_forks; > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1; > > If the above statements give error: "ERROR: could not open file ...", then run: > Analyze test_rel_forks; > Select oid, relname, relpages, reltuples from pg_class where relname > like 'test%'; > > The result of the above tests will tell us whether there are 5 pages > in the table or not. If the table contains 5 pages and throws an > error, then there is some bug in our code, otherwise, there is > something specific to those systems where the above insert doesn't > result in 5 pages. I'd suspect the alignment of integer. 
In my environment, the tuple's actual size is 28 bytes but the aligned size is 32 bytes (= MAXALIGN(28)), so we can store 226 tuples in a single page. But if MAXALIGN(28) = 28, then we can store 255 tuples, and 1000 tuples fit within 4 pages. The MAXALIGN of the four buildfarms seems to be 4 according to the configure script, so MAXALIGN(28) might be 28 on these buildfarms.
configure:16816: checking alignment of short
configure:16839: result: 2
configure:16851: checking alignment of int
configure:16874: result: 4
configure:16886: checking alignment of long
configure:16909: result: 4
configure:16922: checking alignment of long long int
configure:16945: result: 4
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
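That arithmetic is easy to double-check with a few lines of C. The 24-byte page header and 4-byte line pointers are the standard page-layout numbers and are assumptions of this sketch, as is the 28-byte tuple size quoted above.

/*
 * Back-of-the-envelope check of the alignment math above -- not
 * PostgreSQL code.  Assumes an 8192-byte page with a 24-byte page header,
 * a 4-byte line pointer per tuple, and the 28-byte tuple mentioned above.
 */
#include <stdio.h>

#define BLCKSZ       8192
#define PAGE_HEADER  24
#define ITEMID_SIZE  4
#define TUPLE_SIZE   28

static int
tuples_per_page(int maxalign)
{
	int			aligned = ((TUPLE_SIZE + maxalign - 1) / maxalign) * maxalign;

	return (BLCKSZ - PAGE_HEADER) / (aligned + ITEMID_SIZE);
}

int
main(void)
{
	int			n8 = tuples_per_page(8);	/* MAXALIGN = 8 */
	int			n4 = tuples_per_page(4);	/* MAXALIGN = 4 */

	/* Prints 226 and 255; 1000 rows then need 5 pages vs. 4 pages. */
	printf("MAXALIGN 8: %d tuples/page, %d pages for 1000 rows\n",
		   n8, (1000 + n8 - 1) / n8);
	printf("MAXALIGN 4: %d tuples/page, %d pages for 1000 rows\n",
		   n4, (1000 + n4 - 1) / n4);
	return 0;
}

With MAXALIGN 8 the heap needs 5 pages for 1000 rows and crosses the threshold, while with MAXALIGN 4 it stays at 4 pages, which would explain why the FSM never appeared on those members.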
On Tue, Jan 29, 2019 at 5:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I'd suspect the alignment of integer. In my environemnt, the tuple > actual size is 28 bytes but the aligned size is 32 bytes (= > MAXALIGN(28)), so we can store 226 tuples to single page. But if > MAXALIGN(28) = 28 then we can store 255 tuples and 1000 tuples fits > within 4 pages. The MAXALIGN of four buildfarms seem 4 accroding to > the configure script so MAXALIGN(28) might be 28 on these buildfarms. > Good finding. I was also wondering along these lines and wanted to verify. Thanks a lot. So, this clearly states why we have a second failure in my email above [1]. I think this means for the fsm test also we have to be careful when relying on the number of pages in the test. I think now we have found the reasons and solutions for the first three problems mentioned in my email [1]. For the problem-4 (Failure on jacana:), I have speculated some theory, but not sure how can we confirm? Can we try the patch on Jacana before considering the patch for commit? Is there any other way we can replicate that error? [1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > You can find this change in attached patch. Then, I ran the make > check in src/bin/pgbench multiple times using test_conc_insert.sh. > You can vary the number of times the test should run, if you are not > able to reproduce it with this. > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > patch fixes the issue for me. Kindly verify the same. I got one failure in 50 runs. With the new patch, I didn't get any failures in 300 runs. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > You can find this change in attached patch. Then, I ran the make > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > You can vary the number of times the test should run, if you are not > > able to reproduce it with this. > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > patch fixes the issue for me. Kindly verify the same. > > I got one failure in 50 runs. With the new patch, I didn't get any > failures in 300 runs. > Thanks for verification. I have included it in the attached patch and I have also modified the page.sql test to have enough number of pages in relation so that FSM will get created irrespective of alignment boundaries. Masahiko San, can you verify if this now works for you? There are two more failures which we need to something about. 1. Make fsm.sql independent of vacuum without much losing on coverage of newly added code. John, I guess you have an idea, see if you can take care of it, otherwise, I will see what I can do for it. 2. I still could not figure out how to verify if the failure on Jacana will be fixed. I have posted some theory above and the attached patch has a solution for it, but I think it would be better if find out some way to verify the same. Note - you might see some cosmetic changes in freespace.c due to pgindent. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > There are two more failures which we need to something about. > 1. Make fsm.sql independent of vacuum without much losing on coverage > of newly added code. John, I guess you have an idea, see if you can > take care of it, otherwise, I will see what I can do for it. I've attached a patch that applies on top of v19 that uses Andrew Gierth's idea to use fillfactor to control free space. I've also removed tests that relied on truncation and weren't very useful to begin with. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Jan 30, 2019 at 3:26 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are two more failures which we need to something about. > > 1. Make fsm.sql independent of vacuum without much losing on coverage > > of newly added code. John, I guess you have an idea, see if you can > > take care of it, otherwise, I will see what I can do for it. > > I've attached a patch that applies on top of v19 that uses Andrew > Gierth's idea to use fillfactor to control free space. I've also > removed tests that relied on truncation and weren't very useful to > begin with. > This is much better than the earlier version of test and there is no dependency on the vacuum. However, I feel still there is some dependency on how the rows will fit in a page and we have seen some related failures due to alignment stuff. By looking at the test, I can't envision any such problem, but how about if we just write some simple tests where we can check that the FSM won't be created for very small number of records say one or two and then when we increase the records FSM gets created, here if we want, we can even use vacuum to ensure FSM gets created. Once we are sure that the main patch passes all the buildfarm tests, we can extend the test to something advanced as you are proposing now. I think that will reduce the chances of failure, what do you think? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > This is much better than the earlier version of test and there is no > dependency on the vacuum. However, I feel still there is some > dependency on how the rows will fit in a page and we have seen some > related failures due to alignment stuff. By looking at the test, I > can't envision any such problem, but how about if we just write some > simple tests where we can check that the FSM won't be created for very > small number of records say one or two and then when we increase the > records FSM gets created, here if we want, we can even use vacuum to > ensure FSM gets created. Once we are sure that the main patch passes > all the buildfarm tests, we can extend the test to something advanced > as you are proposing now. I think that will reduce the chances of > failure, what do you think? That's probably a good idea to limit risk. I just very basic tests now, and vacuum before every relation size check to make sure any FSM extension (whether desired or not) is invoked. Also, in my last patch I forgot to implement explicit checks of the block number instead of assuming how many rows will fit on a page. I've used a plpgsql code block to do this. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > You can find this change in attached patch. Then, I ran the make > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > You can vary the number of times the test should run, if you are not > > > able to reproduce it with this. > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > patch fixes the issue for me. Kindly verify the same. > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > failures in 300 runs. > > > > Thanks for verification. I have included it in the attached patch and > I have also modified the page.sql test to have enough number of pages > in relation so that FSM will get created irrespective of alignment > boundaries. Masahiko San, can you verify if this now works for you? > Thank you for updating the patch! The modified page.sql test could fail if the block size is more than 8kB? We can ensure the number of pages are more than 4 by checking it and adding more data if no enough but I'm really not sure we should care the bigger-block size cases. However maybe it's good to check the number of pages after insertion so that we can break down the issue in case the test failed again. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This is much better than the earlier version of test and there is no > > dependency on the vacuum. However, I feel still there is some > > dependency on how the rows will fit in a page and we have seen some > > related failures due to alignment stuff. By looking at the test, I > > can't envision any such problem, but how about if we just write some > > simple tests where we can check that the FSM won't be created for very > > small number of records say one or two and then when we increase the > > records FSM gets created, here if we want, we can even use vacuum to > > ensure FSM gets created. Once we are sure that the main patch passes > > all the buildfarm tests, we can extend the test to something advanced > > as you are proposing now. I think that will reduce the chances of > > failure, what do you think? > > That's probably a good idea to limit risk. I just very basic tests > now, and vacuum before every relation size check to make sure any FSM > extension (whether desired or not) is invoked. Also, in my last patch > I forgot to implement explicit checks of the block number instead of > assuming how many rows will fit on a page. I've used a plpgsql code > block to do this. > -- Extend table with enough blocks to exceed the FSM threshold -- FSM is created and extended to 3 blocks The second comment line seems redundant to me, so I have removed that and integrated it in the main patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > You can find this change in attached patch. Then, I ran the make > > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > > You can vary the number of times the test should run, if you are not > > > > able to reproduce it with this. > > > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > > patch fixes the issue for me. Kindly verify the same. > > > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > > failures in 300 runs. > > > > > > > Thanks for verification. I have included it in the attached patch and > > I have also modified the page.sql test to have enough number of pages > > in relation so that FSM will get created irrespective of alignment > > boundaries. Masahiko San, can you verify if this now works for you? > > > > Thank you for updating the patch! > > The modified page.sql test could fail if the block size is more than > 8kB? That's right, but I don't think current regression tests will work for block size greater than 8KB. I have tried with 16 and 32 as block size, there were few failures on the head itself. > We can ensure the number of pages are more than 4 by checking it > and adding more data if no enough but I'm really not sure we should > care the bigger-block size cases. > Yeah, I am not sure either. I think as this is an existing test, we should not try to change it too much. However, if both you and John feel it is better to change, we can go with that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > That's probably a good idea to limit risk. I just very basic tests > > now, and vacuum before every relation size check to make sure any FSM > > extension (whether desired or not) is invoked. Also, in my last patch > > I forgot to implement explicit checks of the block number instead of > > assuming how many rows will fit on a page. I've used a plpgsql code > > block to do this. > > > > -- Extend table with enough blocks to exceed the FSM threshold > -- FSM is created and extended to 3 blocks > > The second comment line seems redundant to me, so I have removed that > and integrated it in the main patch. FYI, the second comment is still present in v20. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > The modified page.sql test could fail if the block size is more than > > 8kB? > > That's right, but I don't think current regression tests will work for > block size greater than 8KB. I have tried with 16 and 32 as block > size, there were few failures on the head itself. > > > We can ensure the number of pages are more than 4 by checking it > > and adding more data if no enough but I'm really not sure we should > > care the bigger-block size cases. > > > > Yeah, I am not sure either. I think as this is an existing test, we > should not try to change it too much. However, if both you and John > feel it is better to change, we can go with that. I have an idea -- instead of adding a bunch of records and hoping that the relation size and free space is consistent across platforms, how about we revert to the original test input, and add a BRIN index? That should have a FSM even with one record. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > That's probably a good idea to limit risk. I just very basic tests > > > now, and vacuum before every relation size check to make sure any FSM > > > extension (whether desired or not) is invoked. Also, in my last patch > > > I forgot to implement explicit checks of the block number instead of > > > assuming how many rows will fit on a page. I've used a plpgsql code > > > block to do this. > > > > > > > -- Extend table with enough blocks to exceed the FSM threshold > > -- FSM is created and extended to 3 blocks > > > > The second comment line seems redundant to me, so I have removed that > > and integrated it in the main patch. > > FYI, the second comment is still present in v20. > oops, forgot to include in commit after making a change, done now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jan 31, 2019 at 2:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > The modified page.sql test could fail if the block size is more than > > > 8kB? > > > > That's right, but I don't think current regression tests will work for > > block size greater than 8KB. I have tried with 16 and 32 as block > > size, there were few failures on the head itself. > > > > > We can ensure the number of pages are more than 4 by checking it > > > and adding more data if no enough but I'm really not sure we should > > > care the bigger-block size cases. > > > > > > > Yeah, I am not sure either. I think as this is an existing test, we > > should not try to change it too much. However, if both you and John > > feel it is better to change, we can go with that. > > I have an idea -- instead of adding a bunch of records and hoping that > the relation size and free space is consistent across platforms, how > about we revert to the original test input, and add a BRIN index? That > should have a FSM even with one record. > Why would BRIN index allow having FSM for heap relation? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have an idea -- instead of adding a bunch of records and hoping that > > the relation size and free space is consistent across platforms, how > > about we revert to the original test input, and add a BRIN index? That > > should have a FSM even with one record. > > > > Why would BRIN index allow having FSM for heap relation? Oops, I forgot this file is for testing heaps only. That said, we could possibly put most of the FSM tests such as SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); into brin.sql since we know a non-empty BRIN index will have a FSM. And in page.sql we could just have a test that the table has no FSM. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I have an idea -- instead of adding a bunch of records and hoping that > > > the relation size and free space is consistent across platforms, how > > > about we revert to the original test input, and add a BRIN index? That > > > should have a FSM even with one record. > > > > > > > Why would BRIN index allow having FSM for heap relation? > > Oops, I forgot this file is for testing heaps only. That said, we > could possibly put most of the FSM tests such as > > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > > into brin.sql since we know a non-empty BRIN index will have a FSM. As in the attached. Applies on top of v20. First to revert to HEAD, second to move FSM tests to brin.sql. This is a much less invasive and more readable patch, in addition to being hopefully more portable. > And in page.sql we could just have a test that the table has no FSM. This is not possible, since we don't know the relfilenode for the error text, and it's not important. Better to have everything in brin.sql. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > You can find this change in attached patch. Then, I ran the make > > > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > > > You can vary the number of times the test should run, if you are not > > > > > able to reproduce it with this. > > > > > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > > > patch fixes the issue for me. Kindly verify the same. > > > > > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > > > failures in 300 runs. > > > > > > > > > > Thanks for verification. I have included it in the attached patch and > > > I have also modified the page.sql test to have enough number of pages > > > in relation so that FSM will get created irrespective of alignment > > > boundaries. Masahiko San, can you verify if this now works for you? > > > > > > > Thank you for updating the patch! > > > > The modified page.sql test could fail if the block size is more than > > 8kB? > > That's right, but I don't think current regression tests will work for > block size greater than 8KB. I have tried with 16 and 32 as block > size, there were few failures on the head itself. Understood. That means that no build farm configures other block size than 8kB. > > > We can ensure the number of pages are more than 4 by checking it > > and adding more data if no enough but I'm really not sure we should > > care the bigger-block size cases. > > > > Yeah, I am not sure either. I think as this is an existing test, we > should not try to change it too much. However, if both you and John > feel it is better to change, we can go with that. > So I think the patch you proposed looks good to me but how about adding the check whether the table is more than 4 pages? For example, SELECT (pg_relation_size('test_rel_forks') / current_setting('block_size')::int) > 4; Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jan 31, 2019 at 7:23 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have an idea -- instead of adding a bunch of records and hoping that > > > > the relation size and free space is consistent across platforms, how > > > > about we revert to the original test input, and add a BRIN index? That > > > > should have a FSM even with one record. > > > > > > > > > > Why would BRIN index allow having FSM for heap relation? > > > > Oops, I forgot this file is for testing heaps only. That said, we > > could possibly put most of the FSM tests such as > > > > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > > > > into brin.sql since we know a non-empty BRIN index will have a FSM. > > As in the attached. Applies on top of v20. First to revert to HEAD, > second to move FSM tests to brin.sql. This is a much less invasive and > more readable patch, in addition to being hopefully more portable. > I don't think that moving fsm tests to brin would be a good approach. We want to have a separate test for each access method. I think if we want to do something to avoid portability issues, maybe we can do what Masahiko San has just suggested. OTOH, I think we are just good w.r.t this issue with the last patch I sent. I think unless we see some problem here, we should put energy into having a reproducible test for the fourth problem mentioned in my mail up thread [1]. Do you think it makes sense to run make check in loop for multiple times or do you have any idea how we can have a reproducible test? [1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't think that moving fsm tests to brin would be a good approach. > We want to have a separate test for each access method. I think if we > want to do something to avoid portability issues, maybe we can do what > Masahiko San has just suggested. We could also use the same plpgsql loop as in fsm.sql to check the ctid, right? > OTOH, I think we are just good w.r.t > this issue with the last patch I sent. I think unless we see some > problem here, we should put energy into having a reproducible test for > the fourth problem mentioned in my mail up thread [1]. Do you think > it makes sense to run make check in loop for multiple times or do you > have any idea how we can have a reproducible test? Okay. Earlier I tried running make installcheck with force_parallel_mode='regress', but didn't get a failure. I may not have run enough times, though. I'll have to think about how to induce it. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I don't think that moving fsm tests to brin would be a good approach. > > We want to have a separate test for each access method. I think if we > > want to do something to avoid portability issues, maybe we can do what > > Masahiko San has just suggested. > > We could also use the same plpgsql loop as in fsm.sql to check the ctid, right? > Yes, however, I feel we should leave it as it is for now unless we see any risk of portability issues. The only reason to do that way is to avoid any failure for bigger block size (say BLCKSZ is 16KB or 32KB). Does anyone else have any opinion on whether we should try to write tests which should care for bigger block size? I see that existing regression tests fail if we configure with bigger block size, so not sure if we should try to avoid that here. In an ideal scenario, I think it would be good if we can write tests which pass on all kind of block sizes, that will make the life easier if tomorrow one wants to set up a buildfarm or do the testing for bigger block sizes. > > OTOH, I think we are just good w.r.t > > this issue with the last patch I sent. I think unless we see some > > problem here, we should put energy into having a reproducible test for > > the fourth problem mentioned in my mail up thread [1]. Do you think > > it makes sense to run make check in loop for multiple times or do you > > have any idea how we can have a reproducible test? > > Okay. Earlier I tried running make installcheck with > force_parallel_mode='regress', but didn't get a failure. > AFAICS, this failure was not for force_parallel_mode='regress'. See the config at [1]. > I may not > have run enough times, though. > Yeah, probably running make check or make installcheck many times would help, but not sure. > I'll have to think about how to induce > it. > Thanks! [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-01-28%2004%3A00%3A23 -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > There are a few buildfarm failures due to this commit, see my email on > > > pgsql-committers. If you have time, you can also once look into > > > those. > > > > I didn't see anything in common with the configs of the failed > > members. None have a non-default BLCKSZ that I can see. > > > > I have done an analysis of the different failures on buildfarm. > In the past few days, we have done a further analysis of each problem and tried to reproduce it. We are successful in generating some form of reproducer for 3 out of 4 problems in the same way as it was failed in the buildfarm. For the fourth symptom, we have tried a lot (even Andrew Dunstan has helped us to run the regression tests with the faulty commit on Jacana for many hours, but it didn't got reproduced) but not able to regenerate a failure in a similar way. However, I have a theory as mentioned below why the particular test could fail and the fix for the same is done in the patch. I am planning to push the latest version of the patch [1] which has fixes for all the symptoms. Does anybody have any opinion here? > > 4. Failure on jacana: > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out > 2018-09-26 > 17:53:33 -0400 > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out > 2019-01-27 23:14:35 > -0500 > @@ -252,332 +252,7 @@ > ('(0,100)(0,infinity)'), > ('(-infinity,0)(0,infinity)'), > ('(-infinity,-infinity)(infinity,infinity)'); > -SET enable_seqscan = false; > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; > .. > .. > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", > Line: > 1118) > .. > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process > (PID 14388) exited with exit code 3 > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process > was running: INSERT INTO box_temp > VALUES (NULL), > > I think the reason for this failure is same as previous (as mentioned > in point-3), but this can happen in a different way. Say, we have > searched the local map and then try to extend a relation 'X' and in > the meantime, another backend has extended such that it creates FSM. > Now, we will reuse that page and won't clear local map. Now, say we > try to insert in relation 'Y' which doesn't have FSM. It will try to > set the local map and will find that it already exists, so will fail. > Now, the question is how it can happen in this box.sql test. I guess > that is happening for some system table which is being populated by > Create Index statement executed just before the failing Insert. > > I think both 3 and 4 are timing issues, so we didn't got in our local > regression runs. > [1] - https://www.postgresql.org/message-id/CAA4eK1%2B3ajhRPC0jvUi6p_aMrTUpB568OBH10LrbHtvOLNTgqQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 2, 2019 at 7:30 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > > <john.naylor@2ndquadrant.com> wrote: > > In the past few days, we have done a further analysis of each problem > and tried to reproduce it. We are successful in generating some form > of reproducer for 3 out of 4 problems in the same way as it was failed > in the buildfarm. For the fourth symptom, we have tried a lot (even > Andrew Dunstan has helped us to run the regression tests with the > faulty commit on Jacana for many hours, but it didn't got reproduced) > but not able to regenerate a failure in a similar way. However, I > have a theory as mentioned below why the particular test could fail > and the fix for the same is done in the patch. I am planning to push > the latest version of the patch [1] which has fixes for all the > symptoms. > Today, I have spent some more time to generate a test which can reproduce the failure though it is with the help of breakpoints. See below: > > > > 4. Failure on jacana: > > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out > > 2018-09-26 > > 17:53:33 -0400 > > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out > > 2019-01-27 23:14:35 > > -0500 > > @@ -252,332 +252,7 @@ > > ('(0,100)(0,infinity)'), > > ('(-infinity,0)(0,infinity)'), > > ('(-infinity,-infinity)(infinity,infinity)'); > > -SET enable_seqscan = false; > > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; > > .. > > .. > > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: > > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", > > Line: > > 1118) > > .. > > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process > > (PID 14388) exited with exit code 3 > > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process > > was running: INSERT INTO box_temp > > VALUES (NULL), > > > > I think the reason for this failure is same as previous (as mentioned > > in point-3), but this can happen in a different way. Say, we have > > searched the local map and then try to extend a relation 'X' and in > > the meantime, another backend has extended such that it creates FSM. > > Now, we will reuse that page and won't clear local map. Now, say we > > try to insert in relation 'Y' which doesn't have FSM. It will try to > > set the local map and will find that it already exists, so will fail. > > Now, the question is how it can happen in this box.sql test. I guess > > that is happening for some system table which is being populated by > > Create Index statement executed just before the failing Insert. > > Based on the above theory, the test is as below: Session-1 --------------- postgres=# create table test_1(c1 int, c2 char(1500)); CREATE TABLE postgres=# create table test_2(c1 int, c2 char(1500)); CREATE TABLE postgres=# insert into test_1 values(generate_series(1,20),'aaaa'); INSERT 0 20 postgres=# insert into test_2 values(1,'aaaa'); INSERT 0 1 Session-2 ---------------- postgres=# analyze test_1; ANALYZE postgres=# analyze test_2; ANALYZE postgres=# select oid, relname, relpages from pg_class where relname like 'test%'; oid | relname | relpages -------+----------------+---------- 41835 | test_1 | 4 41838 | test_2 | 1 Till here we can see that test_1 has 4 pages and test2_1 has 1 page. 
At this stage, even one more record insertion in test_1 will create a new page. So, now we have to hit the scenario with the help of debugger. For session-1, attach debugger and put breakpoint in RelationGetBufferForTuple(). Session-1 ---------------- postgres=# insert into test_1 values(21,'aaaa'); It will hit the breakpoint in RelationGetBufferForTuple(). Now, add one more breakpoint on line 542 in hio.c, aka below line: RelationGetBufferForTuple { .. else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock)) Press continue and stop the debugger at line 542. Attach the debugger for session-2 and add a breakpoint on line 580 in hio.c, aka below line: RelationGetBufferForTuple { .. buffer = ReadBufferBI(relation, P_NEW, bistate); Session-2 --------------- postgres=# insert into test_1 values(22,'aaaa'); It will hit the breakpoint in RelationGetBufferForTuple(). Now proceed with debugging on session-1 by one step. This is to ensure that session-1 doesn't get a conditional lock. Now, continue the debugger in session-2 and after that run vacuum test_1 from session-2, this will ensure that FSM is created for relation test_1. So the state of session-2 will be as below: Session-2 ---------------- postgres=# insert into test_1 values(22,'aaaa'); INSERT 0 1 postgres=# vacuum test_1; VACUUM postgres=# select oid, relname, relpages from pg_class where relname like 'test%'; oid | relname | relpages -------+----------------+---------- 41835 | test_1 | 5 41838 | test_2 | 1 Continue the debugger in session-1. Now insert one row in test_2 from session-1 and kaboom: Session-1 --------------- postgres=# insert into test_2 values(2,'aaaa'); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> Server logs is as below: TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: "..\postgresql\src\backend\storage\freespace\freespace.c", Line: 1118) 2019-02-02 16:48:06.216 IST [4044] LOG: server process (PID 3540) exited with exit code 3 2019-02-02 16:48:06.216 IST [4044] DETAIL: Failed process was running: insert into test_2 values(2,'aaaa'); This looks exactly the same as the failure in Jacana. The patch fixes this case. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > FYI, the second comment is still present in v20. > > > > oops, forgot to include in commit after making a change, done now. > This doesn't get applied cleanly after recent commit 0d1fe9f74e. Attached is a rebased version. I have checked once that the changes done by 0d1fe9f74e don't impact this patch. John, see if you can also once confirm whether the recent commit (0d1fe9f74e) has any impact. I am planning to push this tomorrow morning (IST) unless you or anyone see any problem with this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > Attached is a rebased version. I have checked once that the changes > done by 0d1fe9f74e don't impact this patch. John, see if you can also > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > am planning to push this tomorrow morning (IST) unless you or anyone > see any problem with this. Since that commit changes RelationAddExtraBlocks(), which can be induced by your pgbench adjustment upthread, I ran make check with that adjustment in the pgbench dir 300 times without triggering asserts. I also tried to follow the logic in 0d1fe9f74e, and I believe it will be correct without a FSM. [1] https://www.postgresql.org/message-id/CAA4eK1KRByXY03qR2JvUjUxKBzpBnCSO5H19oAC%3D_v4r5dzTwQ%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > > Attached is a rebased version. I have checked once that the changes > > done by 0d1fe9f74e don't impact this patch. John, see if you can also > > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > > am planning to push this tomorrow morning (IST) unless you or anyone > > see any problem with this. > > Since that commit changes RelationAddExtraBlocks(), which can be > induces by your pgbench adjustment upthread, I ran make check with > that adjustment in the pgbench dir 300 times without triggering > asserts. > > I also tried to follow the logic in 0d1fe9f74e, and I believe it will > be correct without a FSM. > I have just pushed it and buildfarm has shown two failures: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26 --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out 2019-02-03 21:27:29.000000000 -0500 +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out 2019-02-03 21:41:32.000000000 -0500 @@ -38,19 +38,19 @@ SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); fsm_page_contents ------------------- - 0: 39 + - 1: 39 + - 3: 39 + - 7: 39 + - 15: 39 + - 31: 39 + - 63: 39 + .. This one seems to be FSM test portability issue (due to different page contents, maybe). Looking into it, John, see if you are around and have some thoughts on it. 2. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2019-02-04%2002%3A30%3A25 select explain_parallel_append('execute ab_q5 (33, 44, 55)'); - explain_parallel_append -------------------------------------------------------------------------------- - Finalize Aggregate (actual rows=1 loops=1) - -> Gather (actual rows=3 loops=1) - Workers Planned: 2 - Workers Launched: 2 - -> Partial Aggregate (actual rows=1 loops=3) - -> Parallel Append (actual rows=0 loops=N) - Subplans Removed: 8 - -> Parallel Seq Scan on ab_a1_b1 (never executed) - Filter: ((b < 4) AND (a = ANY (ARRAY[$1, $2, $3]))) -(9 rows) - +ERROR: lost connection to parallel worker +CONTEXT: PL/pgSQL function explain_parallel_append(text) line 5 at FOR over EXECUTE statement -- Test Parallel Append with PARAM_EXEC Params select explain_parallel_append('select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 2'); explain_parallel_append Failure is something like: 2019-02-03 21:44:42.456 EST [2812:327] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (1, 1, 1)'); 2019-02-03 21:44:42.493 EST [2812:328] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (2, 3, 3)'); 2019-02-03 21:44:42.531 EST [2812:329] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (33, 44, 55)'); 2019-02-04 02:44:42.552 GMT [4172] FATAL: could not reattach to shared memory (key=00000000000001B4, addr=0000000001980000): error code 487 2019-02-03 21:44:42.555 EST [5116:6] LOG: background worker "parallel worker" (PID 4172) exited with exit code 1 2019-02-03 21:44:42.560 EST [2812:330] pg_regress/partition_prune ERROR: lost connection to parallel worker I don't think this is related to this commit. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > > > Attached is a rebased version. I have checked once that the changes > > > done by 0d1fe9f74e don't impact this patch. John, see if you can also > > > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > > > am planning to push this tomorrow morning (IST) unless you or anyone > > > see any problem with this. > > > > Since that commit changes RelationAddExtraBlocks(), which can be > > induces by your pgbench adjustment upthread, I ran make check with > > that adjustment in the pgbench dir 300 times without triggering > > asserts. > > > > I also tried to follow the logic in 0d1fe9f74e, and I believe it will > > be correct without a FSM. > > > > I have just pushed it and buildfarm has shown two failures: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26 > > --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out > 2019-02-03 21:27:29.000000000 -0500 > +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out > 2019-02-03 21:41:32.000000000 -0500 > @@ -38,19 +38,19 @@ > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > fsm_page_contents > ------------------- > - 0: 39 + > - 1: 39 + > - 3: 39 + > - 7: 39 + > - 15: 39 + > - 31: 39 + > - 63: 39 + > .. > > This one seems to be FSM test portability issue (due to different page > contents, maybe). Looking into it, John, see if you are around and > have some thoughts on it. > One more similar failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01 So, basically, this is due to difference in the number of tuples that can fit on a page. The freespace in FSM for the page is shown different because of available space on a particular page. This can vary due to alignment. It seems to me we can't rely on FSM contents if there are many tuples in a relation. One idea is to get rid of dependency on FSM contents in this test, can you think of any better way to have consistent FSM contents across different platforms? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 9:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > One more similar failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01 > > So, basically, this is due to difference in the number of tuples that > can fit on a page. The freespace in FSM for the page is shown > different because of available space on a particular page. This can > vary due to alignment. It seems to me we can't rely on FSM contents > if there are many tuples in a relation. One idea is to get rid of > dependency on FSM contents in this test, can you think of any better > way to have consistent FSM contents across different platforms? > One more idea could be that we devise a test (say with a char/varchar) such that it always consumes the same space in a page irrespective of its alignment. Yet another way could be to use explain (costs off, analyze on, timing off, summary off) ..., which will ensure that we have test coverage for the function fsm_page_contents but don't rely on its contents. What do you think? I will go with the last option to stabilize the buildfarm tests unless anyone thinks otherwise or has a better idea. I will probably wait for 20 minutes or so to see if anyone has inputs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
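For illustration, the explain-based idea could look roughly like this (a sketch only; the wording actually pushed to the pageinspect test may differ):

explain (costs off, analyze on, timing off, summary off)
SELECT fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));

This still exercises fsm_page_contents() while keeping the platform-dependent page contents out of the expected output.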
On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > This one seems to be FSM test portability issue (due to different page > contents, maybe). Looking into it, John, see if you are around and > have some thoughts on it. Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 tuple has been inserted into the 5th page. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
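For reference, a rough sketch of that kind of loop (illustrative only; the exact filler values and exit condition used in fsm.sql may differ):

DO $$
DECLARE
  curtid tid;
  num int := 0;
BEGIN
  LOOP
    num := num + 1;
    INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
    -- the block number is the first component of the ctid; stop once a
    -- tuple lands on the 5th page (block 4)
    EXIT WHEN (curtid::text::point)[0] >= 4;
  END LOOP;
END;
$$;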
On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This one seems to be FSM test portability issue (due to different page > > contents, maybe). Looking into it, John, see if you are around and > > have some thoughts on it. > > Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 > tuple has been inserted into the 5th page. > Yeah, that can also work, but we still need to be careful about the alignment of that one tuple, otherwise there could be different free space on the fifth page. An easier way could probably be to use an even number of integers in the table, say (int, int). Anyway, for now, I have avoided the dependency on FSM contents without losing test coverage. I have pushed my latest suggestion in the previous email. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > This one seems to be FSM test portability issue (due to different page > > > contents, maybe). Looking into it, John, see if you are around and > > > have some thoughts on it. > > > > Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 > > tuple has been inserted into the 5th page. > > > > Yeah, that can also work, but we still need to be careful about the > alignment of that one tuple, otherwise there could be different > free space on the fifth page. An easier way could probably be to use > an even number of integers in the table, say (int, int). Anyway, for > now, I have avoided the dependency on FSM contents without losing > test coverage. I have pushed my latest suggestion in the previous > email. > The change seems to have worked. All the buildfarm machines that were showing the failure are passing now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yeah, that can also work, but we still need to be careful about the > > alignment of that one tuple, otherwise there could be different > > free space on the fifth page. An easier way could probably be to use > > an even number of integers in the table, say (int, int). Anyway, for > > now, I have avoided the dependency on FSM contents without losing > > test coverage. I have pushed my latest suggestion in the previous > > email. > > > > The change seems to have worked. All the buildfarm machines that were > > showing the failure are passing now. Excellent! Now that the buildfarm is green as far as this patch goes, I will touch on a few details to round out development in this area: 1. Earlier, I had a test to ensure that free space towards the front of the relation was visible with no FSM. In [1], I rewrote it without using vacuum, so we can consider adding it back now if desired. I can prepare a patch for this. 2. As a follow-on, since we don't rely on vacuum to remove dead rows, we could try putting the fsm.sql test in some existing group in the parallel schedule, rather than its own group as it is now. 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this patch's effects on GetRecordedFreeSpace(), which will return zero for tables with no FSM. The other callers are in: contrib/pg_freespacemap/pg_freespacemap.c contrib/pgstattuple/pgstatapprox.c For pg_freespacemap, this doesn't matter, since it's just reporting the facts. For pgstattuple_approx(), it might under-estimate the free space and over-estimate the number of live tuples. This might be fine, since it is approximate after all, but maybe a comment would be helpful. If this is a problem, we could tweak it to be more precise for tables without FSMs. Thoughts? 4. The latest patch for the pg_upgrade piece was in [2] Anything else? [1] https://www.postgresql.org/message-id/CACPNZCvEXLUx10pFvNcOs88RvqemMEjOv7D9MhL3ac86EzjAOA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
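To make point 3 concrete, a hypothetical session (table name and contents are invented for illustration) showing what the pg_freespacemap extension reports for a heap that is still below the threshold:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE TABLE tiny_tbl (a int);
INSERT INTO tiny_tbl SELECT generate_series(1, 10);
-- With no FSM fork, GetRecordedFreeSpace() reports 0 for every block,
-- even though the single page is mostly empty:
SELECT * FROM pg_freespace('tiny_tbl');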
On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The change seems to have worked. All the buildfarm machines that were
> > showing the failure are passing now.
>
> Excellent!
>
> Now that the buildfarm is green as far as this patch goes,
>
There is still one recent failure which I don't think is related to the commit of this patch:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-02-04%2016%3A38%3A48
================== pgsql.build/src/bin/pg_ctl/tmp_check/log/004_logrotate_primary.log ===================
TRAP: FailedAssertion("!(UsedShmemSegAddr != ((void *)0))", File: "g:\prog\bf\root\head\pgsql.build\src\backend\port\win32_shmem.c", Line: 513)
I think we need to do something about this random failure, but not as part of this thread/patch.
> I will
> touch on a few details to round out development in this area:
>
> 1. Earlier, I had a test to ensure that free space towards the front
> of the relation was visible with no FSM. In [1], I rewrote it without
> using vacuum, so we can consider adding it back now if desired. I can
> prepare a patch for this.
>
Yes, this is required. It is generally good practice to add a test (unless it takes a lot of time) which covers new code/functionality.
> 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> we could try putting the fsm.sql test in some existing group in the
> parallel schedule, rather than its own group as it is now.
>
+1.
> 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> patch's effects on GetRecordedFreeSpace(), which will return zero for
> tables with no FSM.
>
Right, but what exactly do we want to do about it? Do you want to add a comment atop this function?
> The other callers are in:
> contrib/pg_freespacemap/pg_freespacemap.c
> contrib/pgstattuple/pgstatapprox.c
>
> For pg_freespacemap, this doesn't matter, since it's just reporting
> the facts. For pgstattuple_approx(), it might under-estimate the free
> space and over-estimate the number of live tuples.
>
Sure, but it can do so without the patch as well, if vacuum hasn't updated the free space map.
> This might be fine,
> since it is approximate after all, but maybe a comment would be
> helpful. If this is a problem, we could tweak it to be more precise
> for tables without FSMs.
>
Sounds reasonable to me.
> Thoughts?
>
> 4. The latest patch for the pg_upgrade piece was in [2]
>
It will be good if we get this one as well. I will look into it once we are done with the other points you have mentioned.
On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > 1. Earlier, I had a test to ensure that free space towards the front > > of the relation was visible with no FSM. In [1], I rewrote it without > > using vacuum, so we can consider adding it back now if desired. I can > > prepare a patch for this. > > > > Yes, this is required. It is generally a good practise to add test (unless it takes a lot of time) which covers new code/functionality. > > > 2. As a follow-on, since we don't rely on vacuum to remove dead rows, > > we could try putting the fsm.sql test in some existing group in the > > parallel schedule, rather than its own group is it is now. > > > > +1. This is done in 0001. > > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this > > patch's effects on GetRecordedFreeSpace(), which will return zero for > > tables with no FSM. > > > > Right, but what exactly we want to do for it? Do you want to add a comment atop of this function? Hmm, the comment already says "according to the FSM", so maybe it's already obvious. I was thinking more about maybe commenting the callsite where it's helpful, as in 0002. > > The other callers are in: > > contrib/pg_freespacemap/pg_freespacemap.c > > contrib/pgstattuple/pgstatapprox.c > > > > For pg_freespacemap, this doesn't matter, since it's just reporting > > the facts. For pgstattuple_approx(), it might under-estimate the free > > space and over-estimate the number of live tuples. > > > > Sure, but without patch also, it can do so, if the vacuum hasn't updated freespace map. Okay, then maybe we don't need to do anything else here. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > 1. Earlier, I had a test to ensure that free space towards the front > > > of the relation was visible with no FSM. In [1], I rewrote it without > > > using vacuum, so we can consider adding it back now if desired. I can > > > prepare a patch for this. > > > > > > > Yes, this is required. It is generally a good practise to add test (unless it takes a lot of time) which covers newcode/functionality. > > > > > 2. As a follow-on, since we don't rely on vacuum to remove dead rows, > > > we could try putting the fsm.sql test in some existing group in the > > > parallel schedule, rather than its own group is it is now. > > > > > > > +1. > > This is done in 0001. > This is certainly a good test w.r.t code coverage of new code, but I have few comments: 1. The size of records in test still depends on alignment (MAXALIGN). Though it doesn't seem to be a problematic case, I still suggest we can avoid using records whose size depends on alignment. If you change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, str text);, then you can avoid alignment related issues for the records being used in test. 2. +-- Fill most of the last block .. +-- Make sure records can go into any block but the last one .. +-- Insert large record and make sure it does not cause the relation to extend The comments in some part of the test seems too focussed towards the algorithm used for in-memory map. I think we can keep these if we want, but it is required to write a more generic comment stating what is the actual motive of additional tests (basically we are testing the functionality of in-memory map (LSM) for the heap, so we should write about it.). > > > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this > > > patch's effects on GetRecordedFreeSpace(), which will return zero for > > > tables with no FSM. > > > > > > > Right, but what exactly we want to do for it? Do you want to add a comment atop of this function? > > Hmm, the comment already says "according to the FSM", so maybe it's > already obvious. I was thinking more about maybe commenting the > callsite where it's helpful, as in 0002. > > > > The other callers are in: > > > contrib/pg_freespacemap/pg_freespacemap.c > > > contrib/pgstattuple/pgstatapprox.c > > > > > > For pg_freespacemap, this doesn't matter, since it's just reporting > > > the facts. For pgstattuple_approx(), it might under-estimate the free > > > space and over-estimate the number of live tuples. > > > > > > > Sure, but without patch also, it can do so, if the vacuum hasn't updated freespace map. > > Okay, then maybe we don't need to do anything else here. > Shall we add a note to the docs of pg_freespacemap and pgstattuple_approx indicating that for small relations, FSM won't be created, so these functions won't give appropriate value? Or other possibility could be that we return an error if the block number is less than the threshold value, but not sure if that is a good alternative as that can happen today also if the vacuum hasn't run on the table. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > wrote: >> >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> >> wrote: > This is certainly a good test w.r.t code coverage of new code, but I > have few comments: > 1. The size of records in test still depends on alignment (MAXALIGN). > Though it doesn't seem to be a problematic case, I still suggest we > can avoid using records whose size depends on alignment. If you > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > str text);, then you can avoid alignment related issues for the > records being used in test. Done. > 2. > +-- Fill most of the last block > .. > +-- Make sure records can go into any block but the last one > .. > +-- Insert large record and make sure it does not cause the relation to > extend > > The comments in some part of the test seems too focussed towards the > algorithm used for in-memory map. I think we can keep these if we > want, but it is required to write a more generic comment stating what > is the actual motive of additional tests (basically we are testing the > functionality of in-memory map (LSM) for the heap, so we should write > about it.). Done. > Shall we add a note to the docs of pg_freespacemap and > pgstattuple_approx indicating that for small relations, FSM won't be > created, so these functions won't give appropriate value? I've given this a try in 0002. > Or other > possibility could be that we return an error if the block number is > less than the threshold value, but not sure if that is a good > alternative as that can happen today also if the vacuum hasn't run on > the table. Yeah, an error doesn't seem helpful. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Shall we add a note to the docs of pg_freespacemap and > > pgstattuple_approx indicating that for small relations, FSM won't be > > created, so these functions won't give appropriate value? > > I've given this a try in 0002. > This looks mostly correct, but I have a few observations: 1. - tuples. + tuples. Small tables don't have a free space map, so in that case + this function will report zero free space, likewise inflating the + estimated number of live tuples. The last part of the sentence "likewise inflating the estimated number of live tuples." seems incorrect to me because live tuples are computed based on the pages scanned, live tuples in them and total blocks in the relation. So, I think it should be "likewise inflating the approximate tuple length". 2. + In addition, small tables don't have a free space map, so this function + will return zero even if free space is available. Actually, the paragraph you have modified applies to both the functions mentioned on that page. So instead of saying "this function ..", we can say "these functions .." 3. * space from the FSM and move on. + * Note: If a relation has no FSM, GetRecordedFreeSpace() will report + * zero free space. This is fine for the purposes of approximation. */ It is better to have an empty line before Note: ... I have modified the patch for the above observations and added a commit message as well, see if it looks okay to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Feb 20, 2019 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have modified the patch for the above observations and added a > commit message as well, see if it looks okay to you. Looks good to me, thanks. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > > wrote: > >> > >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> > >> wrote: > > This is certainly a good test w.r.t code coverage of new code, but I > > have few comments: > > 1. The size of records in test still depends on alignment (MAXALIGN). > > Though it doesn't seem to be a problematic case, I still suggest we > > can avoid using records whose size depends on alignment. If you > > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > > str text);, then you can avoid alignment related issues for the > > records being used in test. > > Done. > > > 2. > > +-- Fill most of the last block > > .. > > +-- Make sure records can go into any block but the last one > > .. > > +-- Insert large record and make sure it does not cause the relation to > > extend > > > > The comments in some part of the test seems too focussed towards the > > algorithm used for in-memory map. I think we can keep these if we > > want, but it is required to write a more generic comment stating what > > is the actual motive of additional tests (basically we are testing the > > functionality of in-memory map (LSM) for the heap, so we should write > > about it.). > > Done. > Thanks, the modification looks good. I have slightly changed the commit message in the attached patch. I will spend some more time tomorrow morning on this and will commit unless I see any new problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Please remember to keep serial_schedule in sync. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Please remember to keep serial_schedule in sync. > I don't understand what you mean by this? It is already present in serial_schedule. In parallel_schedule, we are just moving this test to one of the parallel groups. Do we need to take care of something else? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2019-Feb-21, Amit Kapila wrote: > On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > Please remember to keep serial_schedule in sync. > > I don't understand what you mean by this? It is already present in > serial_schedule. In parallel_schedule, we are just moving this test > to one of the parallel groups. Do we need to take care of something > else? Just to make sure it's in the same relative position. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > > wrote: > >> > >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> > >> wrote: > > This is certainly a good test w.r.t code coverage of new code, but I > > have few comments: > > 1. The size of records in test still depends on alignment (MAXALIGN). > > Though it doesn't seem to be a problematic case, I still suggest we > > can avoid using records whose size depends on alignment. If you > > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > > str text);, then you can avoid alignment related issues for the > > records being used in test. > > Done. > Oops, on again carefully studying the test, I realized my above comment was wrong. Let me explain with a test this time: CREATE TABLE fsm_check_size (num int, str text); INSERT INTO fsm_check_size SELECT i, rpad('', 1024, 'a') FROM generate_series(1,3) i; So here you are inserting 4-byte integer and 1024-bytes variable length record. So the tuple length will be tuple_header (24-bytes) + 4-bytes for integer + 4-bytes header for variable length data + 1024 bytes of actual data. So, the length will be 1056 which is already MAXALIGN. I took the new comments added in your latest version of the patch and added them to the previous version of the patch. Kindly see if I have not missed anything while merging the patch-versions? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 21, 2019 at 7:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So here you are inserting 4-byte integer and 1024-bytes variable > length record. So the tuple length will be tuple_header (24-bytes) + > 4-bytes for integer + 4-bytes header for variable length data + 1024 > bytes of actual data. So, the length will be 1056 which is already > MAXALIGN. I took the new comments added in your latest version of the > patch and added them to the previous version of the patch. Kindly > see if I have not missed anything while merging the patch-versions? OK, that makes sense. Looks fine to me. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 21, 2019 at 6:39 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Feb-21, Amit Kapila wrote: > > > On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > Please remember to keep serial_schedule in sync. > > > > I don't understand what you mean by this? It is already present in > > serial_schedule. In parallel_schedule, we are just moving this test > > to one of the parallel groups. Do we need to take care of something > > else? > > Just to make sure it's in the same relative position. > Okay, thanks for the input. Attached, find the updated patch, will try to commit it tomorrow unless you or someone else has any other comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
I think this test is going to break on nonstandard block sizes. While we don't promise that all tests work on such installs (particularly planner ones), it seems fairly easy to cope with this one -- just use a record size expressed as a fraction of current_setting('block_size'). So instead of "1024" you'd write current_setting('block_size') / 8. And then display the relation size in terms of pages, not bytes, so divide pg_relation_size by block size. (I see that there was already a motion for this, but was dismissed because of lack of interest.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
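For illustration, a sketch of what that could look like for the fsm_check_size test (assuming the rpad()-based filler discussed upthread; the committed wording may differ):

-- record size expressed as a fraction of the configured block size
INSERT INTO fsm_check_size
  SELECT i, rpad('', current_setting('block_size')::int / 8, 'a')
  FROM generate_series(1, 3) i;
-- report the heap size in pages rather than bytes
SELECT pg_relation_size('fsm_check_size', 'main') /
       current_setting('block_size')::int AS heap_nblocks;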
On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I think this test is going to break on nonstandard block sizes. While > we don't promise that all tests work on such installs (particularly > planner ones), > The reason for not pushing much on making the test pass for nonstandard block sizes is that when I tried existing tests, there were already some failures. For example, see the failures in the attached regression diff files (for block_size as 16K and 32K respectively). I saw those failures during the previous investigation, the situation on HEAD might or might not be exactly the same. Whereas I see the value in trying to make sure that tests pass for nonstandard block sizes, but that doesn't seem to be followed for all the tests. > it seems fairly easy to cope with this one -- just use a > record size expressed as a fraction of current_setting('block_size'). > So instead of "1024" you'd write current_setting('block_size') / 8. > And then display the relation size in terms of pages, not bytes, so > divide pg_relation_size by block size. > The idea sounds good. John, would you like to give it a try? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 21, 2019 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. Sure, but let's not make things worse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Feb-22, Amit Kapila wrote: > On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I think this test is going to break on nonstandard block sizes. While > > we don't promise that all tests work on such installs (particularly > > planner ones), > > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. For example, see the failures in the > attached regression diff files (for block_size as 16K and 32K > respectively). I saw those failures during the previous > investigation, the situation on HEAD might or might not be exactly the > same. Whereas I see the value in trying to make sure that tests pass > for nonstandard block sizes, but that doesn't seem to be followed for > all the tests. Wow, there's a lot less tests failing there than I thought there would be. That increases hope that we can someday have them pass. +1 on not making things worse. I think the crash in the amcheck test should be studied, one way or another; CCing Peter. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 22, 2019 at 8:04 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Wow, there's a lot less tests failing there than I thought there would > be. That increases hope that we can someday have them pass. +1 on not > making things worse. > > I think the crash in the amcheck test should be studied, one way or > another; CCing Peter. I built Postgres with "--with-blocksize=16" and "--with-blocksize=32", and tested amcheck with both builds. All tests passed. I have a hard time imagining what the problem could be here. If there was a problem with amcheck relying on there being an 8KiB block size specifically, then it would almost certainly have been there since the initial commit from March 2017. Not much has changed since then, and the crash that Amit reported occurs at the earliest possible point. I find it suspicious that there is another crash in pageinspect's brin_page_items(), since like amcheck, pageinspect is a contrib module that relies on BLCKSZ when allocating a local temp buffer. -- Peter Geoghegan
On 2019-Feb-22, Peter Geoghegan wrote: > I find it suspicious that there is another crash in pageinspect's > brin_page_items(), since like amcheck, pageinspect is a contrib module > that relies on BLCKSZ when allocating a local temp buffer. Ah. Maybe they just weren't rebuilt. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I think this test is going to break on nonstandard block sizes. While > we don't promise that all tests work on such installs (particularly > planner ones), it seems fairly easy to cope with this one -- just use a > record size expressed as a fraction of current_setting('block_size'). > So instead of "1024" you'd write current_setting('block_size') / 8. > And then display the relation size in terms of pages, not bytes, so > divide pg_relation_size by block size. I've done this for v6, tested on 16k block size. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. FWIW, I currently see 8 failures (attached). -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2019-Feb-23, John Naylor wrote: > On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The reason for not pushing much on making the test pass for > > nonstandard block sizes is that when I tried existing tests, there > > were already some failures. > > FWIW, I currently see 8 failures (attached). Hmm, not great -- even the strings test fails, which seems to try to handle the case explicitly. I did expect the plan shape ones to fail, but I'm surprised about the tablesample one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26/02/2019 16:20, Alvaro Herrera wrote: > On 2019-Feb-23, John Naylor wrote: > >> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>> The reason for not pushing much on making the test pass for >>> nonstandard block sizes is that when I tried existing tests, there >>> were already some failures. >> >> FWIW, I currently see 8 failures (attached). > > Hmm, not great -- even the strings test fails, which seems to try to handle > the case explicitly. I did expect the plan shape ones to fail, but I'm > surprised about the tablesample one. > The SYSTEM table sampling is basically per-page sampling so it depends heavily on which rows are on which page. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Once we agree on the code, we need to test below scenarios: > (a) upgrade from all supported versions to the latest version > (b) upgrade standby with and without using rsync. Although the code hasn't been reviewed yet, I went ahead and tested (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4 instance with the regression database and restored it to all supported versions. To make it work with pg_upgrade, I first had to drop tables with oids, drop functions referring to C libraries, and drop the later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all versions to HEAD with the patch applied. pg_upgrade worked without error, and the following query returned 0 bytes on all the new clusters: select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size from pg_class where relkind in ('r', 't') and pg_relation_size(oid, 'main') <= 4 * 8192; The complementary query (> 4 * 8192) returned the same number of bytes as in the original 9.4 instance. To make it easy to find, the latest regression test patch, which is intended to be independent of block-size, was in [2]. [1] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CACPNZCsWa%3Ddd0K%2BFiODwM%3DLEsepAHVJCoSx2Avew%3DxBEX3Ywjw%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
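For completeness, the complementary query mentioned above is of the same shape, with the size condition inverted:

select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
from pg_class where relkind in ('r', 't')
and pg_relation_size(oid, 'main') > 4 * 8192;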
On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I think this test is going to break on nonstandard block sizes. While > > we don't promise that all tests work on such installs (particularly > > planner ones), it seems fairly easy to cope with this one -- just use a > > record size expressed as a fraction of current_setting('block_size'). > > So instead of "1024" you'd write current_setting('block_size') / 8. > > And then display the relation size in terms of pages, not bytes, so > > divide pg_relation_size by block size. > > I've done this for v6, tested on 16k block size. > Thanks, the patch looks good to me. I have additionally tested it 32K and 1K sized blocks and the test passes. I will commit this early next week. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 2:33 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think there is some value in using the information from > > this function to skip fsm files, but the code doesn't appear to fit > > well, how about moving this check to new function > > new_cluster_needs_fsm()? > > For v21, new_cluster_needs_fsm() has all responsibility for obtaining > the info it needs. I think this is much cleaner, > Right, now the code looks much better. > but there is a small > bit of code duplication since it now has to form the file name. One > thing we could do is form the the base old/new file names in > transfer_single_new_db() and pass those to transfer_relfile(), which > will only add suffixes and segment numbers. We could then pass the > base old file name to new_cluster_needs_fsm() and use it as is. Not > sure if that's worthwhile, though. > I don't think it is worth. Few minor comments: 1. warning C4715: 'new_cluster_needs_fsm': not all control paths return a value Getting this new warning in the patch. 2. + + /* Transfer any VM files if we can trust their contents. */ if (vm_crashsafe_match) 3. Can we add a note about this in the Notes section of pg_upgrade documentation [1]? This comment line doesn't seem to be related to this patch. If so, I think we can avoid having any additional change which is not related to the functionality of this patch. Feel free to submit it separately, if you think it is an improvement. Have you done any performance testing of this patch? I mean to say now that we added a new stat call for each table, we should see if that has any impact. Ideally, that should be compensated by the fact that we are now not transferring *fsm files for small relations. How about constructing a test where all relations are greater than 4 pages and then try to upgrade them. We can check for a cluster with a different number of relations say 10K, 20K, 50K, 100K. In general, the patch looks okay to me. I would like to know if anybody else has any opinion whether pg_upgrade should skip transferring fsm files for small relations or not? I think both me and John thinks that it is good to have feature and now that patch turns out to be simpler, I feel we can go ahead with this optimization in pg_upgrade. [1] - https://www.postgresql.org/docs/devel/pgupgrade.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
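For what it's worth, one hypothetical way to populate such a test cluster (table names and sizes are made up; each table ends up a bit over 4 pages at the default block size):

DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    EXECUTE format('CREATE TABLE upgrade_test_%s (a int, b text)', i);
    EXECUTE format('INSERT INTO upgrade_test_%s SELECT g, repeat(''x'', 500) FROM generate_series(1, 100) g', i);
  END LOOP;
END;
$$;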
On Wed, Mar 6, 2019 at 5:19 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Once we agree on the code, we need to test below scenarios: > > (a) upgrade from all supported versions to the latest version > > (b) upgrade standby with and without using rsync. > > Although the code hasn't been reviewed yet, I went ahead and tested > (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4 > instance with the regression database and restored it to all supported > versions. To make it work with pg_upgrade, I first had to drop tables > with oids, drop functions referring to C libraries, and drop the > later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all > versions to HEAD with the patch applied. pg_upgrade worked without > error, and the following query returned 0 bytes on all the new > clusters: > > select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size > from pg_class where relkind in ('r', 't') > and pg_relation_size(oid, 'main') <= 4 * 8192; > > The complementary query (> 4 * 8192) returned the same number of bytes > as in the original 9.4 instance. > Thanks, the tests done by you are quite useful. I have given a few comments as a response to your previous email. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Few minor comments: .. > > 2. > + > + /* Transfer any VM files if we can trust their > contents. */ > if (vm_crashsafe_match) > > 3. Can we add a note about this in the Notes section of pg_upgrade > documentation [1]? > > This comment line doesn't seem to be related to this patch. If so, I > think we can avoid having any additional change which is not related > to the functionality of this patch. Feel free to submit it > separately, if you think it is an improvement. > oops, I have messed up the above comments. The paragraph starting with "This comment line doesn't ..." is for comment number-2. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Few minor comments: > 1. > warning C4715: 'new_cluster_needs_fsm': not all control paths return a value > > Getting this new warning in the patch. Hmm, I don't get that in a couple versions of gcc. Your compiler must not know that pg_fatal() cannot return. I blindly added a fix. > 2. > > This comment line doesn't seem to be related to this patch. If so, I > think we can avoid having any additional change which is not related > to the functionality of this patch. Feel free to submit it > separately, if you think it is an improvement. > + > + /* Transfer any VM files if we can trust their contents. */ > if (vm_crashsafe_match) Well, I guess the current comment is still ok, so reverted. If I were to do a separate cleanup patch, I would rather remove the vm_must_add_frozenbit parameter -- there's no reason I can see for calls that transfer the heap and FSM to know about this. I also changed references to the 'first segment of the main fork' where there will almost always only be one segment. This was a vestige of the earlier algorithm I had. > 3. Can we add a note about this in the Notes section of pg_upgrade > documentation [1]? Done. > Have you done any performance testing of this patch? I mean to say > now that we added a new stat call for each table, we should see if > that has any impact. Ideally, that should be compensated by the fact > that we are now not transferring *fsm files for small relations. How > about constructing a test where all relations are greater than 4 pages > and then try to upgrade them. We can check for a cluster with a > different number of relations say 10K, 20K, 50K, 100K. I have not, but I agree it should be done. I will try to do so soon. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
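For readers following along, the "blind fix" for warning C4715 mentioned above typically looks like the fragment below: the compiler cannot see that pg_fatal() never returns, so the error branch appears to fall off the end of the function without a value, and a dummy return after it keeps the warning quiet. This is a sketch of the pattern only, reusing the assumed function shape from earlier, not the attached patch.

        if (stat(old_primary_file, &statbuf) != 0)
        {
            pg_fatal("error while checking file \"%s\": %s\n",
                     old_primary_file, strerror(errno));

            /*
             * pg_fatal() does not return, but a compiler that cannot see
             * that wants every code path to return a value (warning C4715).
             */
            return false;
        }

        return statbuf.st_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ;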
On Thu, Mar 7, 2019 at 7:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > I think this test is going to break on nonstandard block sizes. While > > > we don't promise that all tests work on such installs (particularly > > > planner ones), it seems fairly easy to cope with this one -- just use a > > > record size expressed as a fraction of current_setting('block_size'). > > > So instead of "1024" you'd write current_setting('block_size') / 8. > > > And then display the relation size in terms of pages, not bytes, so > > > divide pg_relation_size by block size. > > > > I've done this for v6, tested on 16k block size. > > > > Thanks, the patch looks good to me. I have additionally tested it 32K > and 1K sized blocks and the test passes. I will commit this early > next week. > Pushed this patch. Last time, we saw a few portability issues with this test. Both John and I, with the help of others, tried to ensure that there are no more such issues, but there is always a chance that we missed something. Anyway, I will keep an eye on the buildfarm to see if there is any problem related to this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 10, 2019 at 7:47 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Have you done any performance testing of this patch? I mean to say > > now that we added a new stat call for each table, we should see if > > that has any impact. Ideally, that should be compensated by the fact > > that we are now not transferring *fsm files for small relations. How > > about constructing a test where all relations are greater than 4 pages > > and then try to upgrade them. We can check for a cluster with a > > different number of relations say 10K, 20K, 50K, 100K. > > I have not, but I agree it should be done. I will try to do so soon. > Thanks, I will wait for your test results. I believe this is the last patch in this project and we should try to get it done soon. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Have you done any performance testing of this patch? I mean to say > now that we added a new stat call for each table, we should see if > that has any impact. Ideally, that should be compensated by the fact > that we are now not transferring *fsm files for small relations. To be precise, it will only call stat if pg_class.relpages is below the threshold. I suppose I could hack a database where all the relpages values are wrong, but that seems like a waste of time. > How > about constructing a test where all relations are greater than 4 pages > and then try to upgrade them. We can check for a cluster with a > different number of relations say 10K, 20K, 50K, 100K. I did both greater and less than 4 pages for 10k relations. Since pg_upgrade is O(# relations), I don't see a point in going higher. First, I had a problem: On MacOS with their "gcc" wrapper around clang, I got a segfault 11 when compiled with no debugging symbols. I added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a debugging build, I'm not sure how to investigate this. IIRC, this doesn't happen for me on Linux gcc. Since it was at least running now, I measured by putting gettimeofday() calls around transfer_all_new_tablespaces(). I did 10 runs each and took the average, except for patch/1-page case since it was obviously faster after a couple runs. 5 pages: master patch 5.59s 5.64s The variation within the builds is up to +/- 0.2s, so there is no difference, as expected. 1 page: master patch 5.62s 4.25s Clearly, linking is much slower than stat. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
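For anyone who wants to reproduce the measurement, the instrumentation described above amounts to wrapping the transfer phase roughly as below; the argument list shown for transfer_all_new_tablespaces() and the use of pg_log() are assumptions based on the surrounding pg_upgrade code, not part of any posted patch.

    #include <sys/time.h>        /* at the top of the file */

        struct timeval tv_before,
                    tv_after;

        gettimeofday(&tv_before, NULL);

        /* the phase being timed: copying or linking all relation files */
        transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
                                     old_cluster.pgdata, new_cluster.pgdata);

        gettimeofday(&tv_after, NULL);
        pg_log(PG_REPORT, "transfer_all_new_tablespaces(): %.2f s\n",
               (tv_after.tv_sec - tv_before.tv_sec) +
               (tv_after.tv_usec - tv_before.tv_usec) / 1000000.0);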
On Wed, Mar 13, 2019 at 4:57 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Have you done any performance testing of this patch? I mean to say > > now that we added a new stat call for each table, we should see if > > that has any impact. Ideally, that should be compensated by the fact > > that we are now not transferring *fsm files for small relations. > > To be precise, it will only call stat if pg_class.relpages is below > the threshold. I suppose I could hack a database where all the > relpages values are wrong, but that seems like a waste of time. > Right. > > How > > about constructing a test where all relations are greater than 4 pages > > and then try to upgrade them. We can check for a cluster with a > > different number of relations say 10K, 20K, 50K, 100K. > > I did both greater and less than 4 pages for 10k relations. Since > pg_upgrade is O(# relations), I don't see a point in going higher. > > First, I had a problem: On MacOS with their "gcc" wrapper around > clang, I got a segfault 11 when compiled with no debugging symbols. > Did you get this problem with the patch or both with and without the patch? If it is only with patch, then we definitely need to investigate. > I > added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a > debugging build, I'm not sure how to investigate this. IIRC, this > doesn't happen for me on Linux gcc. > > Since it was at least running now, I measured by putting > gettimeofday() calls around transfer_all_new_tablespaces(). I did 10 > runs each and took the average, except for patch/1-page case since it > was obviously faster after a couple runs. > > 5 pages: > master patch > 5.59s 5.64s > > The variation within the builds is up to +/- 0.2s, so there is no > difference, as expected. > > 1 page: > master patch > 5.62s 4.25s > > Clearly, linking is much slower than stat. > The results are fine. Thanks for doing the tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > First, I had a problem: On MacOS with their "gcc" wrapper around > > clang, I got a segfault 11 when compiled with no debugging symbols. > > > > Did you get this problem with the patch or both with and without the > patch? If it is only with patch, then we definitely need to > investigate. Only with the patch. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 13, 2019 at 7:42 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > First, I had a problem: On MacOS with their "gcc" wrapper around > > > clang, I got a segfault 11 when compiled with no debugging symbols. > > > > > > > Did you get this problem with the patch or both with and without the > > patch? If it is only with patch, then we definitely need to > > investigate. > > Only with the patch. > If the problem is reproducible, then I think we can try out a few things to narrow it down or get some clue about the problem: (a) modify new_cluster_needs_fsm() so that it always returns true as its first statement; this will tell us whether the code in that function is causing the problem. (b) run with Valgrind. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> [segfault problems] This now seems spurious. I ran make distclean, git pull, reapplied the patch (leaving out the gettimeofday() calls), and now my upgrade perf test works with default compiler settings. Not sure what happened, but hopefully we can move forward. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 14, 2019 at 7:08 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > [segfault problems] > > This now seems spurious. I ran make distclean, git pull, reapplied the > patch (leaving out the gettimeofday() calls), and now my upgrade perf > test works with default compiler settings. Not sure what happened, but > hopefully we can move forward. > Yeah, I took another pass through the patch and changed a few minor things, as you can see in the attached patch. 1. Added an Assert in new_cluster_needs_fsm() that old cluster version should be >= 804 as that is where fsm support has been added. 2. Reverted the old cluster version check to <= 1100. There was nothing wrong with the way you had written the check, but I find that most of the other places in the code use that style. 3. At one place, changed the function header to be consistent with the nearby code, ran pgindent on the code, and changed the commit message. Let me know what you think. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
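To put the version checks discussed above (the 8.4 cutoff for fsm forks and the <= 1100 check on the old cluster) in context, here is a sketch of how the caller-side gating could look; the surrounding loop variables, the "_fsm" suffix handling, and the exact placement of the checks are assumptions for illustration rather than the attached patch. As noted earlier in the thread, pg_class.relpages is also consulted first, so the stat() call only happens for relations that already look small.

        /* fsm forks only exist in 8.4 and later */
        if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
        {
            if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1100)
            {
                /*
                 * An old cluster up to v11 may have an FSM even for a tiny
                 * heap; the new cluster would not create one below the
                 * threshold, so skip transferring the fork for such heaps.
                 */
                if (new_cluster_needs_fsm(&maps[mapnum]))
                    transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
            }
            else
                transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
        }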
On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > should be >= 804 as that is where fsm support has been added. There is already an explicit check for 804 in the caller, and the HEAD code is already resilient to FSMs not existing, so I think this is superfluous. > 2. Reverted the old cluster version check to <= 1100. There was > nothing wrong with the way you had written the check, but I find that > most of the other places in the code use that style. > 3. At one place, changed the function header to be consistent with the > nearby code, ran pgindent on the code, and changed the commit message. Looks good to me, thanks. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 14, 2019 at 12:37 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > > should be >= 804 as that is where fsm support has been added. > > There is already an explicit check for 804 in the caller, > Yeah, I know that, but I have added it to prevent this function being used elsewhere. OTOH, maybe you are right that as per current code it is superfluous, so we shouldn't add this assert. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 14, 2019 at 7:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 14, 2019 at 12:37 PM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > > > should be >= 804 as that is where fsm support has been added. > > > > There is already an explicit check for 804 in the caller, > > > > Yeah, I know that, but I have added it to prevent this function being > used elsewhere. OTOH, maybe you are right that as per current code it > is superfluous, so we shouldn't add this assert. > I have committed the latest version of this patch. I think we can wait for a day or two to see if there are any complaints from the buildfarm or in general, and then we can close this CF entry. IIRC, this was the last patch in the series, right? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have committed the latest version of this patch. I think we can > wait for a day or two to see if there are any complaints from the > buildfarm or in general, and then we can close this CF entry. IIRC, > this was the last patch in the series, right? Great, thanks! I'll keep an eye on the buildfarm as well. I just spotted two comments in freespace.c that were true during earlier patch revisions, but are no longer true, so I've attached a fix for those. There are no other patches in the series. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Mar 15, 2019 at 3:40 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have committed the latest version of this patch. I think we can > > wait for a day or two to see if there are any complaints from the > > buildfarm or in general, and then we can close this CF entry. IIRC, > > this was the last patch in the series, right? > > Great, thanks! I'll keep an eye on the buildfarm as well. > > I just spotted two comments in freespace.c that were true during > earlier patch revisions, but are no longer true, so I've attached a > fix for those. > LGTM, so committed. > There are no other patches in the series. > Okay. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 15, 2019 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have committed the latest version of this patch. I think we can > wait for a day or two to see if there are any complaints from the > buildfarm or in general, and then we can close this CF entry. > Closed this CF entry. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com