Thread: WIP: Avoid creation of the free space map for small tables
Hi all, A while back, Robert Haas noticed that the space taken up by very small tables is dominated by the FSM [1]. Tom suggested that we could prevent creation of the FSM until the heap has reached a certain threshold size [2]. Attached is a WIP patch to implement that. I've also attached a SQL script to demonstrate the change in behavior for various scenarios. The behavior that allows for the simplest implementation I could think of is as follows: -The FSM isn't created if the heap has fewer than 10 blocks (or whatever). If the last known good block has insufficient space, try every block before extending the heap. -If a heap with a FSM is truncated back to below the threshold, the FSM stays around and can be used as usual. -If the heap tuples are all deleted, the FSM stays but has no leaf blocks (same as on master). Although it exists, it won't be re-extended until the heap re-passes the threshold. -- Some notes: -For normal mode, I taught fsm_set_and_search() to switch to a non-extending buffer call, but the biggest missing piece is WAL replay. I couldn't find a non-extending equivalent of XLogReadBufferExtended(), so I might have to create one. -There'll need to be some performance testing to make sure there's no regression, and to choose a good value for the threshold. I'll look into that, but if anyone has any ideas for tests, that'll help this effort along. -A possible TODO item is to teach pg_upgrade not to link FSMs for small heaps. I haven't looked into the feasibility of that, however. -RelationGetBufferForTuple() now has two boolean variables that mean "don't use the FSM", but with different behaviors. To avoid confusion, I've renamed use_fsm to always_extend and revised the commentary accordingly. -I've only implemented this for heaps, because indexes (at least B-tree) don't seem to be as eager to create a FSM. I haven't looked at the code, however. -- [1] https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com [2] https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us -- I'll add this to the November commitfest. -John Naylor
Attachment
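To make the proposed fallback concrete, here is a minimal standalone C sketch, not code from the patch, of the behavior in the first bullet above: for a heap below the threshold with no FSM, start at the last known good block and try every block before extending. The array, constants, and function name are invented for illustration; the real logic lives in hio.c and freespace.c and works through the buffer manager.

#include <stdio.h>

#define INVALID_BLOCK (-1)

/* Hypothetical per-block free space, in bytes, for a toy 4-block heap. */
static int freespace[] = {16, 80, 32, 200};
static int nblocks = 4;

/*
 * Toy version of the proposed fallback: with no FSM, start at the last
 * known good block (or the last block of the heap) and try every block
 * before giving up.  Returns a block with enough room, or INVALID_BLOCK
 * meaning "extend the heap".
 */
static int
get_block_without_fsm(int target, int needed)
{
    int tried;

    if (target == INVALID_BLOCK && nblocks > 0)
        target = nblocks - 1;               /* start with the last block */

    for (tried = 0; tried < nblocks; tried++)
    {
        if (freespace[target] >= needed)
            return target;
        target = (target + 1) % nblocks;    /* wrap around, try them all */
    }
    return INVALID_BLOCK;                   /* caller must extend the heap */
}

int
main(void)
{
    printf("need 100 bytes -> block %d\n", get_block_without_fsm(INVALID_BLOCK, 100));
    printf("need 500 bytes -> block %d\n", get_block_without_fsm(INVALID_BLOCK, 500));
    return 0;
}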
On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: > A while back, Robert Haas noticed that the space taken up by very > small tables is dominated by the FSM [1]. Tom suggested that we could > prevent creation of the FSM until the heap has reached a certain > threshold size [2]. Attached is a WIP patch to implement that. I've > also attached a SQL script to demonstrate the change in behavior for > various scenarios. Hi John, You'll need to tweak the test in contrib/pageinspect/sql/page.sql, because it's currently asserting that there is an FSM on a small table so make check-world fails. -- Thomas Munro http://www.enterprisedb.com
On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >> A while back, Robert Haas noticed that the space taken up by very >> small tables is dominated by the FSM [1]. Tom suggested that we could >> prevent creation of the FSM until the heap has reached a certain >> threshold size [2]. Attached is a WIP patch to implement that. I've >> also attached a SQL script to demonstrate the change in behavior for >> various scenarios. > > Hi John, > > You'll need to tweak the test in contrib/pageinspect/sql/page.sql, > because it's currently asserting that there is an FSM on a small table > so make check-world fails. Whoops, sorry about that; the attached patch passes make check-world. While looking into that, I also found a regression: If the cached target block is the last block in the relation and there is no free space, that block will be tried twice. That's been fixed as well. Thanks, -John Naylor
Attachment
John Naylor <jcnaylor@gmail.com> writes: > On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >>> A while back, Robert Haas noticed that the space taken up by very >>> small tables is dominated by the FSM [1]. Tom suggested that we could >>> prevent creation of the FSM until the heap has reached a certain >>> threshold size [2]. Attached is a WIP patch to implement that. BTW, don't we need a similar hack for visibility maps? regards, tom lane
On 10/7/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > John Naylor <jcnaylor@gmail.com> writes: >> On 10/6/18, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> On Sat, Oct 6, 2018 at 7:47 AM John Naylor <jcnaylor@gmail.com> wrote: >>>> A while back, Robert Haas noticed that the space taken up by very >>>> small tables is dominated by the FSM [1]. Tom suggested that we could >>>> prevent creation of the FSM until the heap has reached a certain >>>> threshold size [2]. Attached is a WIP patch to implement that. > > BTW, don't we need a similar hack for visibility maps? The FSM is the bigger bang for the buck, and fairly simple to do, but it would be nice to do something about VMs as well. I'm not sure if simply lacking a VM would be as simple (or as free of downsides) as for the FSM. I haven't studied the VM code in detail, however. -John Naylor
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: > > -There'll need to be some performance testing to make sure there's no > regression, and to choose a good value for the threshold. I'll look > into that, but if anyone has any ideas for tests, that'll help this > effort along. > Can you try with a COPY command which copies just enough tuples to fill the pages equivalent to HEAP_FSM_EXTENSION_THRESHOLD? It seems to me that in such a case the patch will try each of the blocks multiple times. It looks quite lame that we have to try, again and again, the blocks which we have just filled ourselves, but maybe that doesn't matter much as the threshold value is small. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: > > Hi all, > A while back, Robert Haas noticed that the space taken up by very > small tables is dominated by the FSM [1]. Tom suggested that we could > prevent creation of the FSM until the heap has reached a certain > threshold size [2]. Attached is a WIP patch to implement that. I've > also attached a SQL script to demonstrate the change in behavior for > various scenarios. > > The behavior that allows the simplest implementation I thought of is as follows: > > -The FSM isn't created if the heap has fewer than 10 blocks (or > whatever). If the last known good block has insufficient space, try > every block before extending the heap. > > -If a heap with a FSM is truncated back to below the threshold, the > FSM stays around and can be used as usual. > > -If the heap tuples are all deleted, the FSM stays but has no leaf > blocks (same as on master). Although it exists, it won't be > re-extended until the heap re-passes the threshold. > > -- > Some notes: > > -For normal mode, I taught fsm_set_and_search() to switch to a > non-extending buffer call, but the biggest missing piece is WAL > replay. > fsm_set_and_search() { .. + /* + * For heaps we prevent extension of the FSM unless the number of pages + * exceeds HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already + * have a FSM, this will save an inode and a few kB of space. + * For sane threshold values, the FSM address will be zero, so we + * don't bother dealing with anything else. + */ + if (rel->rd_rel->relkind == RELKIND_RELATION + && addr.logpageno == 0) I am not sure if this is a solid way to avoid creating the FSM. What if fsm_set_and_search gets called for a level other than 0? Also, when the relation has more blocks than HEAP_FSM_EXTENSION_THRESHOLD, then the first time vacuum tries to record the free space in a page, won't it skip recording free space for the first HEAP_FSM_EXTENSION_THRESHOLD pages? I think you have found a good way to avoid creating the FSM, but can't we use some simpler technique: if the FSM fork for a relation doesn't exist, then check the heapblk number for which we try to update the FSM, and if it is less than HEAP_FSM_EXTENSION_THRESHOLD, then avoid creating the FSM. > I couldn't find a non-extending equivalent of > XLogReadBufferExtended(), so I might have to create one. > I think it would be better if we can find a common way to avoid creating the FSM both during DO and REDO time. It might be possible if something like what I have said above is feasible. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Sat, Oct 6, 2018 at 12:17 AM John Naylor <jcnaylor@gmail.com> wrote: >> -For normal mode, I taught fsm_set_and_search() to switch to a >> non-extending buffer call, but the biggest missing piece is WAL >> replay. >> > > fsm_set_and_search() > { > .. > + /* > + * For heaps we prevent extension of the FSM unless the number of pages > + * exceeds > HEAP_FSM_EXTENSION_THRESHOLD. For tables that don't already > + * have a FSM, this will save an inode and a few kB > of space. > + * For sane threshold values, the FSM address will be zero, so we > + * don't bother dealing with > anything else. > + */ > + if (rel->rd_rel->relkind == RELKIND_RELATION > + && addr.logpageno == 0) > > I am not sure if this is a solid way to avoid creating FSM. What if > fsm_set_and_search gets called for the level other than 0? Thanks for taking a look. As for levels other than 0, I think that only happens when fsm_set_and_search() is called by fsm_search(), which will not cause extension. > Also, > when the relation has blocks more than HEAP_FSM_EXTENSION_THRESHOLD, > then first time when vacuum will try to record the free space in the > page, won't it skip recording free space for first > HEAP_FSM_EXTENSION_THRESHOLD pages? Hmm, that's a good point. > I think you have found a good way to avoid creating FSM, but can't we > use some simpler technique like if the FSM fork for a relation doesn't > exist, then check the heapblk number for which we try to update the > FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid > creating the FSM. I think I see what you mean, but to avoid the vacuum problem you just mentioned, we'd need to check the relation size, too. I've attached an unpolished revision to do this. It seems to work, but I haven't tested the vacuum issue yet. I'll do that and some COPY performance testing in the next day or so. There's a bit more repetition than I would like, so I'm not sure it's simpler - perhaps RecordPageWithFreeSpace() could be turned into a wrapper around RecordAndGetPageWithFreeSpace(). Also new in this version, some non-functional improvements to hio.c: -debugging calls that are #ifdef'd out. -move some code out into a function instead of adding another goto. >> I couldn't find a non-extending equivalent of >> XLogReadBufferExtended(), so I might have to create one. >> > > I think it would be better if we can find a common way to avoid > creating FSM both during DO and REDO time. It might be possible if > somethin like what I have said above is feasible. That would be ideal. -John Naylor
Attachment
> On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> I think you have found a good way to avoid creating FSM, but can't we >> use some simpler technique like if the FSM fork for a relation doesn't >> exist, then check the heapblk number for which we try to update the >> FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid >> creating the FSM. >> I think it would be better if we can find a common way to avoid >> creating FSM both during DO and REDO time. It might be possible if >> somethin like what I have said above is feasible. I've attached v4, which implements the REDO case, and as closely as possible to the DO case. I've created a new function to guard against creation of the FSM, which is called by RecordPageWithFreeSpace() and RecordAndGetPageWithFreeSpace(). Since XLogRecordPageWithFreeSpace() takes a relfilenode and not a relation, I had to reimplement that separately, but the logic is basically the same. It works under streaming replication. I've also attached a couple SQL scripts which, when the aforementioned DEBUG1 calls are enabled, show what the heap insert code is doing for different scenarios. Make check-world passes. -John Naylor
Attachment
On Sun, Oct 14, 2018 at 1:09 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 10/13/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think you have found a good way to avoid creating FSM, but can't we > > use some simpler technique like if the FSM fork for a relation doesn't > > exist, then check the heapblk number for which we try to update the > > FSM and if it is lesser than HEAP_FSM_EXTENSION_THRESHOLD, then avoid > > creating the FSM. > > I think I see what you mean, but to avoid the vacuum problem you just > mentioned, we'd need to check the relation size, too. > Sure, but vacuum already has the relation size. In general, I think we should try to avoid adding more system calls in the common code path. It can impact performance. A few comments on your latest patch: - +static bool +allow_write_to_fsm(Relation rel, BlockNumber heapBlk) +{ + BlockNumber heap_nblocks; + + if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD || + rel->rd_rel->relkind != RELKIND_RELATION) + return true; + + /* XXX is this value cached? */ + heap_nblocks = RelationGetNumberOfBlocks(rel); + + if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD) + return true; + else + { + RelationOpenSmgr(rel); + return smgrexists(rel->rd_smgr, FSM_FORKNUM); + } +} I think you can avoid calling RelationGetNumberOfBlocks if you call smgrexists first, and for the purpose of vacuum, we can get that as an input parameter. I think one can argue for not changing the interface functions like RecordPageWithFreeSpace to avoid calling RelationGetNumberOfBlocks, but to me, it appears worthwhile to save the additional system call. - targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); - - /* - * If the FSM knows nothing of the rel, try the last page before we - * give up and extend. This avoids one-tuple-per-page syndrome during - * bootstrapping or in a recently-started system. - */ if (targetBlock == InvalidBlockNumber) - { - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); - - if (nblocks > 0) - targetBlock = nblocks - 1; - } + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, + &try_every_page); Is it possible to hide the magic of trying each block within GetPageWithFreeSpace? It will simplify the code, and in the future, if another storage API has a different function for RelationGetBufferForTuple, it will work seamlessly, provided they are using the same FSM. One such user is zheap. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > Few comments on your latest patch: > - > +static bool > +allow_write_to_fsm(Relation rel, BlockNumber heapBlk) > +{ > + BlockNumber heap_nblocks; > + > + if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD || > + rel->rd_rel->relkind != RELKIND_RELATION) > + return true; > + > + /* XXX is this value cached? */ > + heap_nblocks = RelationGetNumberOfBlocks(rel); > + > + if (heap_nblocks > HEAP_FSM_EXTENSION_THRESHOLD) > + return true; > + else > + { > + RelationOpenSmgr(rel); > + return smgrexists(rel->rd_smgr, FSM_FORKNUM); > + } > +} > > I think you can avoid calling RelationGetNumberOfBlocks, if you call > smgrexists before Okay, I didn't know which was cheaper, but I'll check smgrexists first. Thanks for the info. > and for the purpose of vacuum, we can get that as an > input parameter. I think one can argue for not changing the interface > functions like RecordPageWithFreeSpace to avoid calling > RelationGetNumberOfBlocks, but to me, it appears worth to save the > additional system call. I agree, and that should be fairly straightforward. I'll include that in the next patch. > - > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > - > - /* > - * If the FSM knows nothing of the rel, try the last page before we > - * give up and extend. This avoids one-tuple-per-page syndrome during > - * bootstrapping or in a recently-started system. > - */ > if (targetBlock == InvalidBlockNumber) > - { > - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); > - > - if (nblocks > 0) > - targetBlock = nblocks - 1; > - } > + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, > + &try_every_page); > > > Is it possible to hide the magic of trying each block within > GetPageWithFreeSpace? It will simplify the code and in future, if > another storage API has a different function for > RelationGetBufferForTuple, it will work seamlessly, provided they are > using same FSM. One such user is zheap. Hmm, here I'm a bit more skeptical about the trade offs. That would mean, in effect, to put a function called get_page_no_fsm() in the FSM code. ;-) I'm willing to be convinced otherwise, of course, but here's my reasoning: For one, we'd have to pass prevBlockNumber and &try_every_block to GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by extension), which are irrelevant to some callers. In addition, in hio.c, there is a call where we don't want to try any blocks that we have already, much less all of them: /* * Check if some other backend has extended a block for us while * we were waiting on the lock. */ targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); By the time we get to this call, we likely wouldn't trigger the logic to try every block, but I don't think we can guarantee that. We could add a boolean parameter that means "consider trying every block", but I don't think the FSM code should have so much state passed to it. Thanks for reviewing, -John Naylor
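As a sketch of the reordering agreed on here, the toy function below consults the existence of the FSM fork before falling back to the block count. The stub functions merely stand in for smgrexists() and RelationGetNumberOfBlocks(), and the relkind test from the posted allow_write_to_fsm() is left out; this illustrates only the check order, it is not the patch's code.

#include <stdbool.h>
#include <stdio.h>

#define HEAP_FSM_EXTENSION_THRESHOLD 8  /* placeholder value */

/* Stand-ins for backend state; in the patch these come from smgr/relcache. */
static bool fsm_fork_exists = false;
static unsigned int heap_nblocks = 5;

static bool stub_smgrexists(void) { return fsm_fork_exists; }
static unsigned int stub_number_of_blocks(void) { return heap_nblocks; }

/*
 * Reordered test: look at the FSM fork's existence first, and only fall
 * back to asking for the relation size when that doesn't settle it.
 */
static bool
fsm_allow_writes_sketch(unsigned int heapBlk)
{
    if (heapBlk > HEAP_FSM_EXTENSION_THRESHOLD)
        return true;            /* past the threshold: FSM is wanted anyway */

    if (stub_smgrexists())
        return true;            /* FSM already exists: keep using it */

    /* Last resort: is the heap itself already over the threshold? */
    return stub_number_of_blocks() > HEAP_FSM_EXTENSION_THRESHOLD;
}

int
main(void)
{
    printf("block 2, no FSM, 5-block heap: %d\n", fsm_allow_writes_sketch(2));
    fsm_fork_exists = true;
    printf("block 2, FSM fork exists:      %d\n", fsm_allow_writes_sketch(2));
    return 0;
}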
On Mon, Oct 15, 2018 at 4:09 PM John Naylor <jcnaylor@gmail.com> wrote: > On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > Few comments on your latest patch: > > - > > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > - > > - /* > > - * If the FSM knows nothing of the rel, try the last page before we > > - * give up and extend. This avoids one-tuple-per-page syndrome during > > - * bootstrapping or in a recently-started system. > > - */ > > if (targetBlock == InvalidBlockNumber) > > - { > > - BlockNumber nblocks = RelationGetNumberOfBlocks(relation); > > - > > - if (nblocks > 0) > > - targetBlock = nblocks - 1; > > - } > > + targetBlock = get_page_no_fsm(relation, InvalidBlockNumber, > > + &try_every_page); > > > > > > Is it possible to hide the magic of trying each block within > > GetPageWithFreeSpace? It will simplify the code and in future, if > > another storage API has a different function for > > RelationGetBufferForTuple, it will work seamlessly, provided they are > > using same FSM. One such user is zheap. > > Hmm, here I'm a bit more skeptical about the trade offs. That would > mean, in effect, to put a function called get_page_no_fsm() in the FSM > code. ;-) I'm willing to be convinced otherwise, of course, but > here's my reasoning: > > For one, we'd have to pass prevBlockNumber and &try_every_block to > GetPageWithFreeSpace() (and RecordAndGetPageWithFreeSpace() by > extension), which are irrelevant to some callers. > I think we can avoid using prevBlockNumber and try_every_block, if we maintain a small cache which tells whether a particular block has been tried or not. What I am envisioning is that while finding the block with free space, if we come to know that the relation in question is small enough that it doesn't have a FSM, we can perform a local_search. In local_search, we can enquire the cache for any block that we can try, and if we find one, we can try inserting into that block; otherwise, we need to extend the relation. One simple way to imagine such a cache would be an array of structures, where each structure has blkno and status fields. After we get a usable block, we need to clear the cache, if it exists. > In addition, in > hio.c, there is a call where we don't want to try any blocks that we > have already, much less all of them: > > /* > * Check if some other backend has extended a block for us while > * we were waiting on the lock. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > By the time we get to this call, we likely wouldn't trigger the logic > to try every block, but I don't think we can guarantee that. > I think the current code as proposed has that limitation, but if we have a cache, then we can check whether the relation has actually been extended after taking the lock, and if so, we can try only the newly added block(s). I am not completely sure that the idea described above is better, but it seems that it will be clean and can handle some of the cases with ease. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
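A minimal standalone sketch of the cache Amit describes follows: an array of entries carrying a block number and a status, consulted during a local search when the relation has no FSM. All of the names and the cache size are invented for this illustration; it is not code from the thread.

#include <stdio.h>

#define LOCAL_CACHE_SIZE 8              /* covers a small, FSM-less heap */

typedef enum { CACHE_UNUSED, CACHE_UNTRIED, CACHE_TRIED } cache_status;

typedef struct
{
    unsigned int blkno;
    cache_status status;
} local_cache_entry;

static local_cache_entry cache[LOCAL_CACHE_SIZE];

/* Fill the cache with every block of the (small) relation, all untried. */
static void
cache_fill(unsigned int nblocks)
{
    for (unsigned int i = 0; i < LOCAL_CACHE_SIZE; i++)
    {
        cache[i].blkno = i;
        cache[i].status = (i < nblocks) ? CACHE_UNTRIED : CACHE_UNUSED;
    }
}

/* Return an untried block to attempt an insertion into, or -1 to extend. */
static int
cache_next_candidate(void)
{
    for (unsigned int i = 0; i < LOCAL_CACHE_SIZE; i++)
        if (cache[i].status == CACHE_UNTRIED)
            return (int) cache[i].blkno;
    return -1;
}

/* Remember that a block was tried and had no room. */
static void
cache_mark_tried(unsigned int blkno)
{
    cache[blkno].status = CACHE_TRIED;
}

int
main(void)
{
    cache_fill(3);                      /* a 3-block relation with no FSM */
    cache_mark_tried(0);
    cache_mark_tried(1);
    printf("next candidate: %d\n", cache_next_candidate());    /* 2 */
    cache_mark_tried(2);
    printf("next candidate: %d\n", cache_next_candidate());    /* -1: extend */
    return 0;
}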
On 10/15/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think you can avoid calling RelationGetNumberOfBlocks, if you call > smgrexists before This is done in the attached v5, 0001. > and for the purpose of vacuum, we can get that as an > input parameter. I think one can argue for not changing the interface > functions like RecordPageWithFreeSpace to avoid calling > RelationGetNumberOfBlocks, but to me, it appears worth to save the > additional system call. This is done in 0002. I also added a check for the cached value of pg_class.relpages, since it's cheap and may help non-VACUUM callers. > [proposal for a cache of blocks to try] That's interesting. I'll have to do some reading elsewhere in the codebase, and then I'll follow up. Thanks, -John Naylor
Attachment
On Tue, Oct 16, 2018 at 4:27 PM John Naylor <jcnaylor@gmail.com> wrote: > > [proposal for a cache of blocks to try] > > That's interesting. I'll have to do some reading elsewhere in the > codebase, and then I'll follow up. > Thanks, I have changed the status of this patch as "Waiting on Author". Feel free to change it once you have a new patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think we can avoid using prevBlockNumber and try_every_block, if we > maintain a small cache which tells whether the particular block is > tried or not. What I am envisioning is that while finding the block > with free space, if we came to know that the relation in question is > small enough that it doesn't have FSM, we can perform a local_search. > In local_seach, we can enquire the cache for any block that we can try > and if we find any block, we can try inserting in that block, > otherwise, we need to extend the relation. One simple way to imagine > such a cache would be an array of structure and structure has blkno > and status fields. After we get the usable block, we need to clear > the cache, if exists. Here is the design I've implemented in the attached v6. There is more code than v5, but there's a cleaner separation between freespace.c and hio.c, as you preferred. I also think it's more robust. I've expended some effort to avoid doing unnecessary system calls to get the number of blocks. -- For the local, in-memory map, maintain a static array of status markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block number. This is populated every time we call GetPageWithFreeSpace() on small tables with no FSM. The statuses are 'zero' (beyond the relation) 'available to try' 'tried already' Example for a 4-page heap: 01234567 AAAA0000 If we try block 3 and there is no space, we set it to 'tried' and next time through the loop we'll try 2, etc: 01234567 AAAT0000 If we try all available blocks, we will extend the relation. As in the master branch, first we call GetPageWithFreeSpace() again to see if another backend extended the relation to 5 blocks while we were waiting for the lock. If we find a new block, we will mark the new block available and leave the rest alone: 01234567 TTTTA000 On the odd chance we still can't insert into the new block, we'll skip checking any others and we'll redo the logic to extend the relation. If we're about to successfully return a buffer, whether from an existing block, or by extension, we clear the local map. Once this is in shape, I'll do some performance testing. -John Naylor
Attachment
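The following standalone C sketch models the local map just described and reproduces the example transitions (AAAA0000, AAAT0000, TTTTA000). The constant and status names echo the ones mentioned for the v6 patch, but this is a simplified model: the real map lives in freespace.c and is consulted and cleared around buffer access.

#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 8

/* Status flags, echoing the ones described for the v6 patch. */
#define FSM_LOCAL_ZERO  0x00    /* beyond the end of the relation */
#define FSM_LOCAL_AVAIL 0x01    /* available to try */
#define FSM_LOCAL_TRIED 0x02    /* already tried, not enough space */

static unsigned char local_map[HEAP_FSM_CREATION_THRESHOLD];

/* Mark every existing block available, leaving "tried" marks in place. */
static void
local_map_set(unsigned int nblocks)
{
    for (unsigned int blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
    {
        if (blkno < nblocks)
            local_map[blkno] |= FSM_LOCAL_AVAIL;    /* OR keeps TRIED bits */
        else
            local_map[blkno] = FSM_LOCAL_ZERO;
    }
}

/* Pick a block that is available and not yet tried, last block first. */
static int
local_map_search(void)
{
    for (int blkno = HEAP_FSM_CREATION_THRESHOLD - 1; blkno >= 0; blkno--)
        if ((local_map[blkno] & FSM_LOCAL_AVAIL) &&
            !(local_map[blkno] & FSM_LOCAL_TRIED))
            return blkno;
    return -1;                  /* nothing left: caller extends the heap */
}

static void
local_map_print(const char *label)
{
    printf("%-24s", label);
    for (int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
    {
        if (local_map[i] & FSM_LOCAL_TRIED)
            putchar('T');
        else if (local_map[i] & FSM_LOCAL_AVAIL)
            putchar('A');
        else
            putchar('0');
    }
    putchar('\n');
}

int
main(void)
{
    local_map_set(4);                   local_map_print("4-block heap:");
    local_map[3] |= FSM_LOCAL_TRIED;    local_map_print("block 3 had no room:");
    local_map[2] |= FSM_LOCAL_TRIED;
    local_map[1] |= FSM_LOCAL_TRIED;
    local_map[0] |= FSM_LOCAL_TRIED;    local_map_print("all four were full:");
    local_map_set(5);                   local_map_print("someone extended to 5:");
    printf("next block to try: %d\n", local_map_search());
    return 0;
}

Compiled on its own, this prints the same 8-character maps as the diagrams above and ends with block 4 as the next candidate.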
I wrote: > Once this is in shape, I'll do some performance testing. On second thought, there's no point in waiting, especially if a regression points to a design flaw. I compiled patched postgres with HEAP_FSM_CREATION_THRESHOLD set to 32, then ran the attached script which populates 100 tables with varying numbers of blocks. I wanted a test that created pages eagerly and wrote to disk as little as possible. Config was stock, except for fsync = off. I took the average of 10 runs after removing the slowest and fastest run:

# blocks   master    patch
       4   36.4ms    33.9ms
       8   50.6ms    48.9ms
      12   58.6ms    66.3ms
      16   65.5ms    81.4ms

It seems under these circumstances a threshold of up to 8 performs comparably to the master branch, with small block numbers possibly faster than with the FSM, provided they're in shared buffers already. I didn't bother testing higher values because it's clear there's a regression starting around 10 or so, beyond which it helps to have the FSM. A case could be made for setting the threshold to 4, since not having 3 blocks of FSM in shared buffers exactly makes up for the 3 other blocks of heap that are checked when free space runs out. I can run additional tests if there's interest. -John Naylor
Attachment
Upthread I wrote: > -A possible TODO item is to teach pg_upgrade not to link FSMs for > small heaps. I haven't looked into the feasibility of that, however. This turned out to be relatively lightweight (0002 attached). I had to add relkind to the RelInfo struct and save the size of each heap as it's transferred. The attached SQL script will set up a couple of test cases to demonstrate pg_upgrade. Installations with large numbers of small tables will be able to see space savings right away. For 0001, I adjusted the README and docs, and also made some cosmetic improvements to the code, mostly in the comments. I've set the commitfest entry back to 'needs review'. One thing I noticed is that one behavior on master hasn't changed: System catalogs created during bootstrap still have a FSM if they have any data. Possibly related, catalogs also have a VM even if they have no data at all. This isn't anything to get excited about, but it would be nice to investigate, at least so it can be documented. A cursory dig hasn't found the cause, but I'll keep doing that as time permits. -John Naylor
Attachment
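As a rough illustration of the pg_upgrade change, here is a toy decision function for whether to transfer a relation's FSM fork, based on relkind and the size of the main fork. It is not pg_upgrade code: the threshold value is a placeholder, the function name is invented, and the actual patch carries this information in the RelInfo struct during the file transfer.

#include <stdbool.h>
#include <stdio.h>

#define BLCKSZ 8192
#define HEAP_FSM_CREATION_THRESHOLD 4   /* placeholder threshold, in blocks */

/*
 * Toy decision: during an upgrade-style transfer, carry the FSM fork over
 * only when the relation is not a small heap.  'relkind' uses the catalog
 * letters ('r' = ordinary table), and main_fork_bytes is the size of the
 * relation's main fork in the old cluster.
 */
static bool
transfer_fsm_fork(char relkind, long long main_fork_bytes)
{
    long long nblocks = main_fork_bytes / BLCKSZ;

    if (relkind != 'r')
        return true;            /* only heaps are subject to the threshold */

    return nblocks > HEAP_FSM_CREATION_THRESHOLD;
}

int
main(void)
{
    printf("3-block table:   %s\n",
           transfer_fsm_fork('r', 3 * BLCKSZ) ? "link FSM" : "skip FSM");
    printf("100-block table: %s\n",
           transfer_fsm_fork('r', 100LL * BLCKSZ) ? "link FSM" : "skip FSM");
    return 0;
}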
On Wed, Oct 31, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > Upthread I wrote: > > > -A possible TODO item is to teach pg_upgrade not to link FSMs for > > small heaps. I haven't look into the feasibility of that, however. > > This turned out to be relatively light weight (0002 attached). I had > to add relkind to the RelInfo struct and save the size of each heap as > its transferred. The attached SQL script will setup a couple test > cases to demonstrate pg_upgrade. Installations with large numbers of > small tables will be able to see space savings right away. > > For 0001, I adjusted the README and docs, and also made some cosmetic > improvements to the code, mostly in the comments. I've set the > commitfest entry back to 'needs review' > > One thing I noticed is that one behavior on master hasn't changed: > System catalogs created during bootstrap still have a FSM if they have > any data. Possibly related, catalogs also have a VM even if they have > no data at all. This isn't anything to get excited about, but it would > be nice to investigate, at least so it can be documented. A cursory > dig hasn't found the cause, but I'll keep doing that as time permits. > Thanks for your work on this, I will try to review it during CF. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote: > A case could be made for setting the threshold to 4, since not having > 3 blocks of FSM in shared buffers exactly makes up for the 3 other > blocks of heap that are checked when free space runs out. That doesn't seem like an unreasonable argument. I'm not sure whether the right threshold is 4 or something a little bigger, but I bet it's not very large. It seems important to me that before anybody thinks about committing this, we construct some kind of destruction case where repeated scans of the whole table are triggered as frequently as possible, and then run that test with varying thresholds. I might be totally wrong, but I bet with a value as large as 32 you will be able to find cases where it regresses in a big way. We also need to think about what happens on the standby, where the FSM is updated in a fairly different way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 31, 2018 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Oct 23, 2018 at 9:42 AM John Naylor <jcnaylor@gmail.com> wrote: > > A case could be made for setting the threshold to 4, since not having > > 3 blocks of FSM in shared buffers exactly makes up for the 3 other > > blocks of heap that are checked when free space runs out. > > That doesn't seem like an unreasonable argument. I'm not sure whether > the right threshold is 4 or something a little bigger, but I bet it's > not very large. It seems important to me that before anybody thinks > about committing this, we construct some kind of destruction case > where repeated scans of the whole table are triggered as frequently as > possible, and then run that test with varying thresholds. > Why do you think repeated scans will be a destruction case when there is no FSM for a small table? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > That doesn't seem like an unreasonable argument. I'm not sure whether > > the right threshold is 4 or something a little bigger, but I bet it's > > not very large. It seems important to me that before anybody thinks > > about committing this, we construct some kind of destruction case > > where repeated scans of the whole table are triggered as frequently as > > possible, and then run that test with varying thresholds. > > Why do you think repeated scans will be a destruction case when there > is no FSM for a small table? That's not what I'm saying. If we don't have the FSM, we have to check every page of the table. If there's a workload where that happens a lot on a table that is just under the size threshold for creating the FSM, then it's likely to be a worst case for this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > That's not what I'm saying. If we don't have the FSM, we have to > check every page of the table. If there's a workload where that > happens a lot on a table that is just under the size threshold for > creating the FSM, then it's likely to be a worst case for this patch. Hmm, you're assuming something not in evidence: why would that be the algorithm? On a FSM-less table, I'd be inclined to just check the last page and then grow the table if the tuple doesn't fit there. This would, in many cases, soon result in a FSM being created, but I think that's just fine. The point of the change is to optimize for cases where a table *never* gets more than a few inserts. Not, IMO, for cases where a table gets a lot of churn but never has a whole lot of live tuples. In the latter scenario we are far better off having a FSM. regards, tom lane
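For comparison with the try-every-block behavior in the patch, here is a tiny standalone sketch of the alternative Tom describes: with no FSM, look only at the last page, and otherwise extend. The toy array and function name are invented for illustration.

#include <stdio.h>

#define INVALID_BLOCK (-1)

/* Free space per block of a toy 3-block heap. */
static int freespace[] = {120, 40, 16};
static int nblocks = 3;

/*
 * Last-page-only policy: if the tuple doesn't fit on the last page, tell
 * the caller to extend the relation (which, for a table that keeps
 * growing, soon pushes it past the threshold so a FSM gets created).
 */
static int
last_page_only(int needed)
{
    if (nblocks > 0 && freespace[nblocks - 1] >= needed)
        return nblocks - 1;
    return INVALID_BLOCK;       /* extend the relation */
}

int
main(void)
{
    printf("need 10 bytes  -> block %d\n", last_page_only(10));
    printf("need 100 bytes -> block %d\n", last_page_only(100));
    return 0;
}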
On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > That's not what I'm saying. If we don't have the FSM, we have to > > check every page of the table. If there's a workload where that > > happens a lot on a table that is just under the size threshold for > > creating the FSM, then it's likely to be a worst case for this patch. > > Hmm, you're assuming something not in evidence: why would that be the > algorithm? I think it's in evidence, in the form of several messages mentioning a flag called try_every_block. Just checking the last page of the table doesn't sound like a good idea to me. I think that will just lead to a lot of stupid bloat. It seems likely that checking every page of the table is fine for npages <= 3, and that would still be a win in a very significant number of cases, since lots of instances have many empty or tiny tables. I was merely reacting to the suggestion that the approach should be used for npages <= 32; that threshold sounds way too high. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Nov 2, 2018 at 10:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >> > That's not what I'm saying. If we don't have the FSM, we have to >> > check every page of the table. If there's a workload where that >> > happens a lot on a table that is just under the size threshold for >> > creating the FSM, then it's likely to be a worst case for this patch. >> >> Hmm, you're assuming something not in evidence: why would that be the >> algorithm? > > I think it's in evidence, in the form of several messages mentioning a > flag called try_every_block. Correct. > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be a win in a very significant number of > cases, since lots of instances have many empty or tiny tables. I was > merely reacting to the suggestion that the approach should be used for > npages <= 32; that threshold sounds way too high. To be clear, no one suggested that. The patch has always had 8 or 10 as a starting point, and I've mentioned 4 and 8 as good possibilities based on the COPY tests upthread. It was apparent that I didn't need to recompile a bunch of binaries with different thresholds. All I had to do was compile with a threshold much larger than required, and then test inserting into X number of pages, to simulate a threshold of X. I increased X until I saw a regression. That's where the 32 came from. Sorry if that was misleading; in my head it was obvious. I'd be happy to test other scenarios. I'm not sure how to test redo -- it seems more difficult to get meaningful results than in the normal case. -John Naylor
On Fri, Nov 2, 2018 at 7:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Nov 2, 2018 at 7:23 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > That doesn't seem like an unreasonable argument. I'm not sure whether > > > the right threshold is 4 or something a little bigger, but I bet it's > > > not very large. It seems important to me that before anybody thinks > > > about committing this, we construct some kind of destruction case > > > where repeated scans of the whole table are triggered as frequently as > > > possible, and then run that test with varying thresholds. > > > > Why do you think repeated scans will be a destruction case when there > > is no FSM for a small table? > > That's not what I'm saying. If we don't have the FSM, we have to > check every page of the table. If there's a workload where that > happens a lot on a table that is just under the size threshold for > creating the FSM, then it's likely to be a worst case for this patch. > That makes sense and this is the first thing I was also worried about after looking at the initial patch and suggested a test [1] which can hit the worst case. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BhP-jGYWi25-1QMedxeM_0H01s%3D%3D4-t74oEgL2EDVicw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 2, 2018 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > > That's not what I'm saying. If we don't have the FSM, we have to > > check every page of the table. If there's a workload where that > > happens a lot on a table that is just under the size threshold for > > creating the FSM, then it's likely to be a worst case for this patch. > > Hmm, you're assuming something not in evidence: why would that be the > algorithm? On a FSM-less table, I'd be inclined to just check the > last page and then grow the table if the tuple doesn't fit there. > This would, in many cases, soon result in a FSM being created, but > I think that's just fine. The point of the change is to optimize > for cases where a table *never* gets more than a few inserts. Not, IMO, > for cases where a table gets a lot of churn but never has a whole lot of > live tuples. In the latter scenario we are far better off having a FSM. > In the past, you seem to have suggested an approach to try each block [1] for small tables which don't have FSM. I think if we do what you are suggesting now, then we don't need to worry much about any regression and code will be somewhat simpler, but OTOH, I don't see much harm in trying every block if we keep the threshold as no more than 4. That would address more cases. [1] - https://www.postgresql.org/message-id/11360.1345502641%40sss.pgh.pa.us -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote: > I think it's in evidence, in the form of several messages mentioning a > flag called try_every_block. > > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be a win in a very significant number of > cases, since lots of instances have many empty or tiny tables. I was > merely reacting to the suggestion that the approach should be used for > npages <= 32; that threshold sounds way too high. It seems to me that it would be costly for schemas which have one core table with a couple of records used in many joins with other queries. Imagine for example a core table like that: CREATE TABLE us_states (id serial, initials varchar(2)); INSERT INTO us_states VALUES (DEFAULT, 'CA'); If there is a workload where those initials need to be fetched a lot, this patch could cause a loss. It looks hard to me to put a straight number on when not having the FSM is better than having it, because that could be environment-dependent, so there is an argument for making the default very low but still configurable? -- Michael
Attachment
On Sun, Nov 4, 2018 at 5:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Nov 02, 2018 at 10:38:45AM -0400, Robert Haas wrote: > > I think it's in evidence, in the form of several messages mentioning a > > flag called try_every_block. > > > > Just checking the last page of the table doesn't sound like a good > > idea to me. I think that will just lead to a lot of stupid bloat. It > > seems likely that checking every page of the table is fine for npages > > <= 3, and that would still be a win in a very significant number of > > cases, since lots of instances have many empty or tiny tables. I was > > merely reacting to the suggestion that the approach should be used for > > npages <= 32; that threshold sounds way too high. > > It seems to me that it would be costly for schemas which have one core > table with a couple of records used in many joins with other queries. > Imagine for example a core table like that: > CREATE TABLE us_states (id serial, initials varchar(2)); > INSERT INTO us_states VALUES (DEFAULT, 'CA'); > > If there is a workload where those initials need to be fetched a lot, > this patch could cause a loss. > How would fetching alone cause any loss? If it gets updated, then there is a chance that we might have some performance impact. > It looks hard to me to put a straight > number on when not having the FSM is better than having it because that > could be environment-dependent, so there is an argument for making the > default very low, still configurable? > I think 3 or 4 as a threshold should work fine (though we need to thoroughly test that) as we will anyway avoid having three additional pages of FSM for such tables. I am not sure how easy it would be for users to set this value if we make it configurable, or on what basis they would configure it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 10/31/18, Robert Haas <robertmhaas@gmail.com> wrote: > It seems important to me that before anybody thinks > about committing this, we construct some kind of destruction case > where repeated scans of the whole table are triggered as frequently as > possible, and then run that test with varying thresholds. I might be > totally wrong, but I bet with a value as large as 32 you will be able > to find cases where it regresses in a big way. Here's an attempt at a destruction case: Lobotomize the heap insert logic such that it never checks the cached target block and has to call the free space logic for every single insertion, like this:

index ff13c03083..5d5b36af29 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -377,7 +377,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	else if (bistate && bistate->current_buf != InvalidBuffer)
 		targetBlock = BufferGetBlockNumber(bistate->current_buf);
 	else
-		targetBlock = RelationGetTargetBlock(relation);
+		targetBlock = InvalidBlockNumber;
 
 	if (targetBlock == InvalidBlockNumber && use_fsm)
 	{

(with the threshold patch I had to do additional work) With the small tuples used in the attached v2 test, this means the free space logic is called ~225 times per block. The test tables are pre-filled with one tuple and vacuumed so that the FSMs are already created when testing the master branch. The patch branch is compiled with a threshold of 8, but testing inserts of 4 pages will effectively simulate a threshold of 4, etc. As before, trimmed average of 10 runs, loading to 100 tables each:

# blocks   master    patch
 2         25.1ms    30.3ms
 4         40.7ms    48.1ms
 6         56.6ms    64.7ms
 8         73.1ms    82.0ms

Without this artificial penalty, the 8 block case was about 50ms for both branches. So if I calculated right, of that 50 ms, master is spending ~0.10ms looking for free space, and the patch is spending about ~0.15ms. So, from that perspective, the difference is trivial. Of course, this is a single client, so not entirely realistic. I think that shared buffer considerations are most important for deciding the threshold. > We also need to think about what happens on the standby, where the FSM > is updated in a fairly different way. Were you referring to performance or just functionality? Because the threshold works on the standby, but I don't know about the performance there. -John Naylor
Attachment
On 11/2/18, Tom Lane <tgl@sss.pgh.pa.us> wrote: > On a FSM-less table, I'd be inclined to just check the > last page and then grow the table if the tuple doesn't fit there. > This would, in many cases, soon result in a FSM being created, but > I think that's just fine. The point of the change is to optimize > for cases where a table *never* gets more than a few inserts. Not, IMO, > for cases where a table gets a lot of churn but never has a whole lot of > live tuples. In the latter scenario we are far better off having a FSM. and... On 11/2/18, Robert Haas <robertmhaas@gmail.com> wrote: > Just checking the last page of the table doesn't sound like a good > idea to me. I think that will just lead to a lot of stupid bloat. It > seems likely that checking every page of the table is fine for npages > <= 3, and that would still be win in a very significant number of > cases, I see the merit of both of these arguments, and it occurred to me that there is middle ground between checking only the last page and checking every page: Check the last 3 pages and set the threshold to 6. That way, with npages <= 3, every page will be checked. In the unlikely case that npages = 6 and the first 3 pages are all wasted space, that's the amount of space that would have gone to the FSM anyway, and the relation will likely grow beyond the threshold soon, at which point the free space will become visible again. -John Naylor
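A standalone sketch of this middle ground, assuming the rule is "scan backwards over at most the last 3 pages": every page of a heap with 3 or fewer pages is still checked, while older pages of a larger FSM-less heap are skipped. The constants and data here are illustrative only.

#include <stdio.h>

#define INVALID_BLOCK  (-1)
#define PAGES_TO_CHECK 3        /* "check the last 3 pages" */

/* Free space per block of a toy 5-block heap. */
static int freespace[] = {300, 10, 20, 15, 25};
static int nblocks = 5;

/*
 * With no FSM, look only at the last PAGES_TO_CHECK pages (which is
 * every page for a heap of 3 or fewer pages) before asking the caller
 * to extend.  With a creation threshold of 6, at most three older pages
 * can be temporarily invisible to inserts.
 */
static int
check_recent_pages(int needed)
{
    int limit = (nblocks > PAGES_TO_CHECK) ? nblocks - PAGES_TO_CHECK : 0;

    for (int blkno = nblocks - 1; blkno >= limit; blkno--)
        if (freespace[blkno] >= needed)
            return blkno;
    return INVALID_BLOCK;       /* extend the relation */
}

int
main(void)
{
    printf("need 20 bytes  -> block %d\n", check_recent_pages(20));
    printf("need 200 bytes -> block %d\n", check_recent_pages(200));  /* block 0 is never looked at */
    return 0;
}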
On Mon, Oct 22, 2018 at 12:14 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 10/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think we can avoid using prevBlockNumber and try_every_block, if we > > maintain a small cache which tells whether the particular block is > > tried or not. What I am envisioning is that while finding the block > > with free space, if we came to know that the relation in question is > > small enough that it doesn't have FSM, we can perform a local_search. > > In local_seach, we can enquire the cache for any block that we can try > > and if we find any block, we can try inserting in that block, > > otherwise, we need to extend the relation. One simple way to imagine > > such a cache would be an array of structure and structure has blkno > > and status fields. After we get the usable block, we need to clear > > the cache, if exists. > > Here is the design I've implemented in the attached v6. There is more > code than v5, but there's a cleaner separation between freespace.c and > hio.c, as you preferred. > This approach seems better. > I also think it's more robust. I've expended > some effort to avoid doing unnecessary system calls to get the number > of blocks. > -- > > For the local, in-memory map, maintain a static array of status > markers, of fixed-length HEAP_FSM_CREATION_THRESHOLD, indexed by block > number. This is populated every time we call GetPageWithFreeSpace() on > small tables with no FSM. The statuses are > > 'zero' (beyond the relation) > 'available to try' > 'tried already' > +/* Status codes for the local map. */ +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ Instead of maintaining three states, can't we do with two states (Available and Not Available), basically combine 0 and 2 in your case. I think it will save some cycles in fsm_local_set, where each time you need to initialize all the entries in the map. I think we can argue that it is not much overhead, but I think it is better code-wise also if we can make it happen with fewer states. Some assorted comments: 1. <para> -Each heap and index relation, except for hash indexes, has a Free Space Map +Each heap relation, unless it is very small, and each index relation, +except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation. It's stored It appears that line has ended abruptly. 2. page = BufferGetPage(buffer); + targetBlock = BufferGetBlockNumber(buffer); if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", - BufferGetBlockNumber(buffer), + targetBlock, RelationGetRelationName(relation)); PageInit(page, BufferGetPageSize(buffer), 0); @@ -623,7 +641,18 @@ loop: * current backend to make more insertions or not, which is probably a * good bet most of the time. So for now, don't add it to FSM yet. */ - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); + RelationSetTargetBlock(relation, targetBlock); Is this related to this patch? If not, I suggest let's do it separately if required. 3. static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, - uint8 newValue, uint8 minValue); + uint8 newValue, uint8 minValue); This appears to be a spurious change. 4. @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size len, * target. 
*/ targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); + + /* + * In case we used an in-memory map of available blocks, reset + * it for next use. + */ + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) + ClearLocalMap(); How will you clear the local map during error? I think you need to clear it in abort path and you can name the function as FSMClearLocalMap or something like that. 5. +/*#define TRACE_TARGETBLOCK */ Debugging leftover, do you want to retain this and related stuff during the development of patch? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > +/* Status codes for the local map. */ > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ > > Instead of maintaining three states, can't we do with two states > (Available and Not Available), basically combine 0 and 2 in your case. > I think it will save some cycles in > fsm_local_set, where each time you need to initialize all the entries > in the map. I think we can argue that it is not much overhead, but I > think it is better code-wise also if we can make it happen with fewer > states. That'd work too, but let's consider this scenario: We have a 2-block table that has no free space. After trying each block, the local cache looks like 0123 TT00 Let's say we have to wait to acquire a relation extension lock, because another backend had already started extending the heap by 1 block. We call GetPageWithFreeSpace() and now the local map looks like 0123 TTA0 By using bitwise OR to set availability, the already-tried blocks remain as they are. With only 2 states, the map would look like this instead: 0123 AAAN If we assume that an insert into the newly-created block 2 will almost always succeed, we don't have to worry about wasting time re-checking the first 2 full blocks. Does that sound right to you? > Some assorted comments: > 1. > <para> > -Each heap and index relation, except for hash indexes, has a Free Space > Map > +Each heap relation, unless it is very small, and each index relation, > +except for hash indexes, has a Free Space Map > (FSM) to keep track of available space in the relation. It's stored > > It appears that line has ended abruptly. Not sure what you're referring to here. > 2. > page = BufferGetPage(buffer); > + targetBlock = BufferGetBlockNumber(buffer); > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > - BufferGetBlockNumber(buffer), > + targetBlock, > RelationGetRelationName(relation)); > > PageInit(page, BufferGetPageSize(buffer), 0); > @@ -623,7 +641,18 @@ loop: > * current backend to make more insertions or not, which is probably a > * good bet most of the time. So for now, don't add it to FSM yet. > */ > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > + RelationSetTargetBlock(relation, targetBlock); > > Is this related to this patch? If not, I suggest let's do it > separately if required. I will separate this out. > 3. > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > - uint8 newValue, uint8 minValue); > + uint8 newValue, uint8 minValue); > > This appears to be a spurious change. It was intentional, but I will include it separately as above. > 4. > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > len, > * target. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > + > + /* > + * In case we used an in-memory map of available blocks, reset > + * it for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + ClearLocalMap(); > > How will you clear the local map during error? I think you need to > clear it in abort path and you can name the function as > FSMClearLocalMap or something like that. That sounds right, and I will rename the function that way. For the abort path, were you referring to this or somewhere else? 
if (!PageIsNew(page)) elog(ERROR, "page %u of relation \"%s\" should be empty but is not", targetBlock, RelationGetRelationName(relation)); > 5. > +/*#define TRACE_TARGETBLOCK */ > > Debugging leftover, do you want to retain this and related stuff > during the development of patch? I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's useful for development, but I don't particularly care whether it's in the final version. Also, I found an off-by-one error that caused an unnecessary smgrexists() call in tables with threshold + 1 pages. This will be fixed in the next version. -John Naylor
On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > +/* Status codes for the local map. */ > > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ > > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ > > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ > > > > Instead of maintaining three states, can't we do with two states > > (Available and Not Available), basically combine 0 and 2 in your case. > > I think it will save some cycles in > > fsm_local_set, where each time you need to initialize all the entries > > in the map. I think we can argue that it is not much overhead, but I > > think it is better code-wise also if we can make it happen with fewer > > states. > > That'd work too, but let's consider this scenario: We have a 2-block > table that has no free space. After trying each block, the local cache > looks like > > 0123 > TT00 > > Let's say we have to wait to acquire a relation extension lock, > because another backend had already started extending the heap by 1 > block. We call GetPageWithFreeSpace() and now the local map looks like > > 0123 > TTA0 > > By using bitwise OR to set availability, the already-tried blocks > remain as they are. With only 2 states, the map would look like this > instead: > > 0123 > AAAN > I expect below part of code to go-away. +fsm_local_set(Relation rel, BlockNumber nblocks) { .. + /* + * If the blkno is beyond the end of the relation, the status should + * be zero already, but make sure it is. If the blkno is within the + * relation, mark it available unless it's already been tried. + */ + for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++) + { + if (blkno < nblocks) + FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL; + else + FSMLocalMap[blkno] = FSM_LOCAL_ZERO; + } .. } In my mind for such a case it should look like below: 0123 NNAN > If we assume that an insert into the newly-created block 2 will almost > always succeed, we don't have to worry about wasting time re-checking > the first 2 full blocks. Does that sound right to you? > As explained above, such a situation won't exist. > > > Some assorted comments: > > 1. > > <para> > > -Each heap and index relation, except for hash indexes, has a Free Space > > Map > > +Each heap relation, unless it is very small, and each index relation, > > +except for hash indexes, has a Free Space Map > > (FSM) to keep track of available space in the relation. It's stored > > > > It appears that line has ended abruptly. > > Not sure what you're referring to here. > There is a space after "has a Free Space Map " so you can combine next line. > > 2. > > page = BufferGetPage(buffer); > > + targetBlock = BufferGetBlockNumber(buffer); > > > > if (!PageIsNew(page)) > > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > > - BufferGetBlockNumber(buffer), > > + targetBlock, > > RelationGetRelationName(relation)); > > > > PageInit(page, BufferGetPageSize(buffer), 0); > > @@ -623,7 +641,18 @@ loop: > > * current backend to make more insertions or not, which is probably a > > * good bet most of the time. So for now, don't add it to FSM yet. > > */ > > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > > + RelationSetTargetBlock(relation, targetBlock); > > > > Is this related to this patch? If not, I suggest let's do it > > separately if required. > > I will separate this out. > > > 3. 
> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > > - uint8 newValue, uint8 minValue); > > + uint8 newValue, uint8 minValue); > > > > This appears to be a spurious change. > > It was intentional, but I will include it separately as above. > > > 4. > > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > > len, > > * target. > > */ > > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > + > > + /* > > + * In case we used an in-memory map of available blocks, reset > > + * it for next use. > > + */ > > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > > + ClearLocalMap(); > > > > How will you clear the local map during error? I think you need to > > clear it in abort path and you can name the function as > > FSMClearLocalMap or something like that. > > That sounds right, and I will rename the function that way. For the > abort path, were you referring to this or somewhere else? > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > targetBlock, > RelationGetRelationName(relation)); > I think it might come from any other place between when you set it and before it got cleared (like any intermediate buffer and pin related API's). > > 5. > > +/*#define TRACE_TARGETBLOCK */ > > > > Debugging leftover, do you want to retain this and related stuff > > during the development of patch? > > I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's > useful for development, but I don't particularly care whether it's in > the final verision. > Okay, so if you want to retain it for the period of development, then I am fine with it. We can see at the end if it makes sense to retain it. > Also, I found an off-by-one error that caused an unnecessary > smgrexists() call in tables with threshold + 1 pages. This will be > fixed in the next version. > Thanks. One other thing that slightly bothers me is the call to RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call will happen quite frequently in this code-path and can have some performance impact. As of now, I don't have any idea to avoid it or reduce it more than what you already have in the patch, but I think we should try some more to avoid it. Let me know if you have any ideas around that? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
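One way a two-state map could still produce the NNAN picture Amit sketches above is to remember how many blocks the map was last built for, so that a rebuild for a larger relation marks only the newly added blocks available (the older ones are known to have been full). The standalone C sketch below is a guess at that scheme for illustration; the struct layout and names are not taken from any posted patch.

#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 8

#define LOCAL_NOT_AVAIL 0x00
#define LOCAL_AVAIL     0x01

/* Two-state map plus a remembered block count from the previous fill. */
typedef struct
{
    unsigned int  nblocks;      /* blocks covered by the previous fill */
    unsigned char map[HEAP_FSM_CREATION_THRESHOLD];
} local_map_t;

static local_map_t fsm_local_map;

static void
local_map_set(unsigned int nblocks)
{
    for (unsigned int blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
    {
        if (blkno < fsm_local_map.nblocks)
            fsm_local_map.map[blkno] = LOCAL_NOT_AVAIL;  /* already tried */
        else if (blkno < nblocks)
            fsm_local_map.map[blkno] = LOCAL_AVAIL;      /* new, worth a try */
        else
            fsm_local_map.map[blkno] = LOCAL_NOT_AVAIL;  /* beyond the heap */
    }
    fsm_local_map.nblocks = nblocks;
}

static void
local_map_print(const char *label)
{
    printf("%-30s", label);
    for (unsigned int i = 0; i < HEAP_FSM_CREATION_THRESHOLD; i++)
        putchar(fsm_local_map.map[i] ? 'A' : 'N');
    putchar('\n');
}

int
main(void)
{
    local_map_set(2);       /* fresh 2-block, FSM-less heap */
    local_map_print("2-block heap, nothing tried:");
    local_map_set(3);       /* both blocks were full; block 2 was just added */
    local_map_print("after extension to 3 blocks:");
    return 0;
}

Run on its own, the second line starts with NNA, matching the NNAN diagram above for a 2-block heap that another backend extends to 3 blocks.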
On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: >> Let's say we have to wait to acquire a relation extension lock, >> because another backend had already started extending the heap by 1 >> block. We call GetPageWithFreeSpace() and now the local map looks like >> >> 0123 >> TTA0 >> >> By using bitwise OR to set availability, the already-tried blocks >> remain as they are. With only 2 states, the map would look like this >> instead: >> >> 0123 >> AAAN >> > In my mind for such a case it should look like below: > 0123 > NNAN Okay, to retain that behavior with only 2 status codes, I have implemented the map as a struct with 2 members: the cached number of blocks, plus the same array I had before. This also allows a more efficient implementation at the micro level. I just need to do some more testing on it. [ abortive states ] > I think it might come from any other place between when you set it and > before it got cleared (like any intermediate buffer and pin related > API's). Okay, I will look into that. > One other thing that slightly bothers me is the call to > RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call > will happen quite frequently in this code-path and can have some > performance impact. As of now, I don't have any idea to avoid it or > reduce it more than what you already have in the patch, but I think we > should try some more to avoid it. Let me know if you have any ideas > around that? FWIW, I believe that the callers of RecordPageWithFreeSpace() will almost always avoid that call. Otherwise, there is at least one detail that could use attention: If rel->rd_rel->relpages shows fewer pages than the threshold, than the code doesn't trust it to be true. Might be worth revisiting. Aside from that, I will have to think about it. More generally, I have a couple ideas about performance: 1. Only mark available every other block such that visible blocks are interleaved as the relation extends. To explain, this diagram shows a relation extending, with 1 meaning marked available and 0 meaning marked not-available. A NA ANA NANA So for a 3-block table, we never check block 1. Any free space it has acquired will become visible when it extends to 4 blocks. For a 4-block threshold, we only check 2 blocks or less. This reduces the number of lock/pin events but still controls bloat. We could also check both blocks of a 2-block table. 2. During manual testing I seem to remember times that the FSM code was invoked even though I expected the smgr entry to have a cached target block. Perhaps VACUUM or something is clearing that away unnecessarily. It seems worthwhile to verify and investigate, but that seems like a separate project. -John Naylor
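(A rough sketch of the two-member structure John describes above, where the set function only flags blocks beyond the cached count. Names and the threshold are placeholders and this is not the patch itself; it is only meant to show how the two-state map can still avoid retrying blocks after waiting on the extension lock.)

    #include "postgres.h"
    #include <string.h>
    #include "storage/block.h"

    #define HEAP_FSM_CREATION_THRESHOLD 4   /* example value */

    #define FSM_LOCAL_NOT_AVAIL 0x00    /* not worth trying */
    #define FSM_LOCAL_AVAIL     0x01    /* available to try */

    typedef struct
    {
        BlockNumber nblocks;                           /* blocks considered so far */
        uint8       map[HEAP_FSM_CREATION_THRESHOLD];  /* one byte per block */
    } FSMLocalMapData;

    static FSMLocalMapData fsm_local_map;

    /*
     * Mark blocks available starting after the last block number we have
     * cached.  On first use this flags every block in the relation; after
     * waiting for a relation extension lock it only flags blocks newly added
     * by other backends, so blocks we already tried stay not-available.
     */
    static void
    local_map_set_sketch(BlockNumber new_nblocks)
    {
        BlockNumber blkno;

        for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++)
            fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;

        fsm_local_map.nblocks = new_nblocks;
    }

    /* Reset everything once a usable block is found or the map is abandoned. */
    static void
    local_map_clear_sketch(void)
    {
        memset(&fsm_local_map, 0, sizeof(fsm_local_map));
    }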
On Mon, Nov 19, 2018 at 4:40 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Nov 19, 2018 at 7:30 AM John Naylor <jcnaylor@gmail.com> wrote: > >> Let's say we have to wait to acquire a relation extension lock, > >> because another backend had already started extending the heap by 1 > >> block. We call GetPageWithFreeSpace() and now the local map looks like > >> > >> 0123 > >> TTA0 > >> > >> By using bitwise OR to set availability, the already-tried blocks > >> remain as they are. With only 2 states, the map would look like this > >> instead: > >> > >> 0123 > >> AAAN > >> > > > In my mind for such a case it should look like below: > > 0123 > > NNAN > > Okay, to retain that behavior with only 2 status codes, I have > implemented the map as a struct with 2 members: the cached number of > blocks, plus the same array I had before. This also allows a more > efficient implementation at the micro level. I just need to do some > more testing on it. > Okay. > [ abortive states ] > > I think it might come from any other place between when you set it and > > before it got cleared (like any intermediate buffer and pin related > > API's). > > Okay, I will look into that. > > > One other thing that slightly bothers me is the call to > > RelationGetNumberOfBlocks via fsm_allow_writes. It seems that call > > will happen quite frequently in this code-path and can have some > > performance impact. As of now, I don't have any idea to avoid it or > > reduce it more than what you already have in the patch, but I think we > > should try some more to avoid it. Let me know if you have any ideas > > around that? > > FWIW, I believe that the callers of RecordPageWithFreeSpace() will > almost always avoid that call. Otherwise, there is at least one detail > that could use attention: If rel->rd_rel->relpages shows fewer pages > than the threshold, than the code doesn't trust it to be true. Might > be worth revisiting. > I think it is less of a concern when called from vacuum code path. > Aside from that, I will have to think about it. > > More generally, I have a couple ideas about performance: > > 1. Only mark available every other block such that visible blocks are > interleaved as the relation extends. To explain, this diagram shows a > relation extending, with 1 meaning marked available and 0 meaning > marked not-available. > > A > NA > ANA > NANA > > So for a 3-block table, we never check block 1. Any free space it has > acquired will become visible when it extends to 4 blocks. For a > 4-block threshold, we only check 2 blocks or less. This reduces the > number of lock/pin events but still controls bloat. We could also > check both blocks of a 2-block table. > We can try something like this if we see there is any visible performance hit in some scenario. > 2. During manual testing I seem to remember times that the FSM code > was invoked even though I expected the smgr entry to have a cached > target block. Perhaps VACUUM or something is clearing that away > unnecessarily. It seems worthwhile to verify and investigate, but that > seems like a separate project. > makes sense, let's not get distracted by stuff that is not related to this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
I wrote: > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > [ abortive states ] >> I think it might come from any other place between when you set it and >> before it got cleared (like any intermediate buffer and pin related >> API's). > > Okay, I will look into that. LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't raise errors at this level. I don't immediately see any additional good places from which to clear the local map. -John Naylor
On Tue, Nov 20, 2018 at 1:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > I wrote: > > > On 11/19/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > [ abortive states ] > >> I think it might come from any other place between when you set it and > >> before it got cleared (like any intermediate buffer and pin related > >> API's). > > > > Okay, I will look into that. > > LockBuffer(), visibilitymap_pin(), and GetVisibilityMapPins() don't > raise errors at this level. I don't immediately see any additional good > places from which to clear the local map. > LockBuffer()->LWLockAcquire() can error out. Similarly, ReadBuffer()->ReadBufferExtended() and calls below it can error out. To handle them, you need to add a call to clear the local map in the AbortTransaction code path. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: I've attached v8, which includes the 2-state map and addresses the points below: > Some assorted comments: > 1. > <para> > -Each heap and index relation, except for hash indexes, has a Free Space > Map > +Each heap relation, unless it is very small, and each index relation, > +except for hash indexes, has a Free Space Map > (FSM) to keep track of available space in the relation. It's stored > > It appears that line has ended abruptly. Revised. > 2. > page = BufferGetPage(buffer); > + targetBlock = BufferGetBlockNumber(buffer); > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > - BufferGetBlockNumber(buffer), > + targetBlock, > RelationGetRelationName(relation)); > > PageInit(page, BufferGetPageSize(buffer), 0); > @@ -623,7 +641,18 @@ loop: > * current backend to make more insertions or not, which is probably a > * good bet most of the time. So for now, don't add it to FSM yet. > */ > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > + RelationSetTargetBlock(relation, targetBlock); > > Is this related to this patch? If not, I suggest let's do it > separately if required. > > 3. > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > - uint8 newValue, uint8 minValue); > + uint8 newValue, uint8 minValue); > > This appears to be a spurious change. 2 and 3 are separated into 0001. > 4. > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > len, > * target. > */ > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > + > + /* > + * In case we used an in-memory map of available blocks, reset > + * it for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + ClearLocalMap(); > > How will you clear the local map during error? I think you need to > clear it in abort path and you can name the function as > FSMClearLocalMap or something like that. Done. I've put this call last before abort processing. > 5. > +/*#define TRACE_TARGETBLOCK */ > > Debugging leftover, do you want to retain this and related stuff > during the development of patch? I have kept it aside as a separate patch but not attached it for now. Also, we don't quite have a consensus on the threshold value, but I have set it to 4 pages for v8. If this is still considered too expensive (and basic tests show it shouldn't be), I suspect it'd be better to interleave the available block numbers as described a couple days ago than lower the threshold further. I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset. -John Naylor
Attachment
On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 3. > > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > > - uint8 newValue, uint8 minValue); > > + uint8 newValue, uint8 minValue); > > > > This appears to be a spurious change. > > 2 and 3 are separated into 0001. > Is the point 3 change related to pgindent? I think even if you want these, then don't prepare other patches on top of this, keep it entirely separate. > > Also, we don't quite have a consensus on the threshold value, but I > have set it to 4 pages for v8. If this is still considered too > expensive (and basic tests show it shouldn't be), I suspect it'd be > better to interleave the available block numbers as described a couple > days ago than lower the threshold further. > Can you please repeat the copy test you have done above with fillfactor as 20 and 30? > I have looked at zhio.c, and it seems trivial to adapt zheap to this patchset. > Cool, I also think so. Few more comments: ------------------------------- 1. I think we can add some test(s) to test the new functionality, may be something on the lines of what Robert has originally provided as an example of this behavior [1]. 2. @@ -2554,6 +2555,12 @@ AbortTransaction(void) s->parallelModeLevel = 0; } + /* + * In case we aborted during RelationGetBufferForTuple(), + * clear the local map of heap pages. + */ + FSMClearLocalMap(); + The similar call is required in AbortSubTransaction function as well. I suggest to add it after pgstat_progress_end_command in both functions. 3. GetPageWithFreeSpace(Relation rel, Size spaceNeeded) { .. + if (target_block == InvalidBlockNumber && + rel->rd_rel->relkind == RELKIND_RELATION) + { + nblocks = RelationGetNumberOfBlocks(rel); + + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) + { + /* + * If the FSM knows nothing of the rel, try the last page before + * we give up and extend. This avoids one-tuple-per-page syndrome + * during bootstrapping or in a recently-started system. + */ + target_block = nblocks - 1; + } .. } Moving this check inside GetPageWithFreeSpace has one disadvantage, we will always consider last block which can have some inadvertent effects. Consider when this function gets called from RelationGetBufferForTuple just before extension, it can consider to check for the last block even though that is already being done in the begining when GetPageWithFreeSpace was called. I am not completely sure how much this is a case to worry because it will help to check last block when the same is concurrently added and FSM is not updated for same. I am slightly worried because the unpatched code doesn't care for such case and we have no intention to change this behaviour. What do you think? 4. You have mentioned above that "system catalogs created during bootstrap still have a FSM if they have any data." and I can also see this behavior, have you investigated this point further? 5. Your logic to update FSM on standby seems okay, but can you show some tests which proves its sanity? [1] - https://www.postgresql.org/message-id/CA%2BTgmoac%2B6qTNp2U%2BwedY8-PU6kK_b6hbdhR5xYGBG3GtdFcww%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: >> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> > >> > 3. >> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 >> > slot, >> > - uint8 newValue, uint8 minValue); >> > + uint8 newValue, uint8 minValue); >> > >> > This appears to be a spurious change. >> >> 2 and 3 are separated into 0001. >> > > Is the point 3 change related to pgindent? I think even if you want > these, then don't prepare other patches on top of this, keep it > entirely separate. There are some places in the codebase that have spaces after tabs for no apparent reason (not to line up function parameters or pointer declarations). pgindent hasn't been able to fix this. If you wish, you are of course free to apply 0001 separately at any time. >> Also, we don't quite have a consensus on the threshold value, but I >> have set it to 4 pages for v8. If this is still considered too >> expensive (and basic tests show it shouldn't be), I suspect it'd be >> better to interleave the available block numbers as described a couple >> days ago than lower the threshold further. >> > > Can you please repeat the copy test you have done above with > fillfactor as 20 and 30? I did two kinds of tests. The first had a fill-factor of 10 [1], the second had the default storage, but I prevented the backend from caching the target block [2], to fully exercise the free space code. Would you like me to repeat the first one with 20 and 30? And do you think it is useful enough to test the copying of 4 blocks and not smaller numbers? > Few more comments: > ------------------------------- > 1. I think we can add some test(s) to test the new functionality, may > be something on the lines of what Robert has originally provided as an > example of this behavior [1]. Maybe the SQL script attached to [3] (which I probably based on Robert's report) can be cleaned up into a regression test. > 3. > GetPageWithFreeSpace(Relation rel, Size spaceNeeded) > { > .. > + if (target_block == InvalidBlockNumber && > + rel->rd_rel->relkind == RELKIND_RELATION) > + { > + nblocks = RelationGetNumberOfBlocks(rel); > + > + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) > + { > + /* > + * If the FSM knows nothing of the rel, try the last page before > + * we give up and extend. This avoids one-tuple-per-page syndrome > + * during bootstrapping or in a recently-started system. > + */ > + target_block = nblocks - 1; > + } > .. > } > > Moving this check inside GetPageWithFreeSpace has one disadvantage, we > will always consider last block which can have some inadvertent > effects. Consider when this function gets called from > RelationGetBufferForTuple just before extension, it can consider to > check for the last block even though that is already being done in the > begining when GetPageWithFreeSpace was called. I am not completely > sure how much this is a case to worry because it will help to check > last block when the same is concurrently added and FSM is not updated > for same. I am slightly worried because the unpatched code doesn't > care for such case and we have no intention to change this behaviour. > What do you think? I see what you mean. If the other backend extended by 1 block, the intention is to keep it out of the FSM at first, and by extension, not visible in other ways. The comment implies that's debatable, but I agree we shouldn't change that without a reason to. 
One simple idea is add a 3rd boolean parameter to GetPageWithFreeSpace() to control whether it gives up if the FSM fork doesn't indicate free space, like if (target_block == InvalidBlockNumber && rel->rd_rel->relkind == RELKIND_RELATION && !check_fsm_only) { nblocks = RelationGetNumberOfBlocks(rel); > 4. You have mentioned above that "system catalogs created during > bootstrap still have a FSM if they have any data." and I can also see > this behavior, have you investigated this point further? Code reading didn't uncover the cause. I might have to step through with a debugger or something similar. I should find time for that next month. > 5. Your logic to update FSM on standby seems okay, but can you show > some tests which proves its sanity? I believe to convince myself it was working, I used the individual commands in the sql file in [3], then used the size function on the secondary. I'll redo that to verify. -- [1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAJVSVGWMXzsqYpPhO3Snz4n5y8Tq-QiviuSCKyB5czCTnq9rzA%40mail.gmail.com [3] https://www.postgresql.org/message-id/CAJVSVGWvB13PzpbLEecFuGFc5V2fsO736BsdTakPiPAcdMM5tQ%40mail.gmail.com -John Naylor
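(To illustrate the proposed flag, here is a sketch of how it could gate the last-block fallback. The helper name is hypothetical and the details are only a reading of the snippets above, not the patch's actual code; a caller that has already tried the last block, such as the re-check just before extension, would pass check_fsm_only = true.)

    #include "postgres.h"
    #include "catalog/pg_class.h"
    #include "storage/block.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    #define HEAP_FSM_CREATION_THRESHOLD 4   /* example value */

    /*
     * Given the block suggested by the FSM fork (or InvalidBlockNumber if it
     * had nothing), decide whether to fall back to the relation's last block.
     * With check_fsm_only = true, the fallback is skipped entirely.
     */
    static BlockNumber
    maybe_try_last_block(Relation rel, BlockNumber fsm_suggestion, bool check_fsm_only)
    {
        if (fsm_suggestion == InvalidBlockNumber &&
            rel->rd_rel->relkind == RELKIND_RELATION &&
            !check_fsm_only)
        {
            BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

            /*
             * If the FSM knows nothing of the rel, try the last page before
             * giving up and extending, to avoid one-tuple-per-page syndrome.
             */
            if (nblocks > HEAP_FSM_CREATION_THRESHOLD)
                return nblocks - 1;
        }

        return fsm_suggestion;
    }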
On Mon, Nov 26, 2018 at 3:46 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 23, 2018 at 11:56 AM John Naylor <jcnaylor@gmail.com> wrote: > >> On 11/16/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> > > >> > 3. > >> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 > >> > slot, > >> > - uint8 newValue, uint8 minValue); > >> > + uint8 newValue, uint8 minValue); > >> > > >> > This appears to be a spurious change. > >> > >> 2 and 3 are separated into 0001. > >> > > > > Is the point 3 change related to pgindent? I think even if you want > > these, then don't prepare other patches on top of this, keep it > > entirely separate. > > There are some places in the codebase that have spaces after tabs for > no apparent reason (not to line up function parameters or pointer > declarations). pgindent hasn't been able to fix this. If you wish, you > are of course free to apply 0001 separately at any time. > I am not sure that I am interested in generally changing the parts of code that are not directly related to this patch, feel free to post them separately. > >> Also, we don't quite have a consensus on the threshold value, but I > >> have set it to 4 pages for v8. If this is still considered too > >> expensive (and basic tests show it shouldn't be), I suspect it'd be > >> better to interleave the available block numbers as described a couple > >> days ago than lower the threshold further. > >> > > > > Can you please repeat the copy test you have done above with > > fillfactor as 20 and 30? > > I did two kinds of tests. The first had a fill-factor of 10 [1], the > second had the default storage, but I prevented the backend from > caching the target block [2], to fully exercise the free space code. > Would you like me to repeat the first one with 20 and 30? > Yes. > And do you > think it is useful enough to test the copying of 4 blocks and not > smaller numbers? > You can try that, but I am mainly interested with 4 as threshold or may be higher to see where we start loosing. I think 4 should be a reasonable default for this patch, if later anyone wants to extend, we might want to provide a table level knob. > > Few more comments: > > ------------------------------- > > 1. I think we can add some test(s) to test the new functionality, may > > be something on the lines of what Robert has originally provided as an > > example of this behavior [1]. > > Maybe the SQL script attached to [3] (which I probably based on > Robert's report) can be cleaned up into a regression test. > Yeah, something on those lines works for me. > > 3. > > GetPageWithFreeSpace(Relation rel, Size spaceNeeded) > > { > > .. > > + if (target_block == InvalidBlockNumber && > > + rel->rd_rel->relkind == RELKIND_RELATION) > > + { > > + nblocks = RelationGetNumberOfBlocks(rel); > > + > > + if (nblocks > HEAP_FSM_CREATION_THRESHOLD) > > + { > > + /* > > + * If the FSM knows nothing of the rel, try the last page before > > + * we give up and extend. This avoids one-tuple-per-page syndrome > > + * during bootstrapping or in a recently-started system. > > + */ > > + target_block = nblocks - 1; > > + } > > .. > > } > > > > Moving this check inside GetPageWithFreeSpace has one disadvantage, we > > will always consider last block which can have some inadvertent > > effects. 
Consider when this function gets called from > > RelationGetBufferForTuple just before extension, it can consider to > > check for the last block even though that is already being done in the > > begining when GetPageWithFreeSpace was called. I am not completely > > sure how much this is a case to worry because it will help to check > > last block when the same is concurrently added and FSM is not updated > > for same. I am slightly worried because the unpatched code doesn't > > care for such case and we have no intention to change this behaviour. > > What do you think? > > I see what you mean. If the other backend extended by 1 block, the > intention is to keep it out of the FSM at first, and by extension, not > visible in other ways. The comment implies that's debatable, but I > agree we shouldn't change that without a reason to. One simple idea is > add a 3rd boolean parameter to GetPageWithFreeSpace() to control > whether it gives up if the FSM fork doesn't indicate free space, like > > if (target_block == InvalidBlockNumber && > rel->rd_rel->relkind == RELKIND_RELATION && > !check_fsm_only) > { > nblocks = RelationGetNumberOfBlocks(rel); > I have the exact fix in my mind, so let's do it that way. > > 4. You have mentioned above that "system catalogs created during > > bootstrap still have a FSM if they have any data." and I can also see > > this behavior, have you investigated this point further? > > Code reading didn't uncover the cause. I might have to step through > with a debugger or something similar. I should find time for that next > month. > Okay. > > 5. Your logic to update FSM on standby seems okay, but can you show > > some tests which proves its sanity? > > I believe to convince myself it was working, I used the individual > commands in the sql file in [3], then used the size function on the > secondary. I'll redo that to verify. > Okay. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> Is the point 3 change related to pgindent? I think even if you want > these, then don't prepare other patches on top of this, keep it > entirely separate. Both removed. >> Also, we don't quite have a consensus on the threshold value, but I >> have set it to 4 pages for v8. If this is still considered too >> expensive (and basic tests show it shouldn't be), I suspect it'd be >> better to interleave the available block numbers as described a couple >> days ago than lower the threshold further. >> > > Can you please repeat the copy test you have done above with > fillfactor as 20 and 30? I will send the results in a separate email soon. > Few more comments: > ------------------------------- > 1. I think we can add some test(s) to test the new functionality, may > be something on the lines of what Robert has originally provided as an > example of this behavior [1]. Done. I tried adding it to several schedules, but for some reason vacuuming an empty table failed to truncate the heap to 0 blocks. Putting the test in its own group fixed the problem, but that doesn't seem ideal. > 2. > The similar call is required in AbortSubTransaction function as well. > I suggest to add it after pgstat_progress_end_command in both > functions. Done. > 3. >> agree we shouldn't change that without a reason to. One simple idea is >> add a 3rd boolean parameter to GetPageWithFreeSpace() to control >> whether it gives up if the FSM fork doesn't indicate free space, like > I have the exact fix in my mind, so let's do it that way. Done. This also reverts comments and variable names that referred to updating the local map after relation extension. While at it, I changed a couple conditionals to check the locally cached nblocks rather than the threshold. No functional change, but looks more precise. Might save a few cycles as well. >> > 5. Your logic to update FSM on standby seems okay, but can you show >> > some tests which proves its sanity? >> >> I believe to convince myself it was working, I used the individual >> commands in the sql file in [3], then used the size function on the >> secondary. I'll redo that to verify. I've verified the standby behaves precisely as the primary, as far as the aforementioned script goes. -John Naylor
Attachment
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > > Few more comments: > > ------------------------------- > > 1. I think we can add some test(s) to test the new functionality, may > > be something on the lines of what Robert has originally provided as an > > example of this behavior [1]. > > Done. I tried adding it to several schedules, but for some reason > vacuuming an empty table failed to truncate the heap to 0 blocks. > Putting the test in its own group fixed the problem, but that doesn't > seem ideal. > It might be because it fails the should_attempt_truncation() check. See below code: if (should_attempt_truncation(vacrelstats)) lazy_truncate_heap(onerel, vacrelstats, vac_strategy); -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: >> Done. I tried adding it to several schedules, but for some reason >> vacuuming an empty table failed to truncate the heap to 0 blocks. >> Putting the test in its own group fixed the problem, but that doesn't >> seem ideal. >> > > It might be because it fails the should_attempt_truncation() check. > See below code: > > if (should_attempt_truncation(vacrelstats)) > lazy_truncate_heap(onerel, vacrelstats, vac_strategy); I see. I think truncating the FSM is not essential to show either the old or new behavior -- I could skip that portion to enable running the test in a parallel group. >> Can you please repeat the copy test you have done above with >> fillfactor as 20 and 30? > > I will send the results in a separate email soon. I ran the attached scripts, which populate 100 tables with either 4 or 8 blocks. The test tables were pre-filled with one tuple and vacuumed so that the FSMs were already created when testing the master branch. The patch branch was compiled with a threshold of 8, but testing inserts of 4 pages effectively simulates a threshold of 4. Config was stock, except for fsync = off. I took the average of 40 runs (2 complete tests of 20 runs each) after removing the 10% highest and lowest:

fillfactor=20
# blocks   master    patch
4          19.1ms    17.5ms
8          33.4ms    30.9ms

fillfactor=30
# blocks   master    patch
4          20.1ms    19.7ms
8          34.7ms    34.9ms

It seems the patch might be a bit faster with fillfactor=20, but I'm at a loss as to why that would be. Previous testing with a higher threshold showed a significant performance penalty starting around 10 blocks [1], but that used truncation rather than deletion, and had a fill-factor of 10. -- [1] https://www.postgresql.org/message-id/CAJVSVGWCRMyi8sSqguf6PfFcpM3hwNY5YhPZTt-8Q3ZGv0UGYw%40mail.gmail.com -John Naylor
Attachment
On Sat, Dec 1, 2018 at 12:42 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/29/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > >> Done. I tried adding it to several schedules, but for some reason > >> vacuuming an empty table failed to truncate the heap to 0 blocks. > >> Putting the test in its own group fixed the problem, but that doesn't > >> seem ideal. > >> > > > > It might be because it fails the should_attempt_truncation() check. > > See below code: > > > > if (should_attempt_truncation(vacrelstats)) > > lazy_truncate_heap(onerel, vacrelstats, vac_strategy); > > I see. I think truncating the FSM is not essential to show either the > old or new behavior -- I could skip that portion to enable running the > test in a parallel group. > > >> Can you please repeat the copy test you have done above with > >> fillfactor as 20 and 30? > > > > I will send the results in a separate email soon. > > I ran the attached scripts which populates 100 tables with either 4 or > 8 blocks. The test tables were pre-filled with one tuple and vacuumed > so that the FSMs were already created when testing the master branch. > The patch branch was compiled with a threshold of 8, but testing > inserts of 4 pages effectively simulates a threshold of 4. Config was > stock, except for fsync = off. I took the average of 40 runs (2 > complete tests of 20 runs each) after removing the 10% highest and > lowest: > > fillfactor=20 > # blocks master patch > 4 19.1ms 17.5ms > 8 33.4ms 30.9ms > > fillfactor=30 > # blocks master patch > 4 20.1ms 19.7ms > 8 34.7ms 34.9ms > > It seems the patch might be a bit faster with fillfactor=20, but I'm > at a loss as to why that would be. > I see that in your previous tests also with patch, the performance was slightly better. One probable reason could be that for small tables the total number of pages accessed via shared buffers is more without the patch (probably 3 FSM pages + 4 relation). With the patch, you need to only access 4 relation pages. The other overhead of patch (retrying each page) seems to be compensated with FSM search. I think you can once check perf profiles to confirm the same. > Previous testing with a higher > threshold showed a significant performance penalty starting around 10 > blocks [1], but that used truncation rather than deletion, and had a > fill-factor of 10. > Can you check whether the number of pages after test are the same with and without a patch in this setup? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/1/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > Can you check whether the number of pages after test are the same with > and without a patch in this setup? I did verify that the number of pages was as intended. -John Naylor
On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > - * Copy/link any fsm and vm files, if they exist + * Copy/link any fsm and vm files, if they exist and if they would + * be created in the new cluster. */ - transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); + if (maps[mapnum].relkind != RELKIND_RELATION || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) + (void) transfer_relfile (&maps[mapnum], "_fsm", vm_must_add_frozenbit); > During pg_upgrade, skip transfer of FSMs if they wouldn't have been created on the new cluster. I think in some cases, it won't be consistent with HEAD's behavior. After truncate, we leave the FSM as it is, so the case where before upgrade the relation was truncated, we won't create the FSM in new cluster and that will be inconsistent with the behavior of HEAD. I think similar anomaly will be there when we delete rows from the table such that after deletion size of relation becomes smaller than HEAP_FSM_CREATION_THRESHOLD. I am not sure if it is a good idea to *not* transfer FSM files during upgrade unless we ensure that we remove FSM whenever the relation size falls below HEAP_FSM_CREATION_THRESHOLD. What do you think? BTW, what is your reasoning for not removing FSM on truncate? Anybody else has an opinion on this matter? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > > > v8 code: +fsm_local_set(Relation rel, BlockNumber new_nblocks) +{ + BlockNumber blkno, + cached_target_block; + + /* + * Mark blocks available starting after the last block number we have + * cached, and ending at the current last block in the relation. + * When we first set the map, this will flag all blocks as available + * to try. If we reset the map while waiting for a relation + * extension lock, this will only flag new blocks as available, + * if any were created by another backend. + */ + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; v9 code: +static void +fsm_local_set(Relation rel, BlockNumber nblocks) +{ + BlockNumber blkno, + cached_target_block; + + for (blkno = 0; blkno < nblocks; blkno++) + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; What is the reason for the above code change in the latest patch version? It would be good if you add few comments atop functions GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and RecordPageWithFreeSpace about their interaction with local map. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: >> During pg_upgrade, skip transfer of FSMs if they wouldn't have been >> created on the new cluster. > > I think in some cases, it won't be consistent with HEAD's behavior. > After truncate, we leave the FSM as it is, so the case where before > upgrade the relation was truncated, we won't create the FSM in new > cluster and that will be inconsistent with the behavior of HEAD. To be precise, with the TRUNCATE statement, the FSM (everything but the main relation fork, I think) is deleted, but using DELETE to remove all rows from the table will preserve the forks. In the latter case, when VACUUM truncates the FSM, it removes all leaf pages leaving behind the root page and one mid-level page. I haven't changed this behavior. > I think similar anomaly will be there when we delete rows from the table > such that after deletion size of relation becomes smaller than > HEAP_FSM_CREATION_THRESHOLD. Yes, in that case there will be inconsistency, but I'm comfortable with it. Others may not be. > I am not sure if it is a good idea to *not* transfer FSM files during > upgrade unless we ensure that we remove FSM whenever the relation size > falls below HEAP_FSM_CREATION_THRESHOLD. What do you think? BTW, > what is your reasoning for not removing FSM on truncate? My reasoning is that if we ever went past the threshold, it's likely we'll do so again, so I didn't feel it was worth the extra code complexity to remove the FSM. In the pg_upgrade case, however, it is simple to not copy the FSM. -John Naylor
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: >> > >> > v8 code: > +fsm_local_set(Relation rel, BlockNumber new_nblocks) > +{ > + BlockNumber blkno, > + cached_target_block; > + > + /* > + * Mark blocks available starting after the last block number we have > + * cached, and ending at the current last block in the relation. > + * When we first set the map, this will flag all blocks as available > + * to try. If we reset the map while waiting for a relation > + * extension lock, this will only flag new blocks as available, > + * if any were created by another backend. > + */ > + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > v9 code: > +static void > +fsm_local_set(Relation rel, BlockNumber nblocks) > +{ > + BlockNumber blkno, > + cached_target_block; > + > + for (blkno = 0; blkno < nblocks; blkno++) > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > What is the reason for the above code change in the latest patch version? Per your recent comment, we no longer check relation size if we waited on a relation extension lock, so this is essentially a reversion to an earlier version. Keeping v8 would have the advantage that it'd be simple to change our minds about this. Do you have an opinion about that? > It would be good if you add few comments atop functions > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > RecordPageWithFreeSpace about their interaction with local map. Good idea, will do. -John Naylor
On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Dec 3, 2018 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > >> > >> On Thu, Nov 29, 2018 at 3:07 PM John Naylor <jcnaylor@gmail.com> wrote: > >> > > >> > > v8 code: > > +fsm_local_set(Relation rel, BlockNumber new_nblocks) > > +{ > > + BlockNumber blkno, > > + cached_target_block; > > + > > + /* > > + * Mark blocks available starting after the last block number we have > > + * cached, and ending at the current last block in the relation. > > + * When we first set the map, this will flag all blocks as available > > + * to try. If we reset the map while waiting for a relation > > + * extension lock, this will only flag new blocks as available, > > + * if any were created by another backend. > > + */ > > + for (blkno = fsm_local_map.nblocks; blkno < new_nblocks; blkno++) > > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > > > v9 code: > > +static void > > +fsm_local_set(Relation rel, BlockNumber nblocks) > > +{ > > + BlockNumber blkno, > > + cached_target_block; > > + > > + for (blkno = 0; blkno < nblocks; blkno++) > > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > > > > What is the reason for the above code change in the latest patch version? > > Per your recent comment, we no longer check relation size if we waited > on a relation extension lock, so this is essentially a reversion to an > earlier version. > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and GetPageWithFreeSpace whereas the change we have discussed was specific to GetPageWithFreeSpace, so not sure if we need any change in fsm_local_set. > Keeping v8 would have the advantage that it'd be > simple to change our minds about this. Do you have an opinion about > that? > > > It would be good if you add few comments atop functions > > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > > RecordPageWithFreeSpace about their interaction with local map. > > Good idea, will do. > Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Dec 3, 2018 at 11:15 AM John Naylor <jcnaylor@gmail.com> wrote: >> Per your recent comment, we no longer check relation size if we waited >> on a relation extension lock, so this is essentially a reversion to an >> earlier version. >> > > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and > GetPageWithFreeSpace whereas the change we have discussed was specific > to GetPageWithFreeSpace, so not sure if we need any change in > fsm_local_set. Not needed, but I assumed wrongly you'd think it unclear otherwise. I've now restored the generality and updated the comments to be closer to v8. > It would be good if you add few comments atop functions > GetPageWithFreeSpace, RecordAndGetPageWithFreeSpace and > RecordPageWithFreeSpace about their interaction with local map. Done. Also additional minor comment editing. I've added an additional regression test for finding the right block and removed a test I thought was redundant. I've kept the test file in its own schedule. -John Naylor
Attachment
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > > I've added an additional regression test for finding the right block > and removed a test I thought was redundant. I've kept the test file in > its own schedule. > +# ---------- +# fsm does a vacuum, and running it in parallel seems to prevent heap truncation. +# ---------- +test: fsm + It is not clear to me from the comment why running it in parallel prevents heap truncation, can you explain what behavior are you seeing and what makes you think that running it in parallel caused it? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: >> >> I've added an additional regression test for finding the right block >> and removed a test I thought was redundant. I've kept the test file in >> its own schedule. >> > > +# ---------- > +# fsm does a vacuum, and running it in parallel seems to prevent heap > truncation. > +# ---------- > +test: fsm > + > > It is not clear to me from the comment why running it in parallel > prevents heap truncation, can you explain what behavior are you seeing > and what makes you think that running it in parallel caused it? One of the tests deletes all records from the relation and vacuums. In the serial schedule, the heap and FSM are truncated; in the parallel schedule they are not. Make check fails, since the tests measure relation size. Taking a closer look, I'm even more alarmed to discover that vacuum doesn't even seem to remove deleted rows in the parallel schedule (that was in the last test I added), which makes no sense and causes that test to fail. I looked in vacuum.sql for possible clues, but didn't see any. I'll have to investigate further. -John Naylor
On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > >> > >> I've added an additional regression test for finding the right block > >> and removed a test I thought was redundant. I've kept the test file in > >> its own schedule. > >> > > > > +# ---------- > > +# fsm does a vacuum, and running it in parallel seems to prevent heap > > truncation. > > +# ---------- > > +test: fsm > > + > > > > It is not clear to me from the comment why running it in parallel > > prevents heap truncation, can you explain what behavior are you seeing > > and what makes you think that running it in parallel caused it? > > One of the tests deletes all records from the relation and vacuums. In > serial schedule, the heap and FSM are truncated; in parallel they are > not. Make check fails, since since the tests measure relation size. > Taking a closer look, I'm even more alarmed to discover that vacuum > doesn't even seem to remove deleted rows in parallel schedule (that > was in the last test I added), which makes no sense and causes that > test to fail. I looked in vacuum.sql for possible clues, but didn't > see any. > I couldn't resist the temptation to figure out what's going on here. The newly added tests have deletes followed by vacuum and then you check whether the vacuum has removed the data by checking heap and or FSM size. Now, when you run such a test in parallel, the vacuum can sometimes skip removing the rows because there are parallel transactions open which can see the deleted rows. You can easily verify this phenomenon by running the newly added tests in one session in psql when there is another parallel session which has an open transaction. For example: Session-1 Begin; Insert into foo values(1); Session-2 \i fsm.sql Now, you should see the results similar to what you are seeing when you ran the fsm test by adding it to one of the parallel group. Can you test this at your end and confirm whether my analysis is correct or not. So, you can keep the test as you have in parallel_schedule, but comment needs to be changed. Also, you need to add the new test in serial_schedule. I have done both the changes in the attached patch, kindly confirm if this looks correct to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On 12/8/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote: > I couldn't resist the temptation to figure out what's going on here. > The newly added tests have deletes followed by vacuum and then you > check whether the vacuum has removed the data by checking heap and or > FSM size. Now, when you run such a test in parallel, the vacuum can > sometimes skip removing the rows because there are parallel > transactions open which can see the deleted rows. Ah yes, of course. > You can easily > verify this phenomenon by running the newly added tests in one session > in psql when there is another parallel session which has an open > transaction. For example: > > Session-1 > Begin; > Insert into foo values(1); > > Session-2 > \i fsm.sql > > Now, you should see the results similar to what you are seeing when > you ran the fsm test by adding it to one of the parallel group. Can > you test this at your end and confirm whether my analysis is correct > or not. Yes, I see the same behavior. > So, you can keep the test as you have in parallel_schedule, but > comment needs to be changed. Also, you need to add the new test in > serial_schedule. I have done both the changes in the attached patch, > kindly confirm if this looks correct to you. Looks good to me. I'll just note that the new line in the serial schedule has an extra space at the end. Thanks for looking into this. -John Naylor
On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > 4. You have mentioned above that "system catalogs created during > bootstrap still have a FSM if they have any data." and I can also see > this behavior, have you investigated this point further? I found the cause of this. There is some special-case code in md.c to create any file if it's opened in bootstrap mode. I removed this and a similar special case (attached), and make check still passes. After digging through the history, I'm guessing this has been useless code since about 2001, when certain special catalogs were removed. -John Naylor
Attachment
On Thu, Dec 13, 2018 at 3:18 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 11/24/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > 4. You have mentioned above that "system catalogs created during > > bootstrap still have a FSM if they have any data." and I can also see > > this behavior, have you investigated this point further? > > I found the cause of this. There is some special-case code in md.c to > create any file if it's opened in bootstrap mode. I removed this and a > similar special case (attached), and make check still passes. After > digging through the history, I'm guessing this has been useless code > since about 2001, when certain special catalogs were removed. > Good finding, but I think it is better to discuss this part separately. I have started a new thread for this issue [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1KsET6sotf%2BrzOTQfb83pzVEzVhbQi1nxGFYVstVWXUGw%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Dec 8, 2018 at 6:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Dec 7, 2018 at 7:25 PM John Naylor <jcnaylor@gmail.com> wrote:
> >
> > On 12/6/18, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote:
> > >>
> > >> I've added an additional regression test for finding the right block
I ran some performance tests on the latest patch v11, and I see a small regression in the execution time of the COPY statement. The tests I used are the same as provided in [1], except that I ran them for fill factors 20 and 70. Here are my results.
Machine: cthulhu (Intel-based 8-NUMA machine)
Server settings are default, configured with HEAP_FSM_CREATION_THRESHOLD = 4.
The entire data directory was on HDD.
Results are the execution time (in ms) of the COPY statement when the number of records is exactly the number that fits in HEAP_FSM_CREATION_THRESHOLD = 4 pages. For fill factor 20 that is up to tid (3, 43), and for fill factor 70 up to tid (3, 157). Each result is the median of 10 runs.
Fill factor 20
Tables   Base (ms)   Patch (ms)   % increase in execution time
500      121.97      125.315      2.7424776584
1000     246.592     253.789      2.9185861666

Fill factor 70
Tables   Base (ms)   Patch (ms)   % increase in execution time
500      211.502     217.128      2.6600221275
1000     420.309     432.606      2.9257046601

So there is a consistent 2-3% regression, and on every run the execution time with patch v11 is slightly higher than base. I also tried inserting more records, up to 8 pages, and the same regression is observed. So I guess even HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect.
[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com
--
Thanks and Regards
Mithun Chicklore Yogendra
EnterpriseDB: http://www.enterprisedb.com
On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Results are execution time(unit ms) taken by copy statement when number of > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > till tid (3, 157). Result is taken as a median of 10 runs. > So 2-3% consistent regression, And on every run I can see for patch v11 > execution time is slightly more than base. Thanks for testing! > I also tried to insert more > records till 8 pages and same regression is observed! So I guess even > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! That's curious, because once the table exceeds the threshold, it would be allowed to update the FSM, and in the process write 3 pages that it didn't have to in the 4 page test. The master branch has the FSM already, so I would expect the 8 page case to regress more. What I can do later is provide a supplementary patch to go on top of mine that only checks the last block. If that improves performance, I'll alter my patch to only check every other page. -John Naylor
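(A sketch of the "check every other page" idea, reusing the hypothetical fsm_local_map structure from the earlier two-state sketch: only blocks sharing the parity of the last block are flagged, which yields the A / NA / ANA / NANA pattern from the earlier diagram, so a block skipped at one size becomes a candidate after the next extension. Illustrative only, not a proposed patch.)

    /*
     * Flag only every other block, anchored on the last block, so that the
     * number of candidate pages to read and pin is roughly halved while
     * bloat in skipped blocks still becomes visible as the relation grows.
     */
    static void
    local_map_set_interleaved(BlockNumber nblocks)
    {
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            if (blkno % 2 == (nblocks - 1) % 2)
                fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
            else
                fsm_local_map.map[blkno] = FSM_LOCAL_NOT_AVAIL;
        }

        fsm_local_map.nblocks = nblocks;
    }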
On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > Results are execution time(unit ms) taken by copy statement when number of > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > till tid (3, 157). Result is taken as a median of 10 runs. > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > execution time is slightly more than base. > Have you by any chance checked at scale factor 80 or 100? > Thanks for testing! > > > I also tried to insert more > > records till 8 pages and same regression is observed! So I guess even > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > That's curious, because once the table exceeds the threshold, it would > be allowed to update the FSM, and in the process write 3 pages that it > didn't have to in the 4 page test. The master branch has the FSM > already, so I would expect the 8 page case to regress more. > It is not clear to me why you think there should be regression at 8 pages when HEAP_FSM_CREATION_THRESHOLD is 4. Basically, once FSM starts getting updated, we should be same as HEAD as it won't take any special path? > What I can do later is provide a supplementary patch to go on top of > mine that only checks the last block. If that improves performance, > I'll alter my patch to only check every other page. > Sure, but I guess first we should try to see what is exactly slowing down via perf report. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 6, 2018 at 10:53 PM John Naylor <jcnaylor@gmail.com> wrote: > On 12/3/18, Amit Kapila <amit.kapila16@gmail.com> wrote: > > fsm_local_set is being called from RecordAndGetPageWithFreeSpace and > > GetPageWithFreeSpace whereas the change we have discussed was specific > > to GetPageWithFreeSpace, so not sure if we need any change in > > fsm_local_set. I have some minor comments for pg_upgrade patch 1. Now we call stat main fork file in transfer_relfile() + sret = stat(old_file, &statbuf); + /* Save the size of the first segment of the main fork. */ + if (type_suffix[0] == '\0' && segno == 0) + first_seg_size = statbuf.st_size; But we do not handle the case if stat has returned any error! 2. src/bin/pg_upgrade/pg_upgrade.h char *relname; + + char relkind; /* relation relkind -- see pg_class.h */ I think we can remove the added empty line. -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
Thanks, On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > Results are execution time(unit ms) taken by copy statement when number of > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > till tid (3, 157). Result is taken as a median of 10 runs. > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > execution time is slightly more than base. > > Thanks for testing! > > > I also tried to insert more > > records till 8 pages and same regression is observed! So I guess even > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > That's curious, because once the table exceeds the threshold, it would > be allowed to update the FSM, and in the process write 3 pages that it > didn't have to in the 4 page test. The master branch has the FSM > already, so I would expect the 8 page case to regress more. I tested with configuration HEAP_FSM_CREATION_THRESHOLD = 4 and just tried to insert up to 8 blocks to see if the regression carries over to further inserts. > What I can do later is provide a supplementary patch to go on top of > mine that only checks the last block. If that improves performance, > I'll alter my patch to only check every other page. Running callgrind for the same test shows the stats below:

Before patch
==========
Number of calls   function_name
2000              heap_multi_insert
2000              RelationGetBufferForTuple
3500              ReadBufferBI

After patch
=========
Number of calls   function_name
2000              heap_multi_insert
2000              RelationGetBufferForTuple
5000              ReadBufferBI

I guess the increase in ReadBufferBI() calls might be what is causing the regression. Sorry, I have not investigated it. I will check the same with your next patch! -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Jan 4, 2019 at 8:23 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > Thanks, > > On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > On 12/29/18, Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > > Results are execution time(unit ms) taken by copy statement when number of > > > records equal to exact number which fit HEAP_FSM_CREATION_THRESHOLD = 4 > > > pages. For fill factor 20 it is till tid (3, 43) and for scale factor 70 > > > till tid (3, 157). Result is taken as a median of 10 runs. > > > > > So 2-3% consistent regression, And on every run I can see for patch v11 > > > execution time is slightly more than base. > > > > Thanks for testing! > > > > > I also tried to insert more > > > records till 8 pages and same regression is observed! So I guess even > > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > > > That's curious, because once the table exceeds the threshold, it would > > be allowed to update the FSM, and in the process write 3 pages that it > > didn't have to in the 4 page test. The master branch has the FSM > > already, so I would expect the 8 page case to regress more. > > I tested with configuration HEAP_FSM_CREATION_THRESHOLD = 4 and just > tried to insert till 8 blocks to see if regression is carried on with > further inserts. > > > What I can do later is provide a supplementary patch to go on top of > > mine that only checks the last block. If that improves performance, > > I'll alter my patch to only check every other page. > > Running callgrind for same test shows below stats > Before patch > ========== > Number of calls function_name > 2000 heap_multi_insert > 2000 RelationGetBufferForTuple > 3500 ReadBufferBI > > After Patch > ========= > Number of calls function_name > 2000 heap_multi_insert > 2000 RelationGetBufferForTuple > 5000 ReadBufferBI > > I guess Increase in ReadBufferBI() calls might be the reason which is > causing regression. Sorry I have not investigated it. > I think the reason is that we are checking each block when blocks are less than HEAP_FSM_CREATION_THRESHOLD. Even though all the blocks are in memory, there is some cost to check them all. OTOH, without the patch, even if it accesses FSM, it won't have to make so many in-memory reads for blocks. BTW, have you check for scale_factor 80 or 100 as suggested last time? > I will check > same with your next patch! > Yeah, that makes sense, John, can you provide a patch on top of the current patch where we check either the last block or every other block. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
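To make the cost comparison concrete, here is a small standalone C model (illustrative only, not PostgreSQL source; the helper names and free-space numbers are invented, and only the threshold value comes from this thread). Without an FSM, every candidate block has to be fetched and inspected, so the number of page reads grows with the number of heap blocks below the threshold, which is consistent with the extra ReadBufferBI() calls in the callgrind output; an FSM lookup answers the same question from a small, fixed number of FSM pages.

/*
 * Toy model: count the page reads needed to find a block with enough
 * free space when there is no FSM and every block is probed in turn.
 * read_block_free_space() stands in for a buffer access such as
 * ReadBufferBI(); the free-space values are made up.
 */
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4	/* threshold value from this thread */

static int	page_reads = 0;

static int
read_block_free_space(const int *free_space, int blkno)
{
	page_reads++;				/* each probe costs one page read */
	return free_space[blkno];
}

static int
find_block_without_fsm(const int *free_space, int nblocks, int needed)
{
	int			blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		if (read_block_free_space(free_space, blkno) >= needed)
			return blkno;
	}
	return -1;					/* caller would extend the relation */
}

int
main(void)
{
	/* pretend only the last of four blocks still has room */
	int			free_space[HEAP_FSM_CREATION_THRESHOLD] = {0, 0, 0, 64};
	int			blkno = find_block_without_fsm(free_space,
											   HEAP_FSM_CREATION_THRESHOLD,
											   32);

	printf("found block %d after %d page reads\n", blkno, page_reads);
	return 0;
}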
On 1/3/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > Yeah, that makes sense, John, can you provide a patch on top of the > current patch where we check either the last block or every other > block. I've attached two patches for testing. Each one applies on top of the current patch. Mithun, I'll respond to your other review comments later this week. -John Naylor
Attachment
On Sun, Dec 30, 2018 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sun, Dec 30, 2018 at 3:49 AM John Naylor <jcnaylor@gmail.com> wrote: > > > I also tried to insert more > > > records till 8 pages and same regression is observed! So I guess even > > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect! > > > > That's curious, because once the table exceeds the threshold, it would > > be allowed to update the FSM, and in the process write 3 pages that it > > didn't have to in the 4 page test. The master branch has the FSM > > already, so I would expect the 8 page case to regress more. > > > > It is not clear to me why you think there should be regression at 8 > pages when HEAP_FSM_CREATION_THRESHOLD is 4. Basically, once FSM > starts getting updated, we should be same as HEAD as it won't take any > special path? In this particular test, the FSM is already created ahead of time for the master branch, so we can compare accessing FSM versus checking every page. My reasoning is that passing the threshold would take some time to create 3 FSM pages with the patch, leading to a larger regression. It seems we don't observe this, however. On Sun, Dec 30, 2018 at 10:59 PM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > I have some minor comments for pg_upgrade patch > 1. Now we call stat main fork file in transfer_relfile() > + sret = stat(old_file, &statbuf); > > + /* Save the size of the first segment of the main fork. */ > + if (type_suffix[0] == '\0' && segno == 0) > + first_seg_size = statbuf.st_size; > > But we do not handle the case if stat has returned any error! How about this: /* Did file open fail? */ if (stat(old_file, &statbuf) != 0) { /* Extent, fsm, or vm does not exist? That's OK, just return */ if (errno == ENOENT && (type_suffix[0] != '\0' || segno != 0)) return first_seg_size; else pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\" to \"%s\"): %s\n", map->nspname, map->relname, old_file, new_file, strerror(errno)); } /* Save the size of the first segment of the main fork. */ else if (type_suffix[0] == '\0' && segno == 0) first_seg_size = statbuf.st_size; /* If extent, fsm, or vm is empty, just return */ else if (statbuf.st_size == 0) return first_seg_size; > 2. src/bin/pg_upgrade/pg_upgrade.h > > char *relname; > + > + char relkind; /* relation relkind -- see pg_class.h */ > > I think we can remove the added empty line. In the full context: - /* the rest are used only for logging and error reporting */ + + /* These are used only for logging and error reporting. */ char *nspname; /* namespaces */ char *relname; + + char relkind; /* relation relkind -- see pg_class.h */ Relkind is not used for logging or error reporting, so the space sets it apart from the previous members. I could instead put relkind before those other two... -John Naylor
Hi John Naylor, On Tue, Jan 8, 2019 at 2:27 AM John Naylor <jcnaylor@gmail.com> wrote: > I've attached two patches for testing. Each one applies on top of the > current patch. Thanks for the patches. I did a quick test of both, using the same tests as in [1], now for fill factors 20, 70, and 100. (Note: with HEAP_FSM_CREATION_THRESHOLD = 4, the highest tid inserted was (3,43) for fill factor 20, (3,157) for fill factor 70, and (3,225) for fill factor 100, so exactly 4 pages are used.) Machine: cthulhu, same as before [2], with default server settings. Test: the COPY command as in [1], for 500 tables.
Fill factor 20 -- execution time in ms, % increase in execution time
Base 119.238
v11-all-pages 121.974 2.2945705228
v11-Every-other-page 114.455 -4.0113051209
v11-last-page 113.573 -4.7510021973
Fill factor 70 -- execution time in ms, % increase in execution time
Base 209.991
v11-all-pages 211.076 0.5166888105
v11-Every-other-page 206.476 -1.6738812616
v11-last-page 203.591 -3.0477496655
Fill factor 100 -- execution time in ms, % increase in execution time
Base 269.691
v11-all-pages 270.078 0.1434975583
v11-Every-other-page 262.691 -2.5955630703
v11-last-page 260.293 -3.4847288193
Observations:
1. Execution time for both base and the v11-all-pages patch has improved over my earlier results [2], but v11-all-pages is still slightly behind base.
2. The v11-Every-other-page and v11-last-page patches improve performance over base.
3. IMHO v11-Every-other-page would be the ideal choice: it improves performance and, to an extent, also avoids relation extension when space is already available.
[1] https://www.postgresql.org/message-id/CAJVSVGX%3D2Q52fwijD9cjeq1UdiYGXns2_9WAPFf%3DE8cwbFCDvQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAD__Ouj%3Dat4hy2wYidK90v92qSRLjU%2BQe4y-PwfjLLeGkhc6ZA%40mail.gmail.com
-- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > 2. v11-Every-other-page and v11-last-page patches improve the > performance from base. > 3. IMHO v11-Every-other-page would be ideal to consider it improves > the performance and also to an extent avoid expansion if space is > already available. Good to hear. I'll clean up the every-other-page patch and include it in my next version. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 9, 2019 at 9:00 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 9, 2019 at 7:33 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > 2. v11-Every-other-page and v11-last-page patches improve the > > performance from base. > > 3. IMHO v11-Every-other-page would be ideal to consider it improves > > the performance and also to an extent avoid expansion if space is > > already available. > Thanks, Mithun for performance testing, it really helps us to choose the right strategy here. Once John provides next version, it would be good to see the results of regular pgbench (read-write) runs (say at 50 and 300 scale factor) and the results of large copy. I don't think there will be any problem, but we should just double check that. > Good to hear. I'll clean up the every-other-page patch and include it > in my next version. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Thanks, Mithun for performance testing, it really helps us to choose > the right strategy here. Once John provides next version, it would be > good to see the results of regular pgbench (read-write) runs (say at > 50 and 300 scale factor) and the results of large copy. I don't think > there will be any problem, but we should just double check that. Attached is v12 using the alternating-page strategy. I've updated the comments and README as needed. In addition, I've -handled a possible stat() call failure during pg_upgrade -added one more assertion -moved the new README material into a separate paragraph -added a comment to FSMClearLocalMap() about transaction abort -corrected an outdated comment that erroneously referred to extension rather than creation -fleshed out the draft commit messages
Attachment
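As a companion to the description above, here is a minimal standalone sketch of the alternating-page idea (a toy model, not the patch itself; the map array and flag names merely echo names that appear later in this thread). Starting from the last block and stepping back two blocks at a time, every other page is flagged as a candidate to try before the heap is extended.

/*
 * Toy model of an "every other page" local map for a heap at or below
 * HEAP_FSM_CREATION_THRESHOLD blocks.  Illustrative only.
 */
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define FSM_LOCAL_NOT_AVAIL 0
#define FSM_LOCAL_AVAIL		1

static unsigned char local_map[HEAP_FSM_CREATION_THRESHOLD];

/* Flag every other block, starting from the last one, as available to try. */
static void
local_map_set(int cur_nblocks)
{
	int			blkno = cur_nblocks - 1;

	for (;;)
	{
		local_map[blkno] = FSM_LOCAL_AVAIL;
		if (blkno >= 2)
			blkno -= 2;
		else
			break;
	}
}

int
main(void)
{
	int			nblocks;

	for (nblocks = 1; nblocks <= HEAP_FSM_CREATION_THRESHOLD; nblocks++)
	{
		int			blkno;

		for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
			local_map[blkno] = FSM_LOCAL_NOT_AVAIL;
		local_map_set(nblocks);

		printf("%d block(s): ", nblocks);
		for (blkno = 0; blkno < nblocks; blkno++)
			putchar(local_map[blkno] == FSM_LOCAL_AVAIL ? 'A' : 'N');
		putchar('\n');
	}
	return 0;
}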
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Thanks, Mithun for performance testing, it really helps us to choose
> > the right strategy here. Once John provides next version, it would be
> > good to see the results of regular pgbench (read-write) runs (say at
> > 50 and 300 scale factor) and the results of large copy. I don't think
> > there will be any problem, but we should just double check that.
>
> Attached is v12 using the alternating-page strategy. I've updated the
> comments and README as needed. In addition, I've
Below are my performance tests and numbers
Machine : cthulhu
Tests and setups
Server settings:
max_connections = 200
shared_buffers=8GB
checkpoint_timeout =15min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
min_wal_size=15GB and max_wal_size=20GB.
pgbench settings:
-----------------------
read-write settings (TPCB like tests)
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared postgres
scale factor 50 -- median of 3 TPS
clients v12-patch base patch % diff
1 826.081588 834.328238 -0.9884179421
16 10805.807081 10800.662805 0.0476292621
32 19722.277019 19641.546628 0.4110185034
64 30232.681889 30263.616073 -0.1022157561
scale factor 300 -- median of 3 TPS
clients v12-patch base patch % diff
1 813.646062 822.18648 -1.038744641
16 11379.028702 11277.05586 0.9042505709
32 21688.084093 21613.044463 0.3471960192
64 36288.85711 36348.6178 -0.1644098005
Copy command
Test: setup
./psql -d postgres -c "COPY pgbench_accounts TO '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv"
./psql -d postgres -c "CREATE UNLOGGED TABLE pgbench_accounts_ulg (LIKE pgbench_accounts) WITH (fillfactor = 100);"
Test run:
TRUNCATE TABLE pgbench_accounts_ulg;
\timing
COPY pgbench_accounts_ulg FROM '/mnt/data-mag/mithun.cy/fsmbin/bin/dump.out' WITH csv;
\timing
execution time in ms. (scale factor indicates size of pgbench_accounts)
scale factor v12-patch base patch % diff
300 77166.407 77862.041 -0.8934186557
50 13329.233 13284.583 0.3361038882
So the large-table tests do not show any considerable performance variance from the base code!
On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thanks, Mithun for performance testing, it really helps us to choose > > the right strategy here. Once John provides next version, it would be > > good to see the results of regular pgbench (read-write) runs (say at > > 50 and 300 scale factor) and the results of large copy. I don't think > > there will be any problem, but we should just double check that. > > Attached is v12 using the alternating-page strategy. I've updated the > comments and README as needed. In addition, I've > Few comments: --------------------------- 1. Commit message: > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. I think the free space will be available after the next pass of vacuum, no? How can relation extension make it available? 2. +2. For very small heap relations, the FSM would be relatively large and +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space +and to improve performance. To locate free space in this case, we simply +iterate over the heap, trying alternating pages in turn. There may be some +wasted free space in this case, but it becomes visible again upon next +relation extension. a. Again, how space becomes available at next relation extension. b. I think there is no use of mentioning the version number in the above comment, this code will be present from PG-12, so one can find out from which version this optimization is added. 3. BlockNumber RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage, Size oldSpaceAvail, Size spaceNeeded) { .. + /* First try the local map, if it exists. */ + if (oldPage < fsm_local_map.nblocks) + { .. } The comment doesn't appear to be completely in sync with the code. Can't we just check whether "fsm_local_map.nblocks > 0", if so, we can use a macro for the same? I have changed this in the attached patch, see what you think about it. I have used it at a few other places as well. 4. + * When we initialize the map, the whole heap is potentially available to + * try. If a caller wanted to reset the map after another backend extends + * the relation, this will only flag new blocks as available. No callers + * do this currently, however. + */ +static void +fsm_local_set(Relation rel, BlockNumber curr_nblocks) { .. + if (blkno >= fsm_local_map.nblocks + 2) .. } The way you have tried to support the case as quoted in the comment "If a caller wanted to reset the map after another backend extends .." doesn't appear to be solid and I am not sure if it is correct either. We don't have any way to test the same, so I suggest let's try to simplify the case w.r.t current requirement of this API. I think we should some simple logic to try every other block like: + blkno = cur_nblocks - 1; + while (true) + { + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; + if (blkno >= 2) + blkno -= 2; + else + break; + } I have changed this in the attached patch. 5. +/* + * Search the local map for an available block to try, in descending order. + * + * For use when there is no FSM. + */ +static BlockNumber +fsm_local_search(void) We should give a brief explanation as to why we try in descending order. I have added some explanation in the attached patch, see what you think about it? Apart from the above, I have modified a few comments. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 16, 2019 at 9:25 AM Mithun Cy <mithun.cy@enterprisedb.com> wrote: > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Wed, Jan 9, 2019 at 10:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Thanks, Mithun for performance testing, it really helps us to choose > > > the right strategy here. Once John provides next version, it would be > > > good to see the results of regular pgbench (read-write) runs (say at > > > 50 and 300 scale factor) and the results of large copy. I don't think > > > there will be any problem, but we should just double check that. > > > > Attached is v12 using the alternating-page strategy. I've updated the > > comments and README as needed. In addition, I've > > > execution time in ms. (scale factor indicates size of pgbench_accounts) > scale factor v12-patch base patch % diff > 300 77166.407 77862.041 -0.8934186557 > 50 13329.233 13284.583 0.3361038882 > > So for large table tests do not show any considerable performance variance from base code! > I think with these results, we can conclude this patch doesn't seem to have any noticeable regression for all the tests we have done, right? Thanks a lot for doing various performance tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > 1. > Commit message: > > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. > > I think the free space will be available after the next pass of > vacuum, no? How can relation extension make it available? To explain, this diagram shows the map as it looks for different small table sizes: 0123 A NA ANA NANA So for a 3-block table, the alternating strategy never checks block 1. Any free space block 1 has acquired via delete-and-vacuum will become visible if it extends to 4 blocks. We are accepting a small amount of bloat for improved performance, as discussed. Would it help to include this diagram in the README? > 2. > +2. For very small heap relations, the FSM would be relatively large and > +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for > +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space > +and to improve performance. To locate free space in this case, we simply > +iterate over the heap, trying alternating pages in turn. There may be some > +wasted free space in this case, but it becomes visible again upon next > +relation extension. > > a. Again, how space becomes available at next relation extension. > b. I think there is no use of mentioning the version number in the > above comment, this code will be present from PG-12, so one can find > out from which version this optimization is added. It fits with the reference to PG 8.4 earlier in the document. I chose to be consistent, but to be honest, I'm not much in favor of a lot of version references in code/READMEs. > 3. > BlockNumber > RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage, > Size oldSpaceAvail, Size spaceNeeded) > { > .. > + /* First try the local map, if it exists. */ > + if (oldPage < fsm_local_map.nblocks) > + { > .. > } > > The comment doesn't appear to be completely in sync with the code. > Can't we just check whether "fsm_local_map.nblocks > 0", if so, we > can use a macro for the same? I have changed this in the attached > patch, see what you think about it. I have used it at a few other > places as well. The macro adds clarity, so I'm in favor of using it. > 4. > + * When we initialize the map, the whole heap is potentially available to > + * try. If a caller wanted to reset the map after another backend extends > + * the relation, this will only flag new blocks as available. No callers > + * do this currently, however. > + */ > +static void > +fsm_local_set(Relation rel, BlockNumber curr_nblocks) > { > .. > + if (blkno >= fsm_local_map.nblocks + 2) > .. > } > > > The way you have tried to support the case as quoted in the comment > "If a caller wanted to reset the map after another backend extends .." > doesn't appear to be solid and I am not sure if it is correct either. I removed this case in v9 and you objected to that as unnecessary, so I reverted it for v10. > We don't have any way to test the same, so I suggest let's try to > simplify the case w.r.t current requirement of this API. I think we > should > some simple logic to try every other block like: > > + blkno = cur_nblocks - 1; > + while (true) > + { > + fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > + if (blkno >= 2) > + blkno -= 2; > + else > + break; > + } > > I have changed this in the attached patch. Fine by me. > 5. 
> +/* > + * Search the local map for an available block to try, in descending order. > + * > + * For use when there is no FSM. > + */ > +static BlockNumber > +fsm_local_search(void) > > We should give a brief explanation as to why we try in descending > order. I have added some explanation in the attached patch, see what > you think about it? > > Apart from the above, I have modified a few comments. I'll include these with some grammar corrections in the next version. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
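A minimal standalone model of the descending search over such a map (illustrative only, not the patch's fsm_local_search(); names and values below are made up for this example). For the 3-block "ANA" case shown above, block 1 is never probed, matching the diagram; the caller tries the returned block and, if it turns out not to have enough room, clears its flag so the next call moves on to a lower block.

/*
 * Toy model of searching a local map in descending block-number order.
 */
#include <stdio.h>

#define FSM_LOCAL_NOT_AVAIL 0
#define FSM_LOCAL_AVAIL		1
#define InvalidBlockNumber	(-1)

static int	local_map_nblocks = 3;	/* a 3-block heap: the map is "ANA" */
static unsigned char local_map[3] = {
	FSM_LOCAL_AVAIL, FSM_LOCAL_NOT_AVAIL, FSM_LOCAL_AVAIL
};

static int
local_map_search(void)
{
	int			blkno;

	for (blkno = local_map_nblocks - 1; blkno >= 0; blkno--)
	{
		if (local_map[blkno] == FSM_LOCAL_AVAIL)
			return blkno;
	}
	return InvalidBlockNumber;	/* nothing left to try; extend the heap */
}

int
main(void)
{
	int			blkno;

	while ((blkno = local_map_search()) != InvalidBlockNumber)
	{
		printf("trying block %d\n", blkno);
		local_map[blkno] = FSM_LOCAL_NOT_AVAIL;	/* pretend it had no room */
	}
	printf("map exhausted, extend the relation\n");
	return 0;
}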
On Wed, Jan 16, 2019 at 11:40 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > can use a macro for the same? I have changed this in the attached > > patch, see what you think about it. I have used it at a few other > > places as well. > > The macro adds clarity, so I'm in favor of using it. It just occurred to me that the style FSM_LOCAL_MAP_EXISTS seems more common for macros that refer to constants, and FSMLocalMapExists for expressions, but I've only seen a small amount of the code base. Do we have a style preference here, or is it more a matter of matching the surrounding code? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
John Naylor <john.naylor@2ndquadrant.com> writes: > It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more > common for macros that refer to constants, and FSMLocalMapExists for > expressions, but I've only seen a small amount of the code base. Do we > have a style preference here, or is it more a matter of matching the > surrounding code? I believe there's a pretty longstanding tradition in C coding to use all-caps names for macros representing constants. Some people think that goes for all macros period, but I'm not on board with that for function-like macros. Different parts of the PG code base make different choices between camel-case and underscore-separation for multiword function names. For that, I'd say match the style of nearby code. regards, tom lane
On Wed, Jan 16, 2019 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > John Naylor <john.naylor@2ndquadrant.com> writes: > > It just occured to me that the style FSM_LOCAL_MAP_EXISTS seems more > > common for macros that refer to constants, and FSMLocalMapExists for > > expressions, but I've only seen a small amount of the code base. Do we > > have a style preference here, or is it more a matter of matching the > > surrounding code? > I am fine with the style (FSMLocalMapExists) you are suggesting, but see the similar macros in nearby code like: #define FSM_TREE_DEPTH ((SlotsPerFSMPage >= 1626) ? 3 : 4) I think the above is not an exact match. So, I have looked around and found few other macros which serve a somewhat similar purpose, see below: #define ATT_IS_PACKABLE(att) \ ((att)->attlen == -1 && (att)->attstorage != 'p') #define VARLENA_ATT_IS_PACKABLE(att) \ ((att)->attstorage != 'p') #define CHECK_REL_PROCEDURE(pname) #define SPTEST(f, x, y) \ DatumGetBool(DirectFunctionCall2(f, PointPGetDatum(x), PointPGetDatum(y))) > I believe there's a pretty longstanding tradition in C coding to use > all-caps names for macros representing constants. Some people think > that goes for all macros period, but I'm not on board with that for > function-like macros. > > Different parts of the PG code base make different choices between > camel-case and underscore-separation for multiword function names. > For that, I'd say match the style of nearby code. > Yes, that is what we normally do. However, in some cases, we might need to refer to other places as well which I think is the case here. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
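For readers skimming the style question, the two spellings under discussion would look roughly like this (the macro body follows Amit's earlier suggestion to test fsm_local_map.nblocks > 0; the struct and main() are scaffolding added only so the fragment compiles, not code from the patch).

#include <stdio.h>

/* stand-in for the backend-local map state discussed upthread */
struct
{
	int			nblocks;
}			fsm_local_map = {3};

/* constant-style spelling: all caps with underscores */
#define FSM_LOCAL_MAP_EXISTS	(fsm_local_map.nblocks > 0)

/* function-like spelling: camel case */
#define FSMLocalMapExists()		(fsm_local_map.nblocks > 0)

int
main(void)
{
	printf("%d %d\n", FSM_LOCAL_MAP_EXISTS, FSMLocalMapExists());
	return 0;
}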
On Wed, Jan 16, 2019 at 10:10 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 16, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Fri, Jan 11, 2019 at 3:54 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > 1. > > Commit message: > > > Any pages with wasted free space become visible at next relation extension, so we still control table bloat. > > > > I think the free space will be available after the next pass of > > vacuum, no? How can relation extension make it available? > > To explain, this diagram shows the map as it looks for different small > table sizes: > > 0123 > A > NA > ANA > NANA > > So for a 3-block table, the alternating strategy never checks block 1. > Any free space block 1 has acquired via delete-and-vacuum will become > visible if it extends to 4 blocks. We are accepting a small amount of > bloat for improved performance, as discussed. Would it help to include > this diagram in the README? > Yes, I think it would be good if you can explain the concept of local-map with the help of this example. > > 2. > > +2. For very small heap relations, the FSM would be relatively large and > > +wasteful, so as of PostgreSQL 12 we refrain from creating the FSM for > > +heaps with HEAP_FSM_CREATION_THRESHOLD pages or fewer, both to save space > > +and to improve performance. To locate free space in this case, we simply > > +iterate over the heap, trying alternating pages in turn. There may be some > > +wasted free space in this case, but it becomes visible again upon next > > +relation extension. > > > > a. Again, how space becomes available at next relation extension. > > b. I think there is no use of mentioning the version number in the > > above comment, this code will be present from PG-12, so one can find > > out from which version this optimization is added. > > It fits with the reference to PG 8.4 earlier in the document. I chose > to be consistent, but to be honest, I'm not much in favor of a lot of > version references in code/READMEs. > Then let's not add a reference to the version number in this case. I also don't see much advantage of adding version number at least in this case. > > I'll include these with some grammar corrections in the next version. > Okay, thanks! -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Yes, I think it would be good if you can explain the concept of > local-map with the help of this example. > Then let's not add a reference to the version number in this case. I Okay, done in v14. I kept your spelling of the new macro. One minor detail added: use uint8 rather than char for the local map array. This seems to be preferred, especially in this file. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 17, 2019 at 11:13 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 16, 2019 at 10:35 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yes, I think it would be good if you can explain the concept of > > local-map with the help of this example. > > > Then let's not add a reference to the version number in this case. I > > Okay, done in v14. I kept your spelling of the new macro. One minor > detail added: use uint8 rather than char for the local map array. This > seems to be preferred, especially in this file. > I am fine with your change. Few more comments: 1. I think we should not allow to create FSM for toast tables as well till there size reaches HEAP_FSM_CREATION_THRESHOLD. If you try below test, you can see that FSM will be created for the toast table even if the size of toast relation is 1 page. CREATE OR REPLACE FUNCTION random_text(length INTEGER) RETURNS TEXT LANGUAGE SQL AS $$ select string_agg(chr (32+(random()*96)::int), '') from generate_series(1,length); $$; create table tt(c1 int, c2 text); insert into tt values(1, random_text(2500)); Vacuum tt; I have fixed this in the attached patch, kindly verify it once and see if you can add the test for same as well. 2. -CREATE TABLE test1 (a int, b int); -INSERT INTO test1 VALUES (16777217, 131584); +CREATE TABLE test_rel_forks (a int); +-- Make sure there are enough blocks in the heap for the FSM to be created. +INSERT INTO test_rel_forks SELECT g from generate_series(1,10000) g; -VACUUM test1; -- set up FSM +-- set up FSM and VM +VACUUM test_rel_forks; This test will create 45 pages instead of 1. I know that to create FSM, we now need more than 4 pages, but 45 seems to be on the higher side. I think we should not unnecessarily populate more data if there is no particular need for it, let's restrict the number of pages to 5 if possible. 3. -SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1; - fsm_1 -------- - 8192 -(1 row) - -SELECT octet_length (get_raw_page('test1', 'vm', 0)) AS vm_0; +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; +ERROR: block number 10 is out of range for relation "test_rel_forks" Why have you changed the test definition here? Previously test checks the existing FSM page, but now it tries to access out of range page. Apart from the above, I have changed one sentence in README. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
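A tiny standalone model of the creation rule being discussed (not the server code; the function and its signature are invented for illustration, while the threshold and the relkind letters follow this thread and pg_class conventions): only ordinary heaps and toast tables have their FSM suppressed, and only while they are at or below HEAP_FSM_CREATION_THRESHOLD pages; other relkinds keep their existing behavior.

/*
 * Toy model of the "is the FSM suppressed for this relation?" rule.
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define HEAP_FSM_CREATION_THRESHOLD 4

#define RELKIND_RELATION	'r'
#define RELKIND_TOASTVALUE	't'
#define RELKIND_INDEX		'i'

static bool
fsm_suppressed_by_threshold(char relkind, int nblocks)
{
	/* only heaps and toast tables are affected by the threshold */
	if (relkind != RELKIND_RELATION && relkind != RELKIND_TOASTVALUE)
		return false;
	return nblocks <= HEAP_FSM_CREATION_THRESHOLD;
}

int
main(void)
{
	printf("1-page toast table: FSM suppressed? %s\n",
		   fsm_suppressed_by_threshold(RELKIND_TOASTVALUE, 1) ? "yes" : "no");
	printf("5-page heap: FSM suppressed?        %s\n",
		   fsm_suppressed_by_threshold(RELKIND_RELATION, 5) ? "yes" : "no");
	printf("1-page index: FSM suppressed?       %s\n",
		   fsm_suppressed_by_threshold(RELKIND_INDEX, 1) ? "yes" : "no");
	return 0;
}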
On Sat, Jan 19, 2019 at 8:06 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 17, 2019 at 11:13 PM John Naylor > Few more comments: > 1. > I think we should not allow to create FSM for toast tables as well > till there size reaches HEAP_FSM_CREATION_THRESHOLD. If you try below > test, you can see that FSM will be created for the toast table even if > the size of toast relation is 1 page. ... > I have fixed this in the attached patch, kindly verify it once and see > if you can add the test for same as well. Works for me. For v16, I've added and tested similar logic to pg_upgrade and verified that toast tables work the same as normal tables in recovery. I used a slightly different method to generate the long random string to avoid creating a function. Also, some cosmetic adjustments -- I changed the regression test to use 'i' instead of 'g' to match the use of generate_series in most other tests, and made capitalization more consistent. > 2. > -CREATE TABLE test1 (a int, b int); > -INSERT INTO test1 VALUES (16777217, 131584); > +CREATE TABLE test_rel_forks (a > int); > +-- Make sure there are enough blocks in the heap for the FSM to be created. > +INSERT INTO test_rel_forks SELECT g > from generate_series(1,10000) g; > > -VACUUM test1; -- set up FSM > +-- set up FSM and VM > +VACUUM test_rel_forks; > > This test will create 45 pages instead of 1. I know that to create > FSM, we now need more than 4 pages, but 45 seems to be on the higher > side. I think we should not unnecessarily populate more data if there > is no particular need for it, let's restrict the number of pages to 5 > if possible. Good idea, done here and in the fsm regression test. > 3. > -SELECT octet_length(get_raw_page('test1', 'fsm', 1)) AS fsm_1; > - fsm_1 > -------- > - 8192 > -(1 row) > - > -SELECT octet_length > (get_raw_page('test1', 'vm', 0)) AS vm_0; > +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > +ERROR: block number 10 is out of range for relation "test_rel_forks" > > Why have you changed the test definition here? Previously test checks > the existing FSM page, but now it tries to access out of range page. The patch is hard to read here, but I still have a test for the existing FSM page: -SELECT octet_length(get_raw_page('test1', 'fsm', 0)) AS fsm_0; +SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; +ERROR: block number 100 is out of range for relation "test_rel_forks" +SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; fsm_0 ------- 8192 (1 row) I have a test for in-range and out-of-range for each relation fork. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > Review of v16-0002-During-pg_upgrade-conditionally-skip-transfer-of: - * Copy/link any fsm and vm files, if they exist + * Copy/link any fsm and vm files, if they exist and if they would + * be created in the new cluster. */ - transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); + if ((maps[mapnum].relkind != RELKIND_RELATION && + maps[mapnum].relkind != RELKIND_TOASTVALUE) || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION (new_cluster.major_version) <= 1100) + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); So we won't allow transfer of FSM files if their size is below HEAP_FSM_CREATION_THRESHOLD. What will be its behavior in link mode? It seems that the old files will remain there. Will it create any problem when we try to create the files via the new server, can you once test this case? Also, another case to think in this regard is the upgrade for standby servers, if you read below paragraph from the user manual [1], you will see what I am worried about? "What this does is to record the links created by pg_upgrade's link mode that connect files in the old and new clusters on the primary server. It then finds matching files in the standby's old cluster and creates links for them in the standby's new cluster. Files that were not linked on the primary are copied from the primary to the standby. (They are usually small.)" [1] - https://www.postgresql.org/docs/devel/pgupgrade.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So we won't allow transfer of FSM files if their size is below > HEAP_FSM_CREATION_THRESHOLD. What will be its behavior in link mode? > It seems that the old files will remain there. Will it create any > problem when we try to create the files via the new server, can you > once test this case? I tried upgrading in --link mode, and on the new cluster, enlarging the table past the threshold causes a new FSM to be created as expected. > Also, another case to think in this regard is the upgrade for standby > servers, if you read below paragraph from the user manual [1], you > will see what I am worried about? > > "What this does is to record the links created by pg_upgrade's link > mode that connect files in the old and new clusters on the primary > server. It then finds matching files in the standby's old cluster and > creates links for them in the standby's new cluster. Files that were > not linked on the primary are copied from the primary to the standby. > (They are usually small.)" > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html Trying this, I ran into a couple problems. I'm probably doing something wrong, but I can't help but think there's a pg_upgrade bug/feature I'm unaware of: I set up my test to have primary directory data1 and for the secondary standby/data1. I instructed pg_upgrade to upgrade data1 into data1u, and I tried the rsync recipe in the docs quoted above, and the upgraded standby wouldn't go into recovery. While debugging that, I found surprisingly that pg_upgrade also went further and upgraded standby/data1 into standby/data1u. I tried deleting standby/data1u before running the rsync command and still nothing. Because the upgraded secondary is non-functional, I can't really answer your question. Not sure if this is normal, but the pg_upgraded new cluster no longer had the replication slot. Re-adding it didn't allow my upgraded secondary to go into recovery, either. (I made sure to copy the recovery settings, so that can't be the problem) -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jan 20, 2019 at 5:19 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > I have a test for in-range and out-of-range for each relation fork. > I think the first two patches (a) removal of dead code in bootstrap and (b) the core patch to avoid creation of FSM file for the small table are good now. I have prepared the patches along with commit message. There is no change except for some changes in README and commit message of the second patch. Kindly let me know what you think about them? I think these two patches can go even without the upgrade patch (during pg_upgrade, conditionally skip transfer of FSMs.) which is still under discussion. However, I am not in a hurry if you or other thinks that upgrade patch must be committed along with the second patch. I think the upgrade patch is generally going on track but might need some more review. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I think the first two patches (a) removal of dead code in bootstrap > and (b) the core patch to avoid creation of FSM file for the small > table are good now. I have prepared the patches along with commit > message. There is no change except for some changes in README and > commit message of the second patch. Kindly let me know what you think > about them? Good to hear! The additional language is fine. In "Once the FSM is created for heap", I would just change that to "...for a heap". > I think these two patches can go even without the upgrade patch > (during pg_upgrade, conditionally skip transfer of FSMs.) which is > still under discussion. However, I am not in a hurry if you or other > thinks that upgrade patch must be committed along with the second > patch. I think the upgrade patch is generally going on track but > might need some more review. The pg_upgrade piece is a nice-to-have feature and not essential, so can go in later. Additional review is also welcome. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Also, another case to think in this regard is the upgrade for standby > servers, if you read below paragraph from the user manual [1], you > will see what I am worried about? > > "What this does is to record the links created by pg_upgrade's link > mode that connect files in the old and new clusters on the primary > server. It then finds matching files in the standby's old cluster and > creates links for them in the standby's new cluster. Files that were > not linked on the primary are copied from the primary to the standby. > (They are usually small.)" > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html I am still not able to get the upgraded standby to go into recovery without resorting to pg_basebackup, but in another attempt to investigate your question I tried the following (data1 = old cluster, data2 = new cluster): mkdir -p data1 data2 standby echo 'heap' > data1/foo echo 'fsm' > data1/foo_fsm # simulate streaming replication rsync --archive data1 standby # simulate pg_upgrade, skipping FSM ln data1/foo -t data2/ rsync --archive --delete --hard-links --size-only --no-inc-recursive data1 data2 standby # result ls standby/data1 ls standby/data2 The result is that foo_fsm is not copied to standby/data2, contrary to what the docs above imply for other unlinked files. Can anyone shed light on this? -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jan 23, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 23, 2019 at 7:09 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think the first two patches (a) removal of dead code in bootstrap > > and (b) the core patch to avoid creation of FSM file for the small > > table are good now. I have prepared the patches along with commit > > message. There is no change except for some changes in README and > > commit message of the second patch. Kindly let me know what you think > > about them? > > Good to hear! The additional language is fine. In "Once the FSM is > created for heap", I would just change that to "...for a heap". > Sure, apart from this I have run pgindent on the patches and make some changes accordingly. Latest patches attached (only second patch has some changes). I will take one more pass on Monday morning (28th Jan) and will commit unless you or others see any problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 21, 2019 at 6:32 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Also, another case to think in this regard is the upgrade for standby > > servers, if you read below paragraph from the user manual [1], you > > will see what I am worried about? > > > > "What this does is to record the links created by pg_upgrade's link > > mode that connect files in the old and new clusters on the primary > > server. It then finds matching files in the standby's old cluster and > > creates links for them in the standby's new cluster. Files that were > > not linked on the primary are copied from the primary to the standby. > > (They are usually small.)" > > > > [1] - https://www.postgresql.org/docs/devel/pgupgrade.html > > I am still not able to get the upgraded standby to go into recovery > without resorting to pg_basebackup, but in another attempt to > investigate your question I tried the following (data1 = old cluster, > data2 = new cluster): > > > mkdir -p data1 data2 standby > > echo 'heap' > data1/foo > echo 'fsm' > data1/foo_fsm > > # simulate streaming replication > rsync --archive data1 standby > > # simulate pg_upgrade, skipping FSM > ln data1/foo -t data2/ > > rsync --archive --delete --hard-links --size-only --no-inc-recursive > data1 data2 standby > > # result > ls standby/data1 > ls standby/data2 > > > The result is that foo_fsm is not copied to standby/data2, contrary to > what the docs above imply for other unlinked files. Can anyone shed > light on this? > Is foo_fsm present in standby/data1? I think what doc means to say is that it copies any unlinked files present in primary's new cluster (which in your case will be data2). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 24, 2019 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > Few comments related to pg_upgrade patch: 1. + if ((maps[mapnum].relkind != RELKIND_RELATION && + maps[mapnum].relkind != RELKIND_TOASTVALUE) || + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); I think this check will needlessly be performed for future versions as well, say when wants to upgrade from PG12 to PG13. That might not create any problem, but let's try to be more precise. Can you try to rewrite this check? You might want to encapsulate it inside a function. I have thought of doing something similar to what we do for vm, see checks relate to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I guess for this patch it is not important to check catalog version as even if someone tries to upgrade to the same version. 2. transfer_relfile() { .. - /* Is it an extent, fsm, or vm file? */ - if (type_suffix[0] != '\0' || segno != 0) + /* Did file open fail? */ + if (stat(old_file, &statbuf) != 0) .. } So from now onwards, we will call stat for even 0th segment which means there is one additional system call for each relation, not sure if that matters, but I think there is no harm in once testing with a large number of relations say 10K to 50K relations which have FSM. The other alternative is we can fetch pg_class.relpages and rely on that to take this decision, but again if that is not updated, we might take the wrong decision. Anyone else has any thoughts on this point? 3. -static void +static Size transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit) If we decide to go with the approach proposed by you, we should add some comments atop this function for return value change? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 3:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > mkdir -p data1 data2 standby > > > > echo 'heap' > data1/foo > > echo 'fsm' > data1/foo_fsm > > > > # simulate streaming replication > > rsync --archive data1 standby > > > > # simulate pg_upgrade, skipping FSM > > ln data1/foo -t data2/ > > > > rsync --archive --delete --hard-links --size-only --no-inc-recursive > > data1 data2 standby > > > > # result > > ls standby/data1 > > ls standby/data2 > > > > > > The result is that foo_fsm is not copied to standby/data2, contrary to > > what the docs above imply for other unlinked files. Can anyone shed > > light on this? > > > > Is foo_fsm present in standby/data1? Yes it is. > I think what doc means to say is > that it copies any unlinked files present in primary's new cluster > (which in your case will be data2). In that case, I'm still confused why that doc says, "Unfortunately, rsync needlessly copies files associated with temporary and unlogged tables because these files don't normally exist on standby servers." I fail to see why the primary's new cluster would have these if they weren't linked. And in the case we're discussing here, the skipped FSMs won't be on data2, so won't end up in standby/data2. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I think what doc means to say is > > that it copies any unlinked files present in primary's new cluster > > (which in your case will be data2). > > In that case, I'm still confused why that doc says, "Unfortunately, > rsync needlessly copies files associated with temporary and unlogged > tables because these files don't normally exist on standby servers." > I fail to see why the primary's new cluster would have these if they > weren't linked. > Why unlogged files won't be in primary's new cluster? After the upgrade, they should be present in a new cluster if they were present in the old cluster. > And in the case we're discussing here, the skipped > FSMs won't be on data2, so won't end up in standby/data2. > Right. I think we are safe with respect to rsync because I have seen that we do rewrite the vm files in link mode and rsync will copy them from primary's new cluster. I think you can try to address my other comments on your pg_upgrade patch. Once we agree on the code, we need to test below scenarios: (a) upgrade from all supported versions to the latest version (b) upgrade standby with and without using rsync. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 1. > + if ((maps[mapnum].relkind != RELKIND_RELATION && > + maps[mapnum].relkind != RELKIND_TOASTVALUE) || > + first_seg_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ || > + GET_MAJOR_VERSION(new_cluster.major_version) <= 1100) > + (void) transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit); > > I think this check will needlessly be performed for future versions as > well, say when wants to upgrade from PG12 to PG13. That might not > create any problem, but let's try to be more precise. Can you try to > rewrite this check? You might want to encapsulate it inside a > function. I have thought of doing something similar to what we do for > vm, see checks relate to VISIBILITY_MAP_FROZEN_BIT_CAT_VER, but I > guess for this patch it is not important to check catalog version as > even if someone tries to upgrade to the same version. Agreed, done for v19 (I've only attached the pg_upgrade patch). > 2. > transfer_relfile() > { > .. > - /* Is it an extent, fsm, or vm file? */ > - if (type_suffix[0] != '\0' || segno != 0) > + /* Did file open fail? */ > + if (stat(old_file, &statbuf) != 0) > .. > } > > So from now onwards, we will call stat for even 0th segment which > means there is one additional system call for each relation, not sure > if that matters, but I think there is no harm in once testing with a > large number of relations say 10K to 50K relations which have FSM. Performance testing is probably a good idea anyway, but I went ahead and implemented your next idea: > The other alternative is we can fetch pg_class.relpages and rely on > that to take this decision, but again if that is not updated, we might > take the wrong decision. We can think of it this way: Which is worse, 1. Transferring a FSM we don't need, or 2. Skipping a FSM we need I'd say #2 is worse. So, in v19 we check pg_class.relpages and if it's a heap and less than or equal the threshold we call stat on the 0th segment to verify. In the common case, the cost of the stat call is offset by not linking the FSM. Despite needing another pg_class field, I think this code is actually easier to read than my earlier versions. > 3. > -static void > +static Size > transfer_relfile(FileNameMap *map, const char *type_suffix, bool > vm_must_add_frozenbit) > > If we decide to go with the approach proposed by you, we should add > some comments atop this function for return value change? Done, as well as other comment edits. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 24, 2019 at 9:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Jan 25, 2019 at 1:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Wed, Jan 23, 2019 at 11:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I think what doc means to say is > > > that it copies any unlinked files present in primary's new cluster > > > (which in your case will be data2). > > > > In that case, I'm still confused why that doc says, "Unfortunately, > > rsync needlessly copies files associated with temporary and unlogged > > tables because these files don't normally exist on standby servers." > > I fail to see why the primary's new cluster would have these if they > > weren't linked. > > > > Why unlogged files won't be in primary's new cluster? After the > upgrade, they should be present in a new cluster if they were present > in the old cluster. I assume they would be linked, however (I haven't checked this). I did think rewritten VM files would fall under this, but I was confused about unlogged files. > > And in the case we're discussing here, the skipped > > FSMs won't be on data2, so won't end up in standby/data2. > > > > Right. I think we are safe with respect to rsync because I have seen > that we do rewrite the vm files in link mode and rsync will copy them > from primary's new cluster. Okay. > I think you can try to address my other comments on your pg_upgrade > patch. Once we agree on the code, we need to test below scenarios: > (a) upgrade from all supported versions to the latest version > (b) upgrade standby with and without using rsync. Sounds good. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 24, 2019 at 5:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Performance testing is probably a good idea anyway, but I went ahead > and implemented your next idea: > > > The other alternative is we can fetch pg_class.relpages and rely on > > that to take this decision, but again if that is not updated, we might > > take the wrong decision. > > We can think of it this way: Which is worse, > 1. Transferring a FSM we don't need, or > 2. Skipping a FSM we need > > I'd say #2 is worse. > Agreed. > So, in v19 we check pg_class.relpages and if it's > a heap and less than or equal the threshold we call stat on the 0th > segment to verify. > Okay, but the way logic is implemented appears clumsy to me. @@ -234,16 +243,40 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro { /* File does not exist? That's OK, just return */ if (errno == ENOENT) - return; + return first_seg_size; else - pg_fatal("error while checking for file existence \"%s.%s\" (\"%s\" to \"%s\"): %s\n", - map->nspname, map->relname, old_file, new_file, - strerror(errno)); + goto fatal; } /* If file is empty, just return */ if (statbuf.st_size == 0) - return; + return first_seg_size; + } + + /* Save size of the first segment of the main fork. */ + + else if (map->relpages <= HEAP_FSM_CREATION_THRESHOLD && + (map->relkind == RELKIND_RELATION || + map->relkind == RELKIND_TOASTVALUE)) + { + /* + * In this case, if pg_class.relpages is wrong, it's possible + * that a FSM will be skipped when we actually need it. To guard + * against this, we verify the size of the first segment. + */ + if (stat(old_file, &statbuf) != 0) + goto fatal; + else + first_seg_size = statbuf.st_size; + } + else + { + /* + * For indexes etc., we don't care if pg_class.relpages is wrong, + * since we always transfer their FSMs. For heaps, we might + * transfer a FSM when we don't need to, but this is harmless. + */ + first_seg_size = Min(map->relpages, RELSEG_SIZE) * BLCKSZ; } The function transfer_relfile has no clue about skipping of FSM stuff, but it contains comments about it. The check "if (map->relpages <= HEAP_FSM_CREATION_THRESHOLD ..." will needlessly be executed for each segment. I think there is some value in using the information from this function to skip fsm files, but the code doesn't appear to fit well, how about moving this check to new function new_cluster_needs_fsm()? > In the common case, the cost of the stat call is > offset by not linking the FSM. > Agreed. > Despite needing another pg_class field, > I think this code is actually easier to read than my earlier versions. > Yeah, the code appears cleaner from the last version, but I think we can do more in that regards. 
One more minor comment: snprintf(query + strlen(query), sizeof(query) - strlen(query), "SELECT all_rels.*, n.nspname, c.relname, " - " c.relfilenode, c.reltablespace, %s " + " c.relfilenode, c.reltablespace, c.relpages, c.relkind, %s " "FROM (SELECT * FROM regular_heap " " UNION ALL " " SELECT * FROM toast_heap " @@ -525,6 +530,8 @@ get_rel_infos(ClusterInfo *cluster, DbInfo *dbinfo) i_relname = PQfnumber(res, "relname"); i_relfilenode = PQfnumber(res, "relfilenode"); i_reltablespace = PQfnumber(res, "reltablespace"); + i_relpages = PQfnumber(res, "relpages"); + i_relkind = PQfnumber(res, "relkind"); i_spclocation = PQfnumber(res, "spclocation"); The order in which relkind and relpages is used in the above code is different from the order in which it is mentioned in the query, it won't matter, but keeping in order will make look code consistent. I have made this and some more minor code adjustments in the attached patch. If you like those, you can include them in the next version of your patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Jan 26, 2019 at 5:05 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > So, in v19 we check pg_class.relpages and if it's > > a heap and less than or equal the threshold we call stat on the 0th > > segment to verify. > > > > Okay, but the way logic is implemented appears clumsy to me. > The function transfer_relfile has no clue about skipping of FSM stuff, > but it contains comments about it. Yeah, I wasn't entirely happy with how that turned out. > I think there is some value in using the information from > this function to skip fsm files, but the code doesn't appear to fit > well, how about moving this check to new function > new_cluster_needs_fsm()? For v21, new_cluster_needs_fsm() has all responsibility for obtaining the info it needs. I think this is much cleaner, but there is a small bit of code duplication since it now has to form the file name. One thing we could do is form the base old/new file names in transfer_single_new_db() and pass those to transfer_relfile(), which will only add suffixes and segment numbers. We could then pass the base old file name to new_cluster_needs_fsm() and use it as is. Not sure if that's worthwhile, though. > The order in which relkind and relpages is used in the above code is > different from the order in which it is mentioned in the query, it > won't matter, but keeping in order will make look code consistent. I > have made this and some more minor code adjustments in the attached > patch. If you like those, you can include them in the next version of > your patch. Okay, done. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
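To make the shape of that refactoring concrete, here is a rough standalone sketch of the check new_cluster_needs_fsm() is being asked to perform; it is not the v21 code. The helper name needs_fsm_transfer(), the example path, and the hard-coded threshold of 4 blocks are illustrative assumptions only. The logic itself, trust relpages when it says the heap is large and stat() the first main-fork segment when it says the heap is small, follows the patch hunks quoted earlier in the thread.

/*
 * Rough standalone sketch of the decision being moved into
 * new_cluster_needs_fsm() -- NOT the actual v21 code.  The helper name,
 * the example path, and the hard-coded threshold of 4 blocks are
 * assumptions for illustration.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

#define BLCKSZ 8192
#define HEAP_FSM_CREATION_THRESHOLD 4	/* blocks (assumed value) */

typedef struct
{
	char		relkind;		/* 'r' = heap, 't' = TOAST, 'i' = index, ... */
	int			relpages;		/* possibly-stale pg_class.relpages */
	const char *main_fork_path; /* segment 0 of the main fork */
} RelFileInfo;

/* Does the new cluster need a FSM for this relation? */
static bool
needs_fsm_transfer(const RelFileInfo *rel)
{
	struct stat st;

	/* Indexes etc. always get their FSM transferred. */
	if (rel->relkind != 'r' && rel->relkind != 't')
		return true;

	/* The catalog already says the heap is above the threshold. */
	if (rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
		return true;

	/*
	 * relpages claims the heap is small, but it may be stale, so verify
	 * with the real size of the first segment.  (The real patch treats a
	 * stat() failure as fatal; here we simply err on the side of
	 * transferring the FSM.)
	 */
	if (stat(rel->main_fork_path, &st) != 0)
		return true;

	return st.st_size > (off_t) HEAP_FSM_CREATION_THRESHOLD * BLCKSZ;
}

int
main(void)
{
	RelFileInfo small_heap = {'r', 2, "/tmp/pgdata/base/1/16384"};

	printf("transfer FSM? %s\n", needs_fsm_transfer(&small_heap) ? "yes" : "no");
	return 0;
}

Deciding this once per relation, before the per-segment loop, also addresses the earlier complaint that the threshold check was being repeated for every segment.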
On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Sure, apart from this I have run pgindent on the patches and make some > changes accordingly. Latest patches attached (only second patch has > some changes). I will take one more pass on Monday morning (28th Jan) > and will commit unless you or others see any problem. > Pushed these two patches. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Sure, apart from this I have run pgindent on the patches and make some > > changes accordingly. Latest patches attached (only second patch has > > some changes). I will take one more pass on Monday morning (28th Jan) > > and will commit unless you or others see any problem. > > Pushed these two patches. Thank you for your input and detailed review! Thank you Mithun for testing! -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 28, 2019 at 9:16 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 3:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 24, 2019 at 9:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Sure, apart from this I have run pgindent on the patches and make some > > > changes accordingly. Latest patches attached (only second patch has > > > some changes). I will take one more pass on Monday morning (28th Jan) > > > and will commit unless you or others see any problem. > > > > Pushed these two patches. > > Thank you for your input and detailed review! Thank you Mithun for testing! > There are a few buildfarm failures due to this commit, see my email on pgsql-committers. If you have time, you can also once look into those. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > There are a few buildfarm failures due to this commit, see my email on > pgsql-committers. If you have time, you can also once look into > those. I didn't see anything in common with the configs of the failed members. None have a non-default BLCKSZ that I can see. Looking at this typical example from woodlouse: ================== pgsql.build/src/test/regress/regression.diffs ================== --- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out 2019-01-28 04:43:09.031456700 +0100 +++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out 2019-01-28 05:06:20.351100400 +0100 @@ -26,7 +26,7 @@ pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; heap_size | fsm_size -----------+---------- - 24576 | 0 + 32768 | 0 (1 row) ***It seems like the relation extended when the new records should have gone into block 0. -- Extend table with enough blocks to exceed the FSM threshold @@ -56,7 +56,7 @@ SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; fsm_size ---------- - 16384 + 24576 (1 row) ***And here it seems vacuum didn't truncate the FSM. I wonder if the heap didn't get truncated either. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jan 28, 2019 at 10:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are a few buildfarm failures due to this commit, see my email on > > pgsql-committers. If you have time, you can also once look into > > those. > > I didn't see anything in common with the configs of the failed > members. None have a non-default BLCKSZ that I can see. > > Looking at this typical example from woodlouse: > > ================== pgsql.build/src/test/regress/regression.diffs > ================== > --- C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/expected/fsm.out > 2019-01-28 04:43:09.031456700 +0100 > +++ C:/buildfarm/buildenv/HEAD/pgsql.build/src/test/regress/results/fsm.out > 2019-01-28 05:06:20.351100400 +0100 > @@ -26,7 +26,7 @@ > pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > heap_size | fsm_size > -----------+---------- > - 24576 | 0 > + 32768 | 0 > (1 row) > > ***It seems like the relation extended when the new records should > have gone into block 0. > > -- Extend table with enough blocks to exceed the FSM threshold > @@ -56,7 +56,7 @@ > SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > fsm_size > ---------- > - 16384 > + 24576 > (1 row) > > ***And here it seems vacuum didn't truncate the FSM. I wonder if the > heap didn't get truncated either. > Yeah, it seems to me that vacuum is not able to truncate the relation, see my latest reply on another thread [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1JntHd7X6dLJVPGYV917HejjhbMKXn9m_RnnCE162LbLA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 10:03 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are a few buildfarm failures due to this commit, see my email on > > pgsql-committers. If you have time, you can also once look into > > those. > > I didn't see anything in common with the configs of the failed > members. None have a non-default BLCKSZ that I can see. > I have done an analysis of the different failures on buildfarm. 1. @@ -26,7 +26,7 @@ pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; heap_size | fsm_size -----------+---------- - 24576 | 0 + 32768 | 0 (1 row) -- Extend table with enough blocks to exceed the FSM threshold @@ -56,7 +56,7 @@ SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; fsm_size ---------- - 16384 + 24576 (1 row) As discussed on another thread, this seems to be due to the reason that a parallel auto-analyze doesn't allow vacuum to remove dead-row versions. To fix this, I think we should avoid having a dependency on vacuum to remove dead rows. 2. @@ -15,13 +15,9 @@ SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; ERROR: block number 100 is out of range for relation "test_rel_forks" SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; - fsm_0 -------- - 8192 -(1 row) - +ERROR: could not open file "base/50769/50798_fsm": No such file or directory SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; -ERROR: block number 10 is out of range for relation "test_rel_forks" +ERROR: could not open file "base/50769/50798_fsm": No such file or directory This indicates that even though the Vacuum is executed, but the FSM doesn't get created. This could be due to different BLCKSZ, but the failed machines don't seem to have a non-default value of it. I am not sure why this could happen, maybe we need to check once in the failed regression database to see the size of relation? 3. Failure on 'mantid' 2019-01-28 00:13:55.191 EST [123979] 001_pgbench_with_server.pl LOG: statement: CREATE UNLOGGED TABLE insert_tbl (id serial primary key); 2019-01-28 00:13:55.218 EST [123982] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); 2019-01-28 00:13:55.219 EST [123983] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); 2019-01-28 00:13:55.220 EST [123984] 001_pgbench_with_server.pl LOG: execute P0_0: INSERT INTO insert_tbl SELECT FROM generate_series(1,1000); .. .. TRAP: FailedAssertion("!((rel->rd_rel->relkind == 'r' || rel->rd_rel->relkind == 't') && fsm_local_map.map[oldPage] == 0x01)", File: "freespace.c", Line: 223) I think this can happen if we forget to clear the local map after we get the block with space in function RelationGetBufferForTuple(). I see the race condition in the code where that can happen. Say, we tried all the blocks in the local map and then tried to extend the relation and we didn't get ConditionalLockRelationForExtension, in the meantime, another backend has extended the relation and updated the FSM (via RelationAddExtraBlocks). Now, when the backend that didn't get the extension lock will get the target block from FSM which will be greater than HEAP_FSM_CREATION_THRESHOLD. 
Next, it will find that the block can be used to insert a new row and return the buffer, but won't clear the local map due to below condition in code: @@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len, + + /* + * In case we used an in-memory map of available blocks, reset it + * for next use. + */ + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) + FSMClearLocalMap(); + I think here you need to clear the map if it exists or clear it unconditionally, the earlier one would be better. This test gets executed concurrently by 5 clients, so it can hit the above race condition. 4. Failure on jacana: --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out 2018-09-26 17:53:33 -0400 +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out 2019-01-27 23:14:35 -0500 @@ -252,332 +252,7 @@ ('(0,100)(0,infinity)'), ('(-infinity,0)(0,infinity)'), ('(-infinity,-infinity)(infinity,infinity)'); -SET enable_seqscan = false; -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; .. .. TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", Line: 1118) .. 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process (PID 14388) exited with exit code 3 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process was running: INSERT INTO box_temp VALUES (NULL), I think the reason for this failure is same as previous (as mentioned in point-3), but this can happen in a different way. Say, we have searched the local map and then try to extend a relation 'X' and in the meantime, another backend has extended such that it creates FSM. Now, we will reuse that page and won't clear local map. Now, say we try to insert in relation 'Y' which doesn't have FSM. It will try to set the local map and will find that it already exists, so will fail. Now, the question is how it can happen in this box.sql test. I guess that is happening for some system table which is being populated by Create Index statement executed just before the failing Insert. I think both 3 and 4 are timing issues, so we didn't got in our local regression runs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
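Since the two scenarios above are easy to lose track of, here is a toy single-process model of the backend-local block map; it is not PostgreSQL code, and every name and size in it is invented for illustration. It shows how returning a buffer obtained from a concurrently created FSM without clearing the local map leaves stale state behind, so that the next attempt to build a map for another small relation trips the same kind of assertion seen on mantid and jacana.

/*
 * Toy model of the backend-local block map -- NOT PostgreSQL code.
 * All names and sizes here are invented for illustration.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define LOCAL_MAP_BLOCKS 4		/* stand-in for HEAP_FSM_CREATION_THRESHOLD */

typedef struct
{
	int			nblocks;		/* 0 means "no map in use" */
	bool		avail[LOCAL_MAP_BLOCKS];
} LocalMap;

static LocalMap local_map;		/* one per backend in the real code */

static void
set_local_map(int rel_nblocks)
{
	/* Mirrors the failed assertion: any previous map must have been cleared. */
	assert(local_map.nblocks == 0);
	local_map.nblocks = rel_nblocks;
	for (int i = 0; i < rel_nblocks; i++)
		local_map.avail[i] = true;
}

static void
clear_local_map(void)
{
	memset(&local_map, 0, sizeof(local_map));
}

/*
 * Insert path for a relation that is below the threshold when we look at
 * it; fsm_appeared simulates another backend extending the relation and
 * creating its FSM in the meantime.
 */
static void
insert_into_small_rel(const char *relname, bool fsm_appeared)
{
	set_local_map(LOCAL_MAP_BLOCKS);

	/* ... suppose every mapped block was tried and none had space ... */
	for (int i = 0; i < LOCAL_MAP_BLOCKS; i++)
		local_map.avail[i] = false;

	if (fsm_appeared)
	{
		/*
		 * The target block now comes from the brand-new FSM and is beyond
		 * the threshold, so the buggy code returns without clearing the map.
		 */
		printf("%s: used a block from the new FSM, local map left stale\n",
			   relname);
		return;
	}

	clear_local_map();
	printf("%s: extended the relation, local map cleared\n", relname);
}

int
main(void)
{
	insert_into_small_rel("X", true);	/* leaves a stale map behind */
	insert_into_small_rel("Y", false);	/* aborts in set_local_map() */
	return 0;
}

Built with assertions enabled, the second insert aborts, which is essentially the failure mode described above.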
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 28, 2019 at 10:03 AM John Naylor
> <john.naylor@2ndquadrant.com> wrote:
> >
> > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > There are a few buildfarm failures due to this commit, see my email on
> > > pgsql-committers. If you have time, you can also once look into
> > > those.
> >
> > I didn't see anything in common with the configs of the failed
> > members. None have a non-default BLCKSZ that I can see.
> >
>
> I have done an analysis of the different failures on buildfarm.
>
>
> 2.
> @@ -15,13 +15,9 @@
> SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100;
> ERROR: block number 100 is out of range for relation "test_rel_forks"
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0;
> - fsm_0
> --------
> - 8192
> -(1 row)
> -
> +ERROR: could not open file "base/50769/50798_fsm": No such file or directory
> SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10;
> -ERROR: block number 10 is out of range for relation "test_rel_forks"
> +ERROR: could not open file "base/50769/50798_fsm": No such file or directory
>
> This indicates that even though the Vacuum is executed, but the FSM
> doesn't get created. This could be due to different BLCKSZ, but the
> failed machines don't seem to have a non-default value of it. I am
> not sure why this could happen, maybe we need to check once in the
> failed regression database to see the size of relation?
>
This symptom is shown by the following buildfarm critters:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-01-28%2005%3A05%3A22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-01-28%2003%3A20%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2019-01-28%2003%3A13%3A47
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-01-28%2003%3A07%3A39
All of these seem to run with fsync=off. Is it possible that vacuum has updated the FSM, but the same is not synced to disk, and when we try to read it, we don't get the required page? This is just a guess.
I have checked all the buildfarm failures and I see only 4 symptoms, for which I have sent some initial analysis. I think you can also cross-verify the same.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
>>>>> "Amit" == Amit Kapila <amit.kapila16@gmail.com> writes: Amit> All of these seems to run with fsync=off. Is it possible that Amit> vacuum has updated FSM, but the same is not synced to disk and Amit> when we try to read it, we didn't get the required page? No. fsync never affects what programs see while the system is running, only what happens after an OS crash. -- Andrew (irc:RhodiumToad)
On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > 1. > @@ -26,7 +26,7 @@ > pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > heap_size | fsm_size > -----------+---------- > - 24576 | 0 > + 32768 | 0 > (1 row) > > -- Extend table with enough blocks to exceed the FSM threshold > @@ -56,7 +56,7 @@ > SELECT pg_relation_size('fsm_check_size', 'fsm') AS fsm_size; > fsm_size > ---------- > - 16384 > + 24576 > (1 row) > > > As discussed on another thread, this seems to be due to the reason > that a parallel auto-analyze doesn't allow vacuum to remove dead-row > versions. To fix this, I think we should avoid having a dependency on > vacuum to remove dead rows. Ok, to make the first test here more reliable I will try Andrew's idea to use fillfactor to save free space. As I said earlier, I think that second test isn't helpful and can be dropped. > 2. > @@ -15,13 +15,9 @@ > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > ERROR: block number 100 is out of range for relation "test_rel_forks" > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > - fsm_0 > -------- > - 8192 > -(1 row) > - > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > -ERROR: block number 10 is out of range for relation "test_rel_forks" > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > This indicates that even though the Vacuum is executed, but the FSM > doesn't get created. This could be due to different BLCKSZ, but the > failed machines don't seem to have a non-default value of it. I am > not sure why this could happen, maybe we need to check once in the > failed regression database to see the size of relation? I'm also having a hard time imagining why this failed. Just in case, we could return ctid in a plpgsql loop and stop as soon as we see the 5th block. I've done that for some tests during development and is a safer method anyway. > <timing failures in 3 and 4> > > @@ -377,20 +383,9 @@ RelationGetBufferForTuple(Relation relation, Size len, > + > + /* > + * In case we used an in-memory map of available blocks, reset it > + * for next use. > + */ > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > + FSMClearLocalMap(); > + > > I think here you need to clear the map if it exists or clear it > unconditionally, the earlier one would be better. Ok, maybe all callers should call it unconditonally, but within the function, check "if (FSM_LOCAL_MAP_EXISTS)"? Thanks for investigating the failures -- I'm a bit pressed for time this week. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
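A minimal sketch of that idea (assumed definitions, not the committed code): the clear function checks for itself whether a local map is in use, so every caller can invoke it unconditionally without worrying about which path produced the buffer.

/*
 * Sketch only: FSM_LOCAL_MAP_EXISTS is modeled here as "nblocks > 0",
 * and the struct layout is assumed for illustration.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
	int			nblocks;		/* 0 means no local map in use */
	char		map[4];
} LocalMap;

static LocalMap fsm_local_map;

#define FSM_LOCAL_MAP_EXISTS	(fsm_local_map.nblocks > 0)

static void
FSMClearLocalMap(void)
{
	if (FSM_LOCAL_MAP_EXISTS)
		memset(&fsm_local_map, 0, sizeof(fsm_local_map));
}

int
main(void)
{
	/*
	 * Callers (e.g. the tail of RelationGetBufferForTuple) now call this
	 * unconditionally; it is a cheap no-op when no map was built.
	 */
	FSMClearLocalMap();			/* no map: harmless no-op */
	fsm_local_map.nblocks = 4;	/* pretend a map was built and abandoned */
	FSMClearLocalMap();			/* stale map is wiped */
	printf("nblocks after clear: %d\n", fsm_local_map.nblocks);
	return 0;
}

With the check inside the function, callers no longer need to re-derive the threshold condition, which is what went wrong in the race described earlier.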
On Tue, Jan 29, 2019 at 12:37 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 2. > > @@ -15,13 +15,9 @@ > > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > ERROR: block number 100 is out of range for relation "test_rel_forks" > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > > - fsm_0 > > -------- > > - 8192 > > -(1 row) > > - > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > -ERROR: block number 10 is out of range for relation "test_rel_forks" > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > > This indicates that even though the Vacuum is executed, but the FSM > > doesn't get created. This could be due to different BLCKSZ, but the > > failed machines don't seem to have a non-default value of it. I am > > not sure why this could happen, maybe we need to check once in the > > failed regression database to see the size of relation? > > I'm also having a hard time imagining why this failed. Just in case, > we could return ctid in a plpgsql loop and stop as soon as we see the > 5th block. I've done that for some tests during development and is a > safer method anyway. > I think we can devise some concrete way, but it is better first we try to understand why it failed, otherwise there is always a chance that we will repeat the mistake in some other case. I think we have no other choice, but to request the buildfarm owners to either give us the access to see what happens or help us in investigating the problem. The four buildfarms where it failed were lapwing, locust, dromedary, prairiedog. Among these, the owner of last two is Tom Lane and others I don't recognize. Tom, Andrew, can you help us in getting the access of one of those four? Yet another alternative is the owner can apply the patch attached (this is same what got committed) or reset to commit ac88d2962a and execute below statements and share the results: CREATE EXTENSION pageinspect; CREATE TABLE test_rel_forks (a int); INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i; VACUUM test_rel_forks; SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0; SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0; SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1; If the above statements give error: "ERROR: could not open file ...", then run: Analyze test_rel_forks; Select oid, relname, relpages, reltuples from pg_class where relname like 'test%'; The result of the above tests will tell us whether there are 5 pages in the table or not. If the table contains 5 pages and throws an error, then there is some bug in our code, otherwise, there is something specific to those systems where the above insert doesn't result in 5 pages. > > I think here you need to clear the map if it exists or clear it > > unconditionally, the earlier one would be better. > > Ok, maybe all callers should call it unconditonally, but within the > function, check "if (FSM_LOCAL_MAP_EXISTS)"? > Sounds sensible. I think we should try to reproduce these failures, for ex. 
for pgbench failure, we can try the same test with more clients. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 29, 2019 at 5:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 12:37 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > I think here you need to clear the map if it exists or clear it > > > unconditionally, the earlier one would be better. > > > > Ok, maybe all callers should call it unconditonally, but within the > > function, check "if (FSM_LOCAL_MAP_EXISTS)"? > > > > Sounds sensible. I think we should try to reproduce these failures, > for ex. for pgbench failure, we can try the same test with more > clients. > I am able to reproduce this by changing pgbench test as below: --- a/src/bin/pgbench/t/001_pgbench_with_server.pl +++ b/src/bin/pgbench/t/001_pgbench_with_server.pl @@ -56,9 +56,9 @@ $node->safe_psql('postgres', 'CREATE UNLOGGED TABLE insert_tbl (id serial primary key); '); pgbench( - '--no-vacuum --client=5 --protocol=prepared --transactions=25', + '--no-vacuum --client=10 --protocol=prepared --transactions=25', 0, - [qr{processed: 125/125}], + [qr{processed: 250/250}], You can find this change in attached patch. Then, I ran the make check in src/bin/pgbench multiple times using test_conc_insert.sh. You can vary the number of times the test should run, if you are not able to reproduce it with this. The attached patch (clear_local_map_if_exists_1.patch) atop the main patch fixes the issue for me. Kindly verify the same. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 12:37 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Jan 28, 2019 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > 2. > > > @@ -15,13 +15,9 @@ > > > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > > ERROR: block number 100 is out of range for relation "test_rel_forks" > > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > > > - fsm_0 > > > -------- > > > - 8192 > > > -(1 row) > > > - > > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > > -ERROR: block number 10 is out of range for relation "test_rel_forks" > > > +ERROR: could not open file "base/50769/50798_fsm": No such file or directory > > > > > > This indicates that even though the Vacuum is executed, but the FSM > > > doesn't get created. This could be due to different BLCKSZ, but the > > > failed machines don't seem to have a non-default value of it. I am > > > not sure why this could happen, maybe we need to check once in the > > > failed regression database to see the size of relation? > > > > I'm also having a hard time imagining why this failed. Just in case, > > we could return ctid in a plpgsql loop and stop as soon as we see the > > 5th block. I've done that for some tests during development and is a > > safer method anyway. > > > > I think we can devise some concrete way, but it is better first we try > to understand why it failed, otherwise there is always a chance that > we will repeat the mistake in some other case. I think we have no > other choice, but to request the buildfarm owners to either give us > the access to see what happens or help us in investigating the > problem. The four buildfarms where it failed were lapwing, locust, > dromedary, prairiedog. Among these, the owner of last two is Tom > Lane and others I don't recognize. Tom, Andrew, can you help us in > getting the access of one of those four? Yet another alternative is > the owner can apply the patch attached (this is same what got > committed) or reset to commit ac88d2962a and execute below statements > and share the results: > > CREATE EXTENSION pageinspect; > > CREATE TABLE test_rel_forks (a int); > INSERT INTO test_rel_forks SELECT i from generate_series(1,1000) i; > VACUUM test_rel_forks; > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 0)) AS main_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'main', 100)) AS main_100; > > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 0)) AS fsm_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'fsm', 10)) AS fsm_10; > > SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 0)) AS vm_0; > SELECT octet_length(get_raw_page('test_rel_forks', 'vm', 1)) AS vm_1; > > If the above statements give error: "ERROR: could not open file ...", then run: > Analyze test_rel_forks; > Select oid, relname, relpages, reltuples from pg_class where relname > like 'test%'; > > The result of the above tests will tell us whether there are 5 pages > in the table or not. If the table contains 5 pages and throws an > error, then there is some bug in our code, otherwise, there is > something specific to those systems where the above insert doesn't > result in 5 pages. I'd suspect the alignment of integer. 
In my environment, the tuple's actual size is 28 bytes but the aligned size is 32 bytes (= MAXALIGN(28)), so we can store 226 tuples in a single page. But if MAXALIGN(28) = 28, then we can store 255 tuples, and 1000 tuples fit within 4 pages. The MAXALIGN of the four buildfarms seems to be 4 according to the configure script, so MAXALIGN(28) might be 28 on these buildfarms.
configure:16816: checking alignment of short
configure:16839: result: 2
configure:16851: checking alignment of int
configure:16874: result: 4
configure:16886: checking alignment of long
configure:16909: result: 4
configure:16922: checking alignment of long long int
configure:16945: result: 4
Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
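That arithmetic is easy to double-check with a few lines of C. The 24-byte page header and 4-byte line pointers are the standard page-layout numbers and are assumptions of this sketch, as is the 28-byte tuple size quoted above.

/*
 * Back-of-the-envelope check of the alignment math above -- not
 * PostgreSQL code.  Assumes an 8192-byte page with a 24-byte page header,
 * a 4-byte line pointer per tuple, and the 28-byte tuple mentioned above.
 */
#include <stdio.h>

#define BLCKSZ       8192
#define PAGE_HEADER  24
#define ITEMID_SIZE  4
#define TUPLE_SIZE   28

static int
tuples_per_page(int maxalign)
{
	int			aligned = ((TUPLE_SIZE + maxalign - 1) / maxalign) * maxalign;

	return (BLCKSZ - PAGE_HEADER) / (aligned + ITEMID_SIZE);
}

int
main(void)
{
	int			n8 = tuples_per_page(8);	/* MAXALIGN = 8 */
	int			n4 = tuples_per_page(4);	/* MAXALIGN = 4 */

	/* Prints 226 and 255; 1000 rows then need 5 pages vs. 4 pages. */
	printf("MAXALIGN 8: %d tuples/page, %d pages for 1000 rows\n",
		   n8, (1000 + n8 - 1) / n8);
	printf("MAXALIGN 4: %d tuples/page, %d pages for 1000 rows\n",
		   n4, (1000 + n4 - 1) / n4);
	return 0;
}

With MAXALIGN 8 the heap needs 5 pages for 1000 rows and crosses the threshold, while with MAXALIGN 4 it stays at 4 pages, which would explain why the FSM never appeared on those members.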
On Tue, Jan 29, 2019 at 5:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 9:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I'd suspect the alignment of integer. In my environemnt, the tuple > actual size is 28 bytes but the aligned size is 32 bytes (= > MAXALIGN(28)), so we can store 226 tuples to single page. But if > MAXALIGN(28) = 28 then we can store 255 tuples and 1000 tuples fits > within 4 pages. The MAXALIGN of four buildfarms seem 4 accroding to > the configure script so MAXALIGN(28) might be 28 on these buildfarms. > Good finding. I was also wondering along these lines and wanted to verify. Thanks a lot. So, this clearly states why we have a second failure in my email above [1]. I think this means for the fsm test also we have to be careful when relying on the number of pages in the test. I think now we have found the reasons and solutions for the first three problems mentioned in my email [1]. For the problem-4 (Failure on jacana:), I have speculated some theory, but not sure how can we confirm? Can we try the patch on Jacana before considering the patch for commit? Is there any other way we can replicate that error? [1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > You can find this change in attached patch. Then, I ran the make > check in src/bin/pgbench multiple times using test_conc_insert.sh. > You can vary the number of times the test should run, if you are not > able to reproduce it with this. > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > patch fixes the issue for me. Kindly verify the same. I got one failure in 50 runs. With the new patch, I didn't get any failures in 300 runs. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > You can find this change in attached patch. Then, I ran the make > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > You can vary the number of times the test should run, if you are not > > able to reproduce it with this. > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > patch fixes the issue for me. Kindly verify the same. > > I got one failure in 50 runs. With the new patch, I didn't get any > failures in 300 runs. > Thanks for verification. I have included it in the attached patch and I have also modified the page.sql test to have enough number of pages in relation so that FSM will get created irrespective of alignment boundaries. Masahiko San, can you verify if this now works for you? There are two more failures which we need to something about. 1. Make fsm.sql independent of vacuum without much losing on coverage of newly added code. John, I guess you have an idea, see if you can take care of it, otherwise, I will see what I can do for it. 2. I still could not figure out how to verify if the failure on Jacana will be fixed. I have posted some theory above and the attached patch has a solution for it, but I think it would be better if find out some way to verify the same. Note - you might see some cosmetic changes in freespace.c due to pgindent. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > There are two more failures which we need to something about. > 1. Make fsm.sql independent of vacuum without much losing on coverage > of newly added code. John, I guess you have an idea, see if you can > take care of it, otherwise, I will see what I can do for it. I've attached a patch that applies on top of v19 that uses Andrew Gierth's idea to use fillfactor to control free space. I've also removed tests that relied on truncation and weren't very useful to begin with. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Jan 30, 2019 at 3:26 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > There are two more failures which we need to something about. > > 1. Make fsm.sql independent of vacuum without much losing on coverage > > of newly added code. John, I guess you have an idea, see if you can > > take care of it, otherwise, I will see what I can do for it. > > I've attached a patch that applies on top of v19 that uses Andrew > Gierth's idea to use fillfactor to control free space. I've also > removed tests that relied on truncation and weren't very useful to > begin with. > This is much better than the earlier version of test and there is no dependency on the vacuum. However, I feel still there is some dependency on how the rows will fit in a page and we have seen some related failures due to alignment stuff. By looking at the test, I can't envision any such problem, but how about if we just write some simple tests where we can check that the FSM won't be created for very small number of records say one or two and then when we increase the records FSM gets created, here if we want, we can even use vacuum to ensure FSM gets created. Once we are sure that the main patch passes all the buildfarm tests, we can extend the test to something advanced as you are proposing now. I think that will reduce the chances of failure, what do you think? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > This is much better than the earlier version of test and there is no > dependency on the vacuum. However, I feel still there is some > dependency on how the rows will fit in a page and we have seen some > related failures due to alignment stuff. By looking at the test, I > can't envision any such problem, but how about if we just write some > simple tests where we can check that the FSM won't be created for very > small number of records say one or two and then when we increase the > records FSM gets created, here if we want, we can even use vacuum to > ensure FSM gets created. Once we are sure that the main patch passes > all the buildfarm tests, we can extend the test to something advanced > as you are proposing now. I think that will reduce the chances of > failure, what do you think? That's probably a good idea to limit risk. I just very basic tests now, and vacuum before every relation size check to make sure any FSM extension (whether desired or not) is invoked. Also, in my last patch I forgot to implement explicit checks of the block number instead of assuming how many rows will fit on a page. I've used a plpgsql code block to do this. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > You can find this change in attached patch. Then, I ran the make > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > You can vary the number of times the test should run, if you are not > > > able to reproduce it with this. > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > patch fixes the issue for me. Kindly verify the same. > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > failures in 300 runs. > > > > Thanks for verification. I have included it in the attached patch and > I have also modified the page.sql test to have enough number of pages > in relation so that FSM will get created irrespective of alignment > boundaries. Masahiko San, can you verify if this now works for you? > Thank you for updating the patch! The modified page.sql test could fail if the block size is more than 8kB? We can ensure the number of pages are more than 4 by checking it and adding more data if no enough but I'm really not sure we should care the bigger-block size cases. However maybe it's good to check the number of pages after insertion so that we can break down the issue in case the test failed again. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Jan 30, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This is much better than the earlier version of test and there is no > > dependency on the vacuum. However, I feel still there is some > > dependency on how the rows will fit in a page and we have seen some > > related failures due to alignment stuff. By looking at the test, I > > can't envision any such problem, but how about if we just write some > > simple tests where we can check that the FSM won't be created for very > > small number of records say one or two and then when we increase the > > records FSM gets created, here if we want, we can even use vacuum to > > ensure FSM gets created. Once we are sure that the main patch passes > > all the buildfarm tests, we can extend the test to something advanced > > as you are proposing now. I think that will reduce the chances of > > failure, what do you think? > > That's probably a good idea to limit risk. I just very basic tests > now, and vacuum before every relation size check to make sure any FSM > extension (whether desired or not) is invoked. Also, in my last patch > I forgot to implement explicit checks of the block number instead of > assuming how many rows will fit on a page. I've used a plpgsql code > block to do this. > -- Extend table with enough blocks to exceed the FSM threshold -- FSM is created and extended to 3 blocks The second comment line seems redundant to me, so I have removed that and integrated it in the main patch. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > You can find this change in attached patch. Then, I ran the make > > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > > You can vary the number of times the test should run, if you are not > > > > able to reproduce it with this. > > > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > > patch fixes the issue for me. Kindly verify the same. > > > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > > failures in 300 runs. > > > > > > > Thanks for verification. I have included it in the attached patch and > > I have also modified the page.sql test to have enough number of pages > > in relation so that FSM will get created irrespective of alignment > > boundaries. Masahiko San, can you verify if this now works for you? > > > > Thank you for updating the patch! > > The modified page.sql test could fail if the block size is more than > 8kB? That's right, but I don't think current regression tests will work for block size greater than 8KB. I have tried with 16 and 32 as block size, there were few failures on the head itself. > We can ensure the number of pages are more than 4 by checking it > and adding more data if no enough but I'm really not sure we should > care the bigger-block size cases. > Yeah, I am not sure either. I think as this is an existing test, we should not try to change it too much. However, if both you and John feel it is better to change, we can go with that. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > That's probably a good idea to limit risk. I just very basic tests > > now, and vacuum before every relation size check to make sure any FSM > > extension (whether desired or not) is invoked. Also, in my last patch > > I forgot to implement explicit checks of the block number instead of > > assuming how many rows will fit on a page. I've used a plpgsql code > > block to do this. > > > > -- Extend table with enough blocks to exceed the FSM threshold > -- FSM is created and extended to 3 blocks > > The second comment line seems redundant to me, so I have removed that > and integrated it in the main patch. FYI, the second comment is still present in v20. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > The modified page.sql test could fail if the block size is more than > > 8kB? > > That's right, but I don't think current regression tests will work for > block size greater than 8KB. I have tried with 16 and 32 as block > size, there were few failures on the head itself. > > > We can ensure the number of pages are more than 4 by checking it > > and adding more data if no enough but I'm really not sure we should > > care the bigger-block size cases. > > > > Yeah, I am not sure either. I think as this is an existing test, we > should not try to change it too much. However, if both you and John > feel it is better to change, we can go with that. I have an idea -- instead of adding a bunch of records and hoping that the relation size and free space is consistent across platforms, how about we revert to the original test input, and add a BRIN index? That should have a FSM even with one record. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 6:37 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 8:11 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > That's probably a good idea to limit risk. I just very basic tests > > > now, and vacuum before every relation size check to make sure any FSM > > > extension (whether desired or not) is invoked. Also, in my last patch > > > I forgot to implement explicit checks of the block number instead of > > > assuming how many rows will fit on a page. I've used a plpgsql code > > > block to do this. > > > > > > > -- Extend table with enough blocks to exceed the FSM threshold > > -- FSM is created and extended to 3 blocks > > > > The second comment line seems redundant to me, so I have removed that > > and integrated it in the main patch. > > FYI, the second comment is still present in v20. > oops, forgot to include in commit after making a change, done now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Jan 31, 2019 at 2:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > The modified page.sql test could fail if the block size is more than > > > 8kB? > > > > That's right, but I don't think current regression tests will work for > > block size greater than 8KB. I have tried with 16 and 32 as block > > size, there were few failures on the head itself. > > > > > We can ensure the number of pages are more than 4 by checking it > > > and adding more data if no enough but I'm really not sure we should > > > care the bigger-block size cases. > > > > > > > Yeah, I am not sure either. I think as this is an existing test, we > > should not try to change it too much. However, if both you and John > > feel it is better to change, we can go with that. > > I have an idea -- instead of adding a bunch of records and hoping that > the relation size and free space is consistent across platforms, how > about we revert to the original test input, and add a BRIN index? That > should have a FSM even with one record. > Why would BRIN index allow having FSM for heap relation? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have an idea -- instead of adding a bunch of records and hoping that > > the relation size and free space is consistent across platforms, how > > about we revert to the original test input, and add a BRIN index? That > > should have a FSM even with one record. > > > > Why would BRIN index allow having FSM for heap relation? Oops, I forgot this file is for testing heaps only. That said, we could possibly put most of the FSM tests such as SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); into brin.sql since we know a non-empty BRIN index will have a FSM. And in page.sql we could just have a test that the table has no FSM. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > I have an idea -- instead of adding a bunch of records and hoping that > > > the relation size and free space is consistent across platforms, how > > > about we revert to the original test input, and add a BRIN index? That > > > should have a FSM even with one record. > > > > > > > Why would BRIN index allow having FSM for heap relation? > > Oops, I forgot this file is for testing heaps only. That said, we > could possibly put most of the FSM tests such as > > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > > into brin.sql since we know a non-empty BRIN index will have a FSM. As in the attached. Applies on top of v20. First to revert to HEAD, second to move FSM tests to brin.sql. This is a much less invasive and more readable patch, in addition to being hopefully more portable. > And in page.sql we could just have a test that the table has no FSM. This is not possible, since we don't know the relfilenode for the error text, and it's not important. Better to have everything in brin.sql. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Jan 31, 2019 at 6:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Jan 30, 2019 at 10:41 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Wed, Jan 30, 2019 at 4:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Jan 29, 2019 at 8:12 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > > > On Tue, Jan 29, 2019 at 11:56 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > You can find this change in attached patch. Then, I ran the make > > > > > check in src/bin/pgbench multiple times using test_conc_insert.sh. > > > > > You can vary the number of times the test should run, if you are not > > > > > able to reproduce it with this. > > > > > > > > > > The attached patch (clear_local_map_if_exists_1.patch) atop the main > > > > > patch fixes the issue for me. Kindly verify the same. > > > > > > > > I got one failure in 50 runs. With the new patch, I didn't get any > > > > failures in 300 runs. > > > > > > > > > > Thanks for verification. I have included it in the attached patch and > > > I have also modified the page.sql test to have enough number of pages > > > in relation so that FSM will get created irrespective of alignment > > > boundaries. Masahiko San, can you verify if this now works for you? > > > > > > > Thank you for updating the patch! > > > > The modified page.sql test could fail if the block size is more than > > 8kB? > > That's right, but I don't think current regression tests will work for > block size greater than 8KB. I have tried with 16 and 32 as block > size, there were few failures on the head itself. Understood. That means that no build farm configures other block size than 8kB. > > > We can ensure the number of pages are more than 4 by checking it > > and adding more data if no enough but I'm really not sure we should > > care the bigger-block size cases. > > > > Yeah, I am not sure either. I think as this is an existing test, we > should not try to change it too much. However, if both you and John > feel it is better to change, we can go with that. > So I think the patch you proposed looks good to me but how about adding the check whether the table is more than 4 pages? For example, SELECT (pg_relation_size('test_rel_forks') / current_setting('block_size')::int) > 4; Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Jan 31, 2019 at 7:23 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 1:52 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Jan 31, 2019 at 1:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I have an idea -- instead of adding a bunch of records and hoping that > > > > the relation size and free space is consistent across platforms, how > > > > about we revert to the original test input, and add a BRIN index? That > > > > should have a FSM even with one record. > > > > > > > > > > Why would BRIN index allow having FSM for heap relation? > > > > Oops, I forgot this file is for testing heaps only. That said, we > > could possibly put most of the FSM tests such as > > > > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > > > > into brin.sql since we know a non-empty BRIN index will have a FSM. > > As in the attached. Applies on top of v20. First to revert to HEAD, > second to move FSM tests to brin.sql. This is a much less invasive and > more readable patch, in addition to being hopefully more portable. > I don't think that moving fsm tests to brin would be a good approach. We want to have a separate test for each access method. I think if we want to do something to avoid portability issues, maybe we can do what Masahiko San has just suggested. OTOH, I think we are just good w.r.t this issue with the last patch I sent. I think unless we see some problem here, we should put energy into having a reproducible test for the fourth problem mentioned in my mail up thread [1]. Do you think it makes sense to run make check in loop for multiple times or do you have any idea how we can have a reproducible test? [1] - https://www.postgresql.org/message-id/CAA4eK1L%3DqWp_bJ5aTc9%2Bfy4Ewx2LPaLWY-RbR4a60g_rupCKnQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't think that moving fsm tests to brin would be a good approach. > We want to have a separate test for each access method. I think if we > want to do something to avoid portability issues, maybe we can do what > Masahiko San has just suggested. We could also use the same plpgsql loop as in fsm.sql to check the ctid, right? > OTOH, I think we are just good w.r.t > this issue with the last patch I sent. I think unless we see some > problem here, we should put energy into having a reproducible test for > the fourth problem mentioned in my mail up thread [1]. Do you think > it makes sense to run make check in loop for multiple times or do you > have any idea how we can have a reproducible test? Okay. Earlier I tried running make installcheck with force_parallel_mode='regress', but didn't get a failure. I may not have run enough times, though. I'll have to think about how to induce it. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jan 31, 2019 at 9:18 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Jan 31, 2019 at 4:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I don't think that moving fsm tests to brin would be a good approach. > > We want to have a separate test for each access method. I think if we > > want to do something to avoid portability issues, maybe we can do what > > Masahiko San has just suggested. > > We could also use the same plpgsql loop as in fsm.sql to check the ctid, right? > Yes, however, I feel we should leave it as it is for now unless we see any risk of portability issues. The only reason to do that way is to avoid any failure for bigger block size (say BLCKSZ is 16KB or 32KB). Does anyone else have any opinion on whether we should try to write tests which should care for bigger block size? I see that existing regression tests fail if we configure with bigger block size, so not sure if we should try to avoid that here. In an ideal scenario, I think it would be good if we can write tests which pass on all kind of block sizes, that will make the life easier if tomorrow one wants to set up a buildfarm or do the testing for bigger block sizes. > > OTOH, I think we are just good w.r.t > > this issue with the last patch I sent. I think unless we see some > > problem here, we should put energy into having a reproducible test for > > the fourth problem mentioned in my mail up thread [1]. Do you think > > it makes sense to run make check in loop for multiple times or do you > > have any idea how we can have a reproducible test? > > Okay. Earlier I tried running make installcheck with > force_parallel_mode='regress', but didn't get a failure. > AFAICS, this failure was not for force_parallel_mode='regress'. See the config at [1]. > I may not > have run enough times, though. > Yeah, probably running make check or make installcheck many times would help, but not sure. > I'll have to think about how to induce > it. > Thanks! [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-01-28%2004%3A00%3A23 -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Jan 28, 2019 at 4:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > There are a few buildfarm failures due to this commit, see my email on > > > pgsql-committers. If you have time, you can also once look into > > > those. > > > > I didn't see anything in common with the configs of the failed > > members. None have a non-default BLCKSZ that I can see. > > > > I have done an analysis of the different failures on buildfarm. > In the past few days, we have done a further analysis of each problem and tried to reproduce it. We are successful in generating some form of reproducer for 3 out of 4 problems in the same way as it was failed in the buildfarm. For the fourth symptom, we have tried a lot (even Andrew Dunstan has helped us to run the regression tests with the faulty commit on Jacana for many hours, but it didn't got reproduced) but not able to regenerate a failure in a similar way. However, I have a theory as mentioned below why the particular test could fail and the fix for the same is done in the patch. I am planning to push the latest version of the patch [1] which has fixes for all the symptoms. Does anybody have any opinion here? > > 4. Failure on jacana: > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out > 2018-09-26 > 17:53:33 -0400 > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out > 2019-01-27 23:14:35 > -0500 > @@ -252,332 +252,7 @@ > ('(0,100)(0,infinity)'), > ('(-infinity,0)(0,infinity)'), > ('(-infinity,-infinity)(infinity,infinity)'); > -SET enable_seqscan = false; > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; > .. > .. > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", > Line: > 1118) > .. > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process > (PID 14388) exited with exit code 3 > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process > was running: INSERT INTO box_temp > VALUES (NULL), > > I think the reason for this failure is same as previous (as mentioned > in point-3), but this can happen in a different way. Say, we have > searched the local map and then try to extend a relation 'X' and in > the meantime, another backend has extended such that it creates FSM. > Now, we will reuse that page and won't clear local map. Now, say we > try to insert in relation 'Y' which doesn't have FSM. It will try to > set the local map and will find that it already exists, so will fail. > Now, the question is how it can happen in this box.sql test. I guess > that is happening for some system table which is being populated by > Create Index statement executed just before the failing Insert. > > I think both 3 and 4 are timing issues, so we didn't got in our local > regression runs. > [1] - https://www.postgresql.org/message-id/CAA4eK1%2B3ajhRPC0jvUi6p_aMrTUpB568OBH10LrbHtvOLNTgqQ%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 2, 2019 at 7:30 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Jan 28, 2019 at 4:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jan 28, 2019 at 10:03 AM John Naylor > > <john.naylor@2ndquadrant.com> wrote: > > In the past few days, we have done a further analysis of each problem > and tried to reproduce it. We are successful in generating some form > of reproducer for 3 out of 4 problems in the same way as it was failed > in the buildfarm. For the fourth symptom, we have tried a lot (even > Andrew Dunstan has helped us to run the regression tests with the > faulty commit on Jacana for many hours, but it didn't got reproduced) > but not able to regenerate a failure in a similar way. However, I > have a theory as mentioned below why the particular test could fail > and the fix for the same is done in the patch. I am planning to push > the latest version of the patch [1] which has fixes for all the > symptoms. > Today, I have spent some more time to generate a test which can reproduce the failure though it is with the help of breakpoints. See below: > > > > 4. Failure on jacana: > > --- c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/test/regress/expected/box.out > > 2018-09-26 > > 17:53:33 -0400 > > +++ c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/src/test/regress/results/box.out > > 2019-01-27 23:14:35 > > -0500 > > @@ -252,332 +252,7 @@ > > ('(0,100)(0,infinity)'), > > ('(-infinity,0)(0,infinity)'), > > ('(-infinity,-infinity)(infinity,infinity)'); > > -SET enable_seqscan = false; > > -SELECT * FROM box_temp WHERE f1 << '(10,20),(30,40)'; > > .. > > .. > > TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: > > "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/storage/freespace/freespace.c", > > Line: > > 1118) > > .. > > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:4] LOG: server process > > (PID 14388) exited with exit code 3 > > 2019-01-27 23:14:35.495 EST [5c4e81a0.2e28:5] DETAIL: Failed process > > was running: INSERT INTO box_temp > > VALUES (NULL), > > > > I think the reason for this failure is same as previous (as mentioned > > in point-3), but this can happen in a different way. Say, we have > > searched the local map and then try to extend a relation 'X' and in > > the meantime, another backend has extended such that it creates FSM. > > Now, we will reuse that page and won't clear local map. Now, say we > > try to insert in relation 'Y' which doesn't have FSM. It will try to > > set the local map and will find that it already exists, so will fail. > > Now, the question is how it can happen in this box.sql test. I guess > > that is happening for some system table which is being populated by > > Create Index statement executed just before the failing Insert. > > Based on the above theory, the test is as below: Session-1 --------------- postgres=# create table test_1(c1 int, c2 char(1500)); CREATE TABLE postgres=# create table test_2(c1 int, c2 char(1500)); CREATE TABLE postgres=# insert into test_1 values(generate_series(1,20),'aaaa'); INSERT 0 20 postgres=# insert into test_2 values(1,'aaaa'); INSERT 0 1 Session-2 ---------------- postgres=# analyze test_1; ANALYZE postgres=# analyze test_2; ANALYZE postgres=# select oid, relname, relpages from pg_class where relname like 'test%'; oid | relname | relpages -------+----------------+---------- 41835 | test_1 | 4 41838 | test_2 | 1 Till here we can see that test_1 has 4 pages and test2_1 has 1 page. 
At this stage, even one more record insertion in test_1 will create a new page. So, now we have to hit the scenario with the help of debugger. For session-1, attach debugger and put breakpoint in RelationGetBufferForTuple(). Session-1 ---------------- postgres=# insert into test_1 values(21,'aaaa'); It will hit the breakpoint in RelationGetBufferForTuple(). Now, add one more breakpoint on line 542 in hio.c, aka below line: RelationGetBufferForTuple { .. else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock)) Press continue and stop the debugger at line 542. Attach the debugger for session-2 and add a breakpoint on line 580 in hio.c, aka below line: RelationGetBufferForTuple { .. buffer = ReadBufferBI(relation, P_NEW, bistate); Session-2 --------------- postgres=# insert into test_1 values(22,'aaaa'); It will hit the breakpoint in RelationGetBufferForTuple(). Now proceed with debugging on session-1 by one step. This is to ensure that session-1 doesn't get a conditional lock. Now, continue the debugger in session-2 and after that run vacuum test_1 from session-2, this will ensure that FSM is created for relation test_1. So the state of session-2 will be as below: Session-2 ---------------- postgres=# insert into test_1 values(22,'aaaa'); INSERT 0 1 postgres=# vacuum test_1; VACUUM postgres=# select oid, relname, relpages from pg_class where relname like 'test%'; oid | relname | relpages -------+----------------+---------- 41835 | test_1 | 5 41838 | test_2 | 1 Continue the debugger in session-1. Now insert one row in test_2 from session-1 and kaboom: Session-1 --------------- postgres=# insert into test_2 values(2,'aaaa'); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> Server logs is as below: TRAP: FailedAssertion("!(!(fsm_local_map.nblocks > 0))", File: "..\postgresql\src\backend\storage\freespace\freespace.c", Line: 1118) 2019-02-02 16:48:06.216 IST [4044] LOG: server process (PID 3540) exited with exit code 3 2019-02-02 16:48:06.216 IST [4044] DETAIL: Failed process was running: insert into test_2 values(2,'aaaa'); This looks exactly the same as the failure in Jacana. The patch fixes this case. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 31, 2019 at 2:02 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > FYI, the second comment is still present in v20. > > > > oops, forgot to include in commit after making a change, done now. > This doesn't get applied cleanly after recent commit 0d1fe9f74e. Attached is a rebased version. I have checked once that the changes done by 0d1fe9f74e don't impact this patch. John, see if you can also once confirm whether the recent commit (0d1fe9f74e) has any impact. I am planning to push this tomorrow morning (IST) unless you or anyone see any problem with this. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > Attached is a rebased version. I have checked once that the changes > done by 0d1fe9f74e don't impact this patch. John, see if you can also > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > am planning to push this tomorrow morning (IST) unless you or anyone > see any problem with this. Since that commit changes RelationAddExtraBlocks(), which can be induced by your pgbench adjustment upthread, I ran make check with that adjustment in the pgbench dir 300 times without triggering asserts. I also tried to follow the logic in 0d1fe9f74e, and I believe it will be correct without a FSM. [1] https://www.postgresql.org/message-id/CAA4eK1KRByXY03qR2JvUjUxKBzpBnCSO5H19oAC%3D_v4r5dzTwQ%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > > Attached is a rebased version. I have checked once that the changes > > done by 0d1fe9f74e don't impact this patch. John, see if you can also > > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > > am planning to push this tomorrow morning (IST) unless you or anyone > > see any problem with this. > > Since that commit changes RelationAddExtraBlocks(), which can be > induces by your pgbench adjustment upthread, I ran make check with > that adjustment in the pgbench dir 300 times without triggering > asserts. > > I also tried to follow the logic in 0d1fe9f74e, and I believe it will > be correct without a FSM. > I have just pushed it and buildfarm has shown two failures: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26 --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out 2019-02-03 21:27:29.000000000 -0500 +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out 2019-02-03 21:41:32.000000000 -0500 @@ -38,19 +38,19 @@ SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); fsm_page_contents ------------------- - 0: 39 + - 1: 39 + - 3: 39 + - 7: 39 + - 15: 39 + - 31: 39 + - 63: 39 + .. This one seems to be FSM test portability issue (due to different page contents, maybe). Looking into it, John, see if you are around and have some thoughts on it. 2. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2019-02-04%2002%3A30%3A25 select explain_parallel_append('execute ab_q5 (33, 44, 55)'); - explain_parallel_append -------------------------------------------------------------------------------- - Finalize Aggregate (actual rows=1 loops=1) - -> Gather (actual rows=3 loops=1) - Workers Planned: 2 - Workers Launched: 2 - -> Partial Aggregate (actual rows=1 loops=3) - -> Parallel Append (actual rows=0 loops=N) - Subplans Removed: 8 - -> Parallel Seq Scan on ab_a1_b1 (never executed) - Filter: ((b < 4) AND (a = ANY (ARRAY[$1, $2, $3]))) -(9 rows) - +ERROR: lost connection to parallel worker +CONTEXT: PL/pgSQL function explain_parallel_append(text) line 5 at FOR over EXECUTE statement -- Test Parallel Append with PARAM_EXEC Params select explain_parallel_append('select count(*) from ab where (a = (select 1) or a = (select 3)) and b = 2'); explain_parallel_append Failure is something like: 2019-02-03 21:44:42.456 EST [2812:327] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (1, 1, 1)'); 2019-02-03 21:44:42.493 EST [2812:328] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (2, 3, 3)'); 2019-02-03 21:44:42.531 EST [2812:329] pg_regress/partition_prune LOG: statement: select explain_parallel_append('execute ab_q5 (33, 44, 55)'); 2019-02-04 02:44:42.552 GMT [4172] FATAL: could not reattach to shared memory (key=00000000000001B4, addr=0000000001980000): error code 487 2019-02-03 21:44:42.555 EST [5116:6] LOG: background worker "parallel worker" (PID 4172) exited with exit code 1 2019-02-03 21:44:42.560 EST [2812:330] pg_regress/partition_prune ERROR: lost connection to parallel worker I don't think this is related to this commit. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 12:39 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Sun, Feb 3, 2019 at 2:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Thu, Jan 31, 2019 at 6:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > This doesn't get applied cleanly after recent commit 0d1fe9f74e. > > > Attached is a rebased version. I have checked once that the changes > > > done by 0d1fe9f74e don't impact this patch. John, see if you can also > > > once confirm whether the recent commit (0d1fe9f74e) has any impact. I > > > am planning to push this tomorrow morning (IST) unless you or anyone > > > see any problem with this. > > > > Since that commit changes RelationAddExtraBlocks(), which can be > > induces by your pgbench adjustment upthread, I ran make check with > > that adjustment in the pgbench dir 300 times without triggering > > asserts. > > > > I also tried to follow the logic in 0d1fe9f74e, and I believe it will > > be correct without a FSM. > > > > I have just pushed it and buildfarm has shown two failures: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dromedary&dt=2019-02-04%2002%3A27%3A26 > > --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/expected/page.out > 2019-02-03 21:27:29.000000000 -0500 > +++ /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/pageinspect/results/page.out > 2019-02-03 21:41:32.000000000 -0500 > @@ -38,19 +38,19 @@ > SELECT * FROM fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0)); > fsm_page_contents > ------------------- > - 0: 39 + > - 1: 39 + > - 3: 39 + > - 7: 39 + > - 15: 39 + > - 31: 39 + > - 63: 39 + > .. > > This one seems to be FSM test portability issue (due to different page > contents, maybe). Looking into it, John, see if you are around and > have some thoughts on it. > One more similar failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01 So, basically, this is due to difference in the number of tuples that can fit on a page. The freespace in FSM for the page is shown different because of available space on a particular page. This can vary due to alignment. It seems to me we can't rely on FSM contents if there are many tuples in a relation. One idea is to get rid of dependency on FSM contents in this test, can you think of any better way to have consistent FSM contents across different platforms? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 9:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Mon, Feb 4, 2019 at 8:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > One more similar failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lapwing&dt=2019-02-04%2003%3A20%3A01 > > So, basically, this is due to difference in the number of tuples that > can fit on a page. The freespace in FSM for the page is shown > different because of available space on a particular page. This can > vary due to alignment. It seems to me we can't rely on FSM contents > if there are many tuples in a relation. One idea is to get rid of > dependency on FSM contents in this test, can you think of any better > way to have consistent FSM contents across different platforms? > One more idea could be that we devise a test (say with a char/varchar) such that it always consumes the same space in a page irrespective of its alignment. Yet another way could be to use explain (costs off, analyze on, timing off, summary off) ..., which will ensure that we have test coverage for the function fsm_page_contents but don't rely on its contents. What do you think? I will go with the last option to stabilize the buildfarm tests unless anyone thinks otherwise or has a better idea. I will probably wait for 20 minutes or so to see if anyone has inputs. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
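For illustration, the explain-based idea could look roughly like this (a sketch only; the wording actually pushed to the pageinspect test may differ):

explain (costs off, analyze on, timing off, summary off)
SELECT fsm_page_contents(get_raw_page('test_rel_forks', 'fsm', 0));

This still exercises fsm_page_contents() while keeping the platform-dependent page contents out of the expected output.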
On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > This one seems to be FSM test portability issue (due to different page > contents, maybe). Looking into it, John, see if you are around and > have some thoughts on it. Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 tuple has been inserted into the 5th page. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
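For reference, a rough sketch of that kind of loop (illustrative only; the exact filler values and exit condition used in fsm.sql may differ):

DO $$
DECLARE
  curtid tid;
  num int := 0;
BEGIN
  LOOP
    num := num + 1;
    INSERT INTO fsm_check_size VALUES (num, 'b') RETURNING ctid INTO curtid;
    -- the block number is the first component of the ctid; stop once a
    -- tuple lands on the 5th page (block 4)
    EXIT WHEN (curtid::text::point)[0] >= 4;
  END LOOP;
END;
$$;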
On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > This one seems to be FSM test portability issue (due to different page > > contents, maybe). Looking into it, John, see if you are around and > > have some thoughts on it. > > Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 > tuple has been inserted into the 5th page. > Yeah, that can also work, but we still need to be careful about the alignment of that one tuple, otherwise there could be different free space on the fifth page. An easier way could probably be to use an even number of integers in the table, say (int, int). Anyway, for now, I have avoided the dependency on FSM contents without losing test coverage. I have pushed my latest suggestion in the previous email. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 10:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 10:18 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Mon, Feb 4, 2019 at 4:17 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > This one seems to be FSM test portability issue (due to different page > > > contents, maybe). Looking into it, John, see if you are around and > > > have some thoughts on it. > > > > Maybe we can use the same plpgsql loop as fsm.sql, which exits after 1 > > tuple has been inserted into the 5th page. > > > > Yeah, that can also work, but we still need to be careful about the > alignment of that one tuple, otherwise there could be different > free space on the fifth page. An easier way could probably be to use > an even number of integers in the table, say (int, int). Anyway, for > now, I have avoided the dependency on FSM contents without losing > test coverage. I have pushed my latest suggestion in the previous > email. > The change seems to have worked. All the buildfarm machines that were showing the failure are passing now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Yeah, that can also work, but we still need to be careful about the > > alignment of that one tuple, otherwise there could be different > > free space on the fifth page. An easier way could probably be to use > > an even number of integers in the table, say (int, int). Anyway, for > > now, I have avoided the dependency on FSM contents without losing > > test coverage. I have pushed my latest suggestion in the previous > > email. > > > > The change seems to have worked. All the buildfarm machines that were > > showing the failure are passing now. Excellent! Now that the buildfarm is green as far as this patch goes, I will touch on a few details to round out development in this area: 1. Earlier, I had a test to ensure that free space towards the front of the relation was visible with no FSM. In [1], I rewrote it without using vacuum, so we can consider adding it back now if desired. I can prepare a patch for this. 2. As a follow-on, since we don't rely on vacuum to remove dead rows, we could try putting the fsm.sql test in some existing group in the parallel schedule, rather than its own group as it is now. 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this patch's effects on GetRecordedFreeSpace(), which will return zero for tables with no FSM. The other callers are in: contrib/pg_freespacemap/pg_freespacemap.c contrib/pgstattuple/pgstatapprox.c For pg_freespacemap, this doesn't matter, since it's just reporting the facts. For pgstattuple_approx(), it might under-estimate the free space and over-estimate the number of live tuples. This might be fine, since it is approximate after all, but maybe a comment would be helpful. If this is a problem, we could tweak it to be more precise for tables without FSMs. Thoughts? 4. The latest patch for the pg_upgrade piece was in [2] Anything else? [1] https://www.postgresql.org/message-id/CACPNZCvEXLUx10pFvNcOs88RvqemMEjOv7D9MhL3ac86EzjAOA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
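To make point 3 concrete, a hypothetical session (table name and contents are invented for illustration) showing what the pg_freespacemap extension reports for a heap that is still below the threshold:

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
CREATE TABLE tiny_tbl (a int);
INSERT INTO tiny_tbl SELECT generate_series(1, 10);
-- With no FSM fork, GetRecordedFreeSpace() reports 0 for every block,
-- even though the single page is mostly empty:
SELECT * FROM pg_freespace('tiny_tbl');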
On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote:
>
> On Mon, Feb 4, 2019 at 8:41 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > The change seems to have worked. All the buildfarm machines that were
> > showing the failure are passing now.
>
> Excellent!
>
> Now that the buildfarm is green as far as this patch goes,
>
There is still one recent failure which I don't think is related to the commit of this patch:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-02-04%2016%3A38%3A48
================== pgsql.build/src/bin/pg_ctl/tmp_check/log/004_logrotate_primary.log ===================
TRAP: FailedAssertion("!(UsedShmemSegAddr != ((void *)0))", File: "g:\prog\bf\root\head\pgsql.build\src\backend\port\win32_shmem.c", Line: 513)
I think we need to do something about this random failure, but not as part of this thread/patch.
> I will
> touch on a few details to round out development in this area:
>
> 1. Earlier, I had a test to ensure that free space towards the front
> of the relation was visible with no FSM. In [1], I rewrote it without
> using vacuum, so we can consider adding it back now if desired. I can
> prepare a patch for this.
>
Yes, this is required. It is generally good practice to add a test (unless it takes a lot of time) which covers new code/functionality.
> 2. As a follow-on, since we don't rely on vacuum to remove dead rows,
> we could try putting the fsm.sql test in some existing group in the
> parallel schedule, rather than its own group as it is now.
>
+1.
> 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this
> patch's effects on GetRecordedFreeSpace(), which will return zero for
> tables with no FSM.
>
Right, but what exactly do we want to do about it? Do you want to add a comment atop this function?
> The other callers are in:
> contrib/pg_freespacemap/pg_freespacemap.c
> contrib/pgstattuple/pgstatapprox.c
>
> For pg_freespacemap, this doesn't matter, since it's just reporting
> the facts. For pgstattuple_approx(), it might under-estimate the free
> space and over-estimate the number of live tuples.
>
Sure, but it can do so without the patch as well, if vacuum hasn't updated the free space map.
> This might be fine,
> since it is approximate after all, but maybe a comment would be
> helpful. If this is a problem, we could tweak it to be more precise
> for tables without FSMs.
>
Sounds reasonable to me.
> Thoughts?
>
> 4. The latest patch for the pg_upgrade piece was in [2]
>
It will be good if we get this one as well. I will look into it once we are done with the other points you have mentioned.
On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > 1. Earlier, I had a test to ensure that free space towards the front > > of the relation was visible with no FSM. In [1], I rewrote it without > > using vacuum, so we can consider adding it back now if desired. I can > > prepare a patch for this. > > > > Yes, this is required. It is generally a good practise to add test (unless it takes a lot of time) which covers new code/functionality. > > > 2. As a follow-on, since we don't rely on vacuum to remove dead rows, > > we could try putting the fsm.sql test in some existing group in the > > parallel schedule, rather than its own group is it is now. > > > > +1. This is done in 0001. > > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this > > patch's effects on GetRecordedFreeSpace(), which will return zero for > > tables with no FSM. > > > > Right, but what exactly we want to do for it? Do you want to add a comment atop of this function? Hmm, the comment already says "according to the FSM", so maybe it's already obvious. I was thinking more about maybe commenting the callsite where it's helpful, as in 0002. > > The other callers are in: > > contrib/pg_freespacemap/pg_freespacemap.c > > contrib/pgstattuple/pgstatapprox.c > > > > For pg_freespacemap, this doesn't matter, since it's just reporting > > the facts. For pgstattuple_approx(), it might under-estimate the free > > space and over-estimate the number of live tuples. > > > > Sure, but without patch also, it can do so, if the vacuum hasn't updated freespace map. Okay, then maybe we don't need to do anything else here. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Feb 4, 2019 at 2:27 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > > > 1. Earlier, I had a test to ensure that free space towards the front > > > of the relation was visible with no FSM. In [1], I rewrote it without > > > using vacuum, so we can consider adding it back now if desired. I can > > > prepare a patch for this. > > > > > > > Yes, this is required. It is generally a good practise to add test (unless it takes a lot of time) which covers newcode/functionality. > > > > > 2. As a follow-on, since we don't rely on vacuum to remove dead rows, > > > we could try putting the fsm.sql test in some existing group in the > > > parallel schedule, rather than its own group is it is now. > > > > > > > +1. > > This is done in 0001. > This is certainly a good test w.r.t code coverage of new code, but I have few comments: 1. The size of records in test still depends on alignment (MAXALIGN). Though it doesn't seem to be a problematic case, I still suggest we can avoid using records whose size depends on alignment. If you change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, str text);, then you can avoid alignment related issues for the records being used in test. 2. +-- Fill most of the last block .. +-- Make sure records can go into any block but the last one .. +-- Insert large record and make sure it does not cause the relation to extend The comments in some part of the test seems too focussed towards the algorithm used for in-memory map. I think we can keep these if we want, but it is required to write a more generic comment stating what is the actual motive of additional tests (basically we are testing the functionality of in-memory map (LSM) for the heap, so we should write about it.). > > > 3. While looking at 0d1fe9f74e, it occurred to me that I ignored this > > > patch's effects on GetRecordedFreeSpace(), which will return zero for > > > tables with no FSM. > > > > > > > Right, but what exactly we want to do for it? Do you want to add a comment atop of this function? > > Hmm, the comment already says "according to the FSM", so maybe it's > already obvious. I was thinking more about maybe commenting the > callsite where it's helpful, as in 0002. > > > > The other callers are in: > > > contrib/pg_freespacemap/pg_freespacemap.c > > > contrib/pgstattuple/pgstatapprox.c > > > > > > For pg_freespacemap, this doesn't matter, since it's just reporting > > > the facts. For pgstattuple_approx(), it might under-estimate the free > > > space and over-estimate the number of live tuples. > > > > > > > Sure, but without patch also, it can do so, if the vacuum hasn't updated freespace map. > > Okay, then maybe we don't need to do anything else here. > Shall we add a note to the docs of pg_freespacemap and pgstattuple_approx indicating that for small relations, FSM won't be created, so these functions won't give appropriate value? Or other possibility could be that we return an error if the block number is less than the threshold value, but not sure if that is a good alternative as that can happen today also if the vacuum hasn't run on the table. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > wrote: >> >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> >> wrote: > This is certainly a good test w.r.t code coverage of new code, but I > have few comments: > 1. The size of records in test still depends on alignment (MAXALIGN). > Though it doesn't seem to be a problematic case, I still suggest we > can avoid using records whose size depends on alignment. If you > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > str text);, then you can avoid alignment related issues for the > records being used in test. Done. > 2. > +-- Fill most of the last block > .. > +-- Make sure records can go into any block but the last one > .. > +-- Insert large record and make sure it does not cause the relation to > extend > > The comments in some part of the test seems too focussed towards the > algorithm used for in-memory map. I think we can keep these if we > want, but it is required to write a more generic comment stating what > is the actual motive of additional tests (basically we are testing the > functionality of in-memory map (LSM) for the heap, so we should write > about it.). Done. > Shall we add a note to the docs of pg_freespacemap and > pgstattuple_approx indicating that for small relations, FSM won't be > created, so these functions won't give appropriate value? I've given this a try in 0002. > Or other > possibility could be that we return an error if the block number is > less than the threshold value, but not sure if that is a good > alternative as that can happen today also if the vacuum hasn't run on > the table. Yeah, an error doesn't seem helpful. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Shall we add a note to the docs of pg_freespacemap and > > pgstattuple_approx indicating that for small relations, FSM won't be > > created, so these functions won't give appropriate value? > > I've given this a try in 0002. > This looks mostly correct, but I have a few observations: 1. - tuples. + tuples. Small tables don't have a free space map, so in that case + this function will report zero free space, likewise inflating the + estimated number of live tuples. The last part of the sentence "likewise inflating the estimated number of live tuples." seems incorrect to me because live tuples are computed based on the pages scanned, live tuples in them and total blocks in the relation. So, I think it should be "likewise inflating the approximate tuple length". 2. + In addition, small tables don't have a free space map, so this function + will return zero even if free space is available. Actually, the paragraph you have modified applies to both the functions mentioned on that page. So instead of saying "this function ..", we can say "these functions .." 3. * space from the FSM and move on. + * Note: If a relation has no FSM, GetRecordedFreeSpace() will report + * zero free space. This is fine for the purposes of approximation. */ It is better to have an empty line before Note: ... I have modified the patch for the above observations and added a commit message as well, see if it looks okay to you. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Feb 20, 2019 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have modified the patch for the above observations and added a > commit message as well, see if it looks okay to you. Looks good to me, thanks. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > > wrote: > >> > >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> > >> wrote: > > This is certainly a good test w.r.t code coverage of new code, but I > > have few comments: > > 1. The size of records in test still depends on alignment (MAXALIGN). > > Though it doesn't seem to be a problematic case, I still suggest we > > can avoid using records whose size depends on alignment. If you > > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > > str text);, then you can avoid alignment related issues for the > > records being used in test. > > Done. > > > 2. > > +-- Fill most of the last block > > .. > > +-- Make sure records can go into any block but the last one > > .. > > +-- Insert large record and make sure it does not cause the relation to > > extend > > > > The comments in some part of the test seems too focussed towards the > > algorithm used for in-memory map. I think we can keep these if we > > want, but it is required to write a more generic comment stating what > > is the actual motive of additional tests (basically we are testing the > > functionality of in-memory map (LSM) for the heap, so we should write > > about it.). > > Done. > Thanks, the modification looks good. I have slightly changed the commit message in the attached patch. I will spend some more time tomorrow morning on this and will commit unless I see any new problem. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
Please remember to keep serial_schedule in sync. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Please remember to keep serial_schedule in sync. > I don't understand what you mean by this? It is already present in serial_schedule. In parallel_schedule, we are just moving this test to one of the parallel groups. Do we need to take care of something else? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On 2019-Feb-21, Amit Kapila wrote: > On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > Please remember to keep serial_schedule in sync. > > I don't understand what you mean by this? It is already present in > serial_schedule. In parallel_schedule, we are just moving this test > to one of the parallel groups. Do we need to take care of something > else? Just to make sure it's in the same relative position. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Feb 11, 2019 at 10:48 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On 2/9/19, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 5, 2019 at 3:25 PM John Naylor <john.naylor@2ndquadrant.com> > > wrote: > >> > >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila <amit.kapila16@gmail.com> > >> wrote: > > This is certainly a good test w.r.t code coverage of new code, but I > > have few comments: > > 1. The size of records in test still depends on alignment (MAXALIGN). > > Though it doesn't seem to be a problematic case, I still suggest we > > can avoid using records whose size depends on alignment. If you > > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int, > > str text);, then you can avoid alignment related issues for the > > records being used in test. > > Done. > Oops, on again carefully studying the test, I realized my above comment was wrong. Let me explain with a test this time: CREATE TABLE fsm_check_size (num int, str text); INSERT INTO fsm_check_size SELECT i, rpad('', 1024, 'a') FROM generate_series(1,3) i; So here you are inserting 4-byte integer and 1024-bytes variable length record. So the tuple length will be tuple_header (24-bytes) + 4-bytes for integer + 4-bytes header for variable length data + 1024 bytes of actual data. So, the length will be 1056 which is already MAXALIGN. I took the new comments added in your latest version of the patch and added them to the previous version of the patch. Kindly see if I have not missed anything while merging the patch-versions? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 21, 2019 at 7:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > So here you are inserting 4-byte integer and 1024-bytes variable > length record. So the tuple length will be tuple_header (24-bytes) + > 4-bytes for integer + 4-bytes header for variable length data + 1024 > bytes of actual data. So, the length will be 1056 which is already > MAXALIGN. I took the new comments added in your latest version of the > patch and added them to the previous version of the patch. Kindly > see if I have not missed anything while merging the patch-versions? OK, that makes sense. Looks fine to me. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 21, 2019 at 6:39 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-Feb-21, Amit Kapila wrote: > > > On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > Please remember to keep serial_schedule in sync. > > > > I don't understand what you mean by this? It is already present in > > serial_schedule. In parallel_schedule, we are just moving this test > > to one of the parallel groups. Do we need to take care of something > > else? > > Just to make sure it's in the same relative position. > Okay, thanks for the input. Attached, find the updated patch, will try to commit it tomorrow unless you or someone else has any other comments. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
I think this test is going to break on nonstandard block sizes. While we don't promise that all tests work on such installs (particularly planner ones), it seems fairly easy to cope with this one -- just use a record size expressed as a fraction of current_setting('block_size'). So instead of "1024" you'd write current_setting('block_size') / 8. And then display the relation size in terms of pages, not bytes, so divide pg_relation_size by block size. (I see that there was already a motion for this, but was dismissed because of lack of interest.) -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
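For illustration, a sketch of what that could look like for the fsm_check_size test (assuming the rpad()-based filler discussed upthread; the committed wording may differ):

-- record size expressed as a fraction of the configured block size
INSERT INTO fsm_check_size
  SELECT i, rpad('', current_setting('block_size')::int / 8, 'a')
  FROM generate_series(1, 3) i;
-- report the heap size in pages rather than bytes
SELECT pg_relation_size('fsm_check_size', 'main') /
       current_setting('block_size')::int AS heap_nblocks;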
On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I think this test is going to break on nonstandard block sizes. While > we don't promise that all tests work on such installs (particularly > planner ones), > The reason for not pushing much on making the test pass for nonstandard block sizes is that when I tried existing tests, there were already some failures. For example, see the failures in the attached regression diff files (for block_size as 16K and 32K respectively). I saw those failures during the previous investigation, the situation on HEAD might or might not be exactly the same. Whereas I see the value in trying to make sure that tests pass for nonstandard block sizes, but that doesn't seem to be followed for all the tests. > it seems fairly easy to cope with this one -- just use a > record size expressed as a fraction of current_setting('block_size'). > So instead of "1024" you'd write current_setting('block_size') / 8. > And then display the relation size in terms of pages, not bytes, so > divide pg_relation_size by block size. > The idea sounds good. John, would you like to give it a try? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Feb 21, 2019 at 9:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. Sure, but let's not make things worse. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-Feb-22, Amit Kapila wrote: > On Fri, Feb 22, 2019 at 1:57 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I think this test is going to break on nonstandard block sizes. While > > we don't promise that all tests work on such installs (particularly > > planner ones), > > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. For example, see the failures in the > attached regression diff files (for block_size as 16K and 32K > respectively). I saw those failures during the previous > investigation, the situation on HEAD might or might not be exactly the > same. Whereas I see the value in trying to make sure that tests pass > for nonstandard block sizes, but that doesn't seem to be followed for > all the tests. Wow, there's a lot less tests failing there than I thought there would be. That increases hope that we can someday have them pass. +1 on not making things worse. I think the crash in the amcheck test should be studied, one way or another; CCing Peter. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Feb 22, 2019 at 8:04 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Wow, there's a lot less tests failing there than I thought there would > be. That increases hope that we can someday have them pass. +1 on not > making things worse. > > I think the crash in the amcheck test should be studied, one way or > another; CCing Peter. I built Postgres with "--with-blocksize=16" and "--with-blocksize=32", and tested amcheck with both builds. All tests passed. I have a hard time imagining what the problem could be here. If there was a problem with amcheck relying on there being an 8KiB block size specifically, then it would almost certainly have been there since the initial commit from March 2017. Not much has changed since then, and the crash that Amit reported occurs at the earliest possible point. I find it suspicious that there is another crash in pageinspect's brin_page_items(), since like amcheck, pageinspect is a contrib module that relies on BLCKSZ when allocating a local temp buffer. -- Peter Geoghegan
On 2019-Feb-22, Peter Geoghegan wrote: > I find it suspicious that there is another crash in pageinspect's > brin_page_items(), since like amcheck, pageinspect is a contrib module > that relies on BLCKSZ when allocating a local temp buffer. Ah. Maybe they just weren't rebuilt. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I think this test is going to break on nonstandard block sizes. While > we don't promise that all tests work on such installs (particularly > planner ones), it seems fairly easy to cope with this one -- just use a > record size expressed as a fraction of current_setting('block_size'). > So instead of "1024" you'd write current_setting('block_size') / 8. > And then display the relation size in terms of pages, not bytes, so > divide pg_relation_size by block size. I've done this for v6, tested on 16k block size. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > The reason for not pushing much on making the test pass for > nonstandard block sizes is that when I tried existing tests, there > were already some failures. FWIW, I currently see 8 failures (attached). -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On 2019-Feb-23, John Naylor wrote: > On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > The reason for not pushing much on making the test pass for > > nonstandard block sizes is that when I tried existing tests, there > > were already some failures. > > FWIW, I currently see 8 failures (attached). Hmm, not great -- even the strings test fails, which seems to try to handle the case explicitly. I did expect the plan shape ones to fail, but I'm surprised about the tablesample one. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 26/02/2019 16:20, Alvaro Herrera wrote: > On 2019-Feb-23, John Naylor wrote: > >> On Fri, Feb 22, 2019 at 3:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote: >>> The reason for not pushing much on making the test pass for >>> nonstandard block sizes is that when I tried existing tests, there >>> were already some failures. >> >> FWIW, I currently see 8 failures (attached). > > Hmm, not great -- even the strings test fails, which seems to try to handle > the case explicitly. I did expect the plan shape ones to fail, but I'm > surprised about the tablesample one. > The SYSTEM table sampling is basically per-page sampling so it depends heavily on which rows are on which page. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > Once we agree on the code, we need to test below scenarios: > (a) upgrade from all supported versions to the latest version > (b) upgrade standby with and without using rsync. Although the code hasn't been reviewed yet, I went ahead and tested (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4 instance with the regression database and restored it to all supported versions. To make it work with pg_upgrade, I first had to drop tables with oids, drop functions referring to C libraries, and drop the later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all versions to HEAD with the patch applied. pg_upgrade worked without error, and the following query returned 0 bytes on all the new clusters: select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size from pg_class where relkind in ('r', 't') and pg_relation_size(oid, 'main') <= 4 * 8192; The complementary query (> 4 * 8192) returned the same number of bytes as in the original 9.4 instance. To make it easy to find, the latest regression test patch, which is intended to be independent of block-size, was in [2]. [1] https://www.postgresql.org/message-id/CACPNZCu4cOdm3uGnNEGXivy7Gz8UWyQjynDpdkPGabQ18_zK6g%40mail.gmail.com [2] https://www.postgresql.org/message-id/CACPNZCsWa%3Ddd0K%2BFiODwM%3DLEsepAHVJCoSx2Avew%3DxBEX3Ywjw%40mail.gmail.com -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
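For completeness, the complementary query mentioned above is of the same shape, with the size condition inverted:

select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size
from pg_class where relkind in ('r', 't')
and pg_relation_size(oid, 'main') > 4 * 8192;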
On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > I think this test is going to break on nonstandard block sizes. While > > we don't promise that all tests work on such installs (particularly > > planner ones), it seems fairly easy to cope with this one -- just use a > > record size expressed as a fraction of current_setting('block_size'). > > So instead of "1024" you'd write current_setting('block_size') / 8. > > And then display the relation size in terms of pages, not bytes, so > > divide pg_relation_size by block size. > > I've done this for v6, tested on 16k block size. > Thanks, the patch looks good to me. I have additionally tested it 32K and 1K sized blocks and the test passes. I will commit this early next week. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 28, 2019 at 2:33 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Sat, Jan 26, 2019 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > I think there is some value in using the information from > > this function to skip fsm files, but the code doesn't appear to fit > > well, how about moving this check to new function > > new_cluster_needs_fsm()? > > For v21, new_cluster_needs_fsm() has all responsibility for obtaining > the info it needs. I think this is much cleaner, > Right, now the code looks much better. > but there is a small > bit of code duplication since it now has to form the file name. One > thing we could do is form the the base old/new file names in > transfer_single_new_db() and pass those to transfer_relfile(), which > will only add suffixes and segment numbers. We could then pass the > base old file name to new_cluster_needs_fsm() and use it as is. Not > sure if that's worthwhile, though. > I don't think it is worth. Few minor comments: 1. warning C4715: 'new_cluster_needs_fsm': not all control paths return a value Getting this new warning in the patch. 2. + + /* Transfer any VM files if we can trust their contents. */ if (vm_crashsafe_match) 3. Can we add a note about this in the Notes section of pg_upgrade documentation [1]? This comment line doesn't seem to be related to this patch. If so, I think we can avoid having any additional change which is not related to the functionality of this patch. Feel free to submit it separately, if you think it is an improvement. Have you done any performance testing of this patch? I mean to say now that we added a new stat call for each table, we should see if that has any impact. Ideally, that should be compensated by the fact that we are now not transferring *fsm files for small relations. How about constructing a test where all relations are greater than 4 pages and then try to upgrade them. We can check for a cluster with a different number of relations say 10K, 20K, 50K, 100K. In general, the patch looks okay to me. I would like to know if anybody else has any opinion whether pg_upgrade should skip transferring fsm files for small relations or not? I think both me and John thinks that it is good to have feature and now that patch turns out to be simpler, I feel we can go ahead with this optimization in pg_upgrade. [1] - https://www.postgresql.org/docs/devel/pgupgrade.html -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
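For what it's worth, one hypothetical way to populate such a test cluster (table names and sizes are made up; each table ends up a bit over 4 pages at the default block size):

DO $$
BEGIN
  FOR i IN 1..10000 LOOP
    EXECUTE format('CREATE TABLE upgrade_test_%s (a int, b text)', i);
    EXECUTE format('INSERT INTO upgrade_test_%s SELECT g, repeat(''x'', 500) FROM generate_series(1, 100) g', i);
  END LOOP;
END;
$$;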
On Wed, Mar 6, 2019 at 5:19 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Jan 25, 2019 at 9:50 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Once we agree on the code, we need to test below scenarios: > > (a) upgrade from all supported versions to the latest version > > (b) upgrade standby with and without using rsync. > > Although the code hasn't been reviewed yet, I went ahead and tested > (a) on v21 of the pg_upgrade patch [1]. To do this I dumped out a 9.4 > instance with the regression database and restored it to all supported > versions. To make it work with pg_upgrade, I first had to drop tables > with oids, drop functions referring to C libraries, and drop the > later-removed '=>' operator. Then I pg_upgrade'd in copy mode from all > versions to HEAD with the patch applied. pg_upgrade worked without > error, and the following query returned 0 bytes on all the new > clusters: > > select sum(pg_relation_size(oid, 'fsm')) as total_fsm_size > from pg_class where relkind in ('r', 't') > and pg_relation_size(oid, 'main') <= 4 * 8192; > > The complementary query (> 4 * 8192) returned the same number of bytes > as in the original 9.4 instance. > Thanks, the tests done by you are quite useful. I have given a few comments as a response to your previous email. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Few minor comments: .. > > 2. > + > + /* Transfer any VM files if we can trust their > contents. */ > if (vm_crashsafe_match) > > 3. Can we add a note about this in the Notes section of pg_upgrade > documentation [1]? > > This comment line doesn't seem to be related to this patch. If so, I > think we can avoid having any additional change which is not related > to the functionality of this patch. Feel free to submit it > separately, if you think it is an improvement. > oops, I have messed up the above comments. The paragraph starting with "This comment line doesn't ..." is for comment number-2. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Few minor comments: > 1. > warning C4715: 'new_cluster_needs_fsm': not all control paths return a value > > Getting this new warning in the patch. Hmm, I don't get that in a couple versions of gcc. Your compiler must not know that pg_fatal() cannot return. I blindly added a fix. > 2. > > This comment line doesn't seem to be related to this patch. If so, I > think we can avoid having any additional change which is not related > to the functionality of this patch. Feel free to submit it > separately, if you think it is an improvement. > + > + /* Transfer any VM files if we can trust their contents. */ > if (vm_crashsafe_match) Well, I guess the current comment is still ok, so reverted. If I were to do a separate cleanup patch, I would rather remove the vm_must_add_frozenbit parameter -- there's no reason I can see for calls that transfer the heap and FSM to know about this. I also changed references to the 'first segment of the main fork' where there will almost always only be one segment. This was a vestige of the earlier algorithm I had. > 3. Can we add a note about this in the Notes section of pg_upgrade > documentation [1]? Done. > Have you done any performance testing of this patch? I mean to say > now that we added a new stat call for each table, we should see if > that has any impact. Ideally, that should be compensated by the fact > that we are now not transferring *fsm files for small relations. How > about constructing a test where all relations are greater than 4 pages > and then try to upgrade them. We can check for a cluster with a > different number of relations say 10K, 20K, 50K, 100K. I have not, but I agree it should be done. I will try to do so soon. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
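For readers following along, the "blind fix" for warning C4715 mentioned above typically looks like the fragment below: the compiler cannot see that pg_fatal() never returns, so the error branch appears to fall off the end of the function without a value, and a dummy return after it keeps the warning quiet. This is a sketch of the pattern only, reusing the assumed function shape from earlier, not the attached patch.

        if (stat(old_primary_file, &statbuf) != 0)
        {
            pg_fatal("error while checking file \"%s\": %s\n",
                     old_primary_file, strerror(errno));

            /*
             * pg_fatal() does not return, but a compiler that cannot see
             * that wants every code path to return a value (warning C4715).
             */
            return false;
        }

        return statbuf.st_size > HEAP_FSM_CREATION_THRESHOLD * BLCKSZ;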
On Thu, Mar 7, 2019 at 7:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Sat, Feb 23, 2019 at 1:24 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Feb 21, 2019 at 9:27 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > > > > I think this test is going to break on nonstandard block sizes. While > > > we don't promise that all tests work on such installs (particularly > > > planner ones), it seems fairly easy to cope with this one -- just use a > > > record size expressed as a fraction of current_setting('block_size'). > > > So instead of "1024" you'd write current_setting('block_size') / 8. > > > And then display the relation size in terms of pages, not bytes, so > > > divide pg_relation_size by block size. > > > > I've done this for v6, tested on 16k block size. > > > > Thanks, the patch looks good to me. I have additionally tested it 32K > and 1K sized blocks and the test passes. I will commit this early > next week. > Pushed this patch. Last time, we saw a few portability issues with this test. Both John and I, with the help of others, tried to ensure that there are no more such issues, but there is always a chance that we missed something. Anyway, I will keep an eye on the buildfarm to see if there is any problem related to this patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 10, 2019 at 7:47 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Have you done any performance testing of this patch? I mean to say > > now that we added a new stat call for each table, we should see if > > that has any impact. Ideally, that should be compensated by the fact > > that we are now not transferring *fsm files for small relations. How > > about constructing a test where all relations are greater than 4 pages > > and then try to upgrade them. We can check for a cluster with a > > different number of relations say 10K, 20K, 50K, 100K. > > I have not, but I agree it should be done. I will try to do so soon. > Thanks, I will wait for your test results. I believe this is the last patch in this project and we should try to get it done soon. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > Have you done any performance testing of this patch? I mean to say > now that we added a new stat call for each table, we should see if > that has any impact. Ideally, that should be compensated by the fact > that we are now not transferring *fsm files for small relations. To be precise, it will only call stat if pg_class.relpages is below the threshold. I suppose I could hack a database where all the relpages values are wrong, but that seems like a waste of time. > How > about constructing a test where all relations are greater than 4 pages > and then try to upgrade them. We can check for a cluster with a > different number of relations say 10K, 20K, 50K, 100K. I did both greater and less than 4 pages for 10k relations. Since pg_upgrade is O(# relations), I don't see a point in going higher. First, I had a problem: On MacOS with their "gcc" wrapper around clang, I got a segfault 11 when compiled with no debugging symbols. I added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a debugging build, I'm not sure how to investigate this. IIRC, this doesn't happen for me on Linux gcc. Since it was at least running now, I measured by putting gettimeofday() calls around transfer_all_new_tablespaces(). I did 10 runs each and took the average, except for patch/1-page case since it was obviously faster after a couple runs. 5 pages: master patch 5.59s 5.64s The variation within the builds is up to +/- 0.2s, so there is no difference, as expected. 1 page: master patch 5.62s 4.25s Clearly, linking is much slower than stat. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
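For anyone who wants to reproduce the measurement, the instrumentation described above amounts to wrapping the transfer phase roughly as below; the argument list shown for transfer_all_new_tablespaces() and the use of pg_log() are assumptions based on the surrounding pg_upgrade code, not part of any posted patch.

    #include <sys/time.h>        /* at the top of the file */

        struct timeval tv_before,
                    tv_after;

        gettimeofday(&tv_before, NULL);

        /* the phase being timed: copying or linking all relation files */
        transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
                                     old_cluster.pgdata, new_cluster.pgdata);

        gettimeofday(&tv_after, NULL);
        pg_log(PG_REPORT, "transfer_all_new_tablespaces(): %.2f s\n",
               (tv_after.tv_sec - tv_before.tv_sec) +
               (tv_after.tv_usec - tv_before.tv_usec) / 1000000.0);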
On Wed, Mar 13, 2019 at 4:57 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 8, 2019 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Have you done any performance testing of this patch? I mean to say > > now that we added a new stat call for each table, we should see if > > that has any impact. Ideally, that should be compensated by the fact > > that we are now not transferring *fsm files for small relations. > > To be precise, it will only call stat if pg_class.relpages is below > the threshold. I suppose I could hack a database where all the > relpages values are wrong, but that seems like a waste of time. > Right. > > How > > about constructing a test where all relations are greater than 4 pages > > and then try to upgrade them. We can check for a cluster with a > > different number of relations say 10K, 20K, 50K, 100K. > > I did both greater and less than 4 pages for 10k relations. Since > pg_upgrade is O(# relations), I don't see a point in going higher. > > First, I had a problem: On MacOS with their "gcc" wrapper around > clang, I got a segfault 11 when compiled with no debugging symbols. > Did you get this problem with the patch or both with and without the patch? If it is only with patch, then we definitely need to investigate. > I > added "CFLAGS=-O0" and it worked fine. Since this doesn't happen in a > debugging build, I'm not sure how to investigate this. IIRC, this > doesn't happen for me on Linux gcc. > > Since it was at least running now, I measured by putting > gettimeofday() calls around transfer_all_new_tablespaces(). I did 10 > runs each and took the average, except for patch/1-page case since it > was obviously faster after a couple runs. > > 5 pages: > master patch > 5.59s 5.64s > > The variation within the builds is up to +/- 0.2s, so there is no > difference, as expected. > > 1 page: > master patch > 5.62s 4.25s > > Clearly, linking is much slower than stat. > The results are fine. Thanks for doing the tests. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > First, I had a problem: On MacOS with their "gcc" wrapper around > > clang, I got a segfault 11 when compiled with no debugging symbols. > > > > Did you get this problem with the patch or both with and without the > patch? If it is only with patch, then we definitely need to > investigate. Only with the patch. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 13, 2019 at 7:42 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Wed, Mar 13, 2019 at 8:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > First, I had a problem: On MacOS with their "gcc" wrapper around > > > clang, I got a segfault 11 when compiled with no debugging symbols. > > > > > > > Did you get this problem with the patch or both with and without the > > patch? If it is only with patch, then we definitely need to > > investigate. > > Only with the patch. > If the problem is reproducible, then I think we can try out a few things to narrow it down or get some clue about the problem: (a) modify new_cluster_needs_fsm() so that it always returns true as its first statement; this will tell us whether the code in that function is causing the problem. (b) run with Valgrind. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
> [segfault problems] This now seems spurious. I ran make distclean, git pull, reapplied the patch (leaving out the gettimeofday() calls), and now my upgrade perf test works with default compiler settings. Not sure what happened, but hopefully we can move forward. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 14, 2019 at 7:08 AM John Naylor <john.naylor@2ndquadrant.com> wrote: > > > [segfault problems] > > This now seems spurious. I ran make distclean, git pull, reapplied the > patch (leaving out the gettimeofday() calls), and now my upgrade perf > test works with default compiler settings. Not sure what happened, but > hopefully we can move forward. > Yeah, I took another pass through the patch and changed a few minor things, as you can see in the attached patch. 1. Added an Assert in new_cluster_needs_fsm() that old cluster version should be >= 804 as that is where fsm support has been added. 2. Reverted the old cluster version check to <= 1100. There was nothing wrong with the way you had written the check, but I find that most of the other places in the code use that style. 3. At one place, changed the function header to be consistent with the nearby code, ran pgindent on the code, and changed the commit message. Let me know what you think. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
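To put the version checks discussed above (the 8.4 cutoff for fsm forks and the <= 1100 check on the old cluster) in context, here is a sketch of how the caller-side gating could look; the surrounding loop variables, the "_fsm" suffix handling, and the exact placement of the checks are assumptions for illustration rather than the attached patch. As noted earlier in the thread, pg_class.relpages is also consulted first, so the stat() call only happens for relations that already look small.

        /* fsm forks only exist in 8.4 and later */
        if (GET_MAJOR_VERSION(old_cluster.major_version) >= 804)
        {
            if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1100)
            {
                /*
                 * An old cluster up to v11 may have an FSM even for a tiny
                 * heap; the new cluster would not create one below the
                 * threshold, so skip transferring the fork for such heaps.
                 */
                if (new_cluster_needs_fsm(&maps[mapnum]))
                    transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
            }
            else
                transfer_relfile(&maps[mapnum], "_fsm", vm_must_add_frozenbit);
        }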
On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > should be >= 804 as that is where fsm support has been added. There is already an explicit check for 804 in the caller, and the HEAD code is already resilient to FSMs not existing, so I think this is superfluous. > 2. Reverted the old cluster version check to <= 1100. There was > nothing wrong with the way you had written the check, but I find that > most of the other places in the code use that style. > 3. At one place, changed the function header to be consistent with the > nearby code, ran pgindent on the code, and changed the commit message. Looks good to me, thanks. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Mar 14, 2019 at 12:37 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > > should be >= 804 as that is where fsm support has been added. > > There is already an explicit check for 804 in the caller, > Yeah, I know that, but I have added it to prevent this function being used elsewhere. OTOH, maybe you are right that as per current code it is superfluous, so we shouldn't add this assert. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 14, 2019 at 7:46 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Mar 14, 2019 at 12:37 PM John Naylor > <john.naylor@2ndquadrant.com> wrote: > > > > On Thu, Mar 14, 2019 at 2:17 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > 1. Added an Assert in new_cluster_needs_fsm() that old cluster version > > > should be >= 804 as that is where fsm support has been added. > > > > There is already an explicit check for 804 in the caller, > > > > Yeah, I know that, but I have added it to prevent this function being > used elsewhere. OTOH, maybe you are right that as per current code it > is superfluous, so we shouldn't add this assert. > I have committed the latest version of this patch. I think we can wait for a day or two to see if there are any complaints from the buildfarm or in general, and then we can close this CF entry. IIRC, this was the last patch in the series, right? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I have committed the latest version of this patch. I think we can > wait for a day or two to see if there are any complaints from the > buildfarm or in general, and then we can close this CF entry. IIRC, > this was the last patch in the series, right? Great, thanks! I'll keep an eye on the buildfarm as well. I just spotted two comments in freespace.c that were true during earlier patch revisions, but are no longer true, so I've attached a fix for those. There are no other patches in the series. -- John Naylor https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Fri, Mar 15, 2019 at 3:40 PM John Naylor <john.naylor@2ndquadrant.com> wrote: > > On Fri, Mar 15, 2019 at 5:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have committed the latest version of this patch. I think we can > > wait for a day or two to see if there are any complaints from the > > buildfarm or in general, and then we can close this CF entry. IIRC, > > this was the last patch in the series, right? > > Great, thanks! I'll keep an eye on the buildfarm as well. > > I just spotted two comments in freespace.c that were true during > earlier patch revisions, but are no longer true, so I've attached a > fix for those. > LGTM, so committed. > There are no other patches in the series. > Okay. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 15, 2019 at 3:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > I have committed the latest version of this patch. I think we can > wait for a day or two to see if there are any complaints from the > buildfarm or in general, and then we can close this CF entry. > Closed this CF entry. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com