Thread: Reduce pinning in btree indexes
We have a customer who was unable to migrate from a well-known commercial database product to pg because they have a very large software base that holds cursors open for very long periods of time, preventing GlobalXmin values from advancing, leading to bloat. They have a standard test that exercises hundreds of client connections for days at a time, and saw disk space and CPU usage climbing to unacceptable levels[1]. The other database product had no such problems. We suggested the obvious solutions that we always suggest on the community lists, and they had perfectly good reasons why those were not viable for them -- we either needed to prevent bloat under their current software or they could not migrate.

What they wanted was what happened in the other database product -- if a snapshot got sufficiently old that cleaning up the MVCC data was a problem *and* the snapshot was used again *and* it read a page which had been modified far enough back that it was not possible to return correct data, then they wanted to receive a "snapshot too old" error. I wrote a patch to do that, but they didn't see much improvement, because all (auto)vacuum processes blocked indefinitely on the btree index pages where a cursor was parked. So they still got massive bloat and consequent problems.

It seemed odd to me that a btree implementation based on the Lehman & Yao techniques would block anything, because the concept is that it reads everything it needs off the index page and pauses "between" pages. We sorta do that, but the "interlock" between vacuum processes and other index usages basically involves holding a pin on the page we just read until we are done using the values we read off that page, and treating the pin as a lighter lock than a shared (or READ) lock -- one which conflicts only with a "super exclusive" lock, which consists of getting an exclusive lock only once there are no other processes holding a pin on the page. This ensures that at the end of a vacuum pass over a btree index there are no TIDs in process-local memory from before the start of the pass, and consequently no scan can read a new tuple assigned to the same TID value after the rest of the vacuum phases run. So a pin on a btree page blocks a vacuum process indefinitely.

Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago. The customer's cursors which were causing the problems all use MVCC snapshots, so I went looking to see whether we couldn't take advantage of this fact. For the most part it seemed we were OK if we dropped pins along with the READ lock for a btree scan which used an MVCC snapshot. I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those. Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins.

There was also a problem with the fact that the code conflated the notion of having a valid scan position with holding a pin on a block in the index.
Those two concepts needed to be teased apart, which I did using some new macros and figuring out which one to use where. Without a pin on the block, we also needed to remember what block we had been processing to be able to do the LP_DEAD hinting; for that the block number was added to the structure which tracks scan position.

Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order to set some variables in a structure when a mark is requested instead of running two fairly small memcpy() statements. The two-day benchmark of the customer showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit. In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious.

At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors.

In combination with the snapshot-too-old patch the 2-day customer benchmark ran without problem levels of bloat, controlling disk space growth and keeping CPU usage level. Without the other patch, this patch would at least allow autovacuum to maintain statistics, which probably makes it worthwhile on its own, but it will probably not have a very big impact on bloat without both.

At least two README files and a lot of comments need to be modified before commit to reflect the change in approach if this is accepted by the community. Since this work has been very recent I haven't had time to do that yet.

git diff --stat tells me this patch has:
 4 files changed, 208 insertions(+), 128 deletions(-)
... and the related "snapshot too old" patch has:
 16 files changed, 145 insertions(+), 19 deletions(-)

Given that these are going into the last CF and the related patch is really still at "proof of concept" phase, it may be too late to consider them for PostgreSQL 9.5, but I would still like a serious review now in case people feel strongly about controlling bloat in the face of old snapshots, and to know whether to work on this just for the Postgres Plus Advanced Server fork or for the community.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

[1] This test maintains a certain transaction rate, with user think times on the connections, so performance isn't measured by a tps rate but by how much CPU is consumed to maintain that rate.
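To illustrate the LSN-based workaround for LP_DEAD hinting described above, here is a minimal sketch of the idea (not the patch code itself; the lsn and currPage members of the scan-position structure are assumed names):

    /*
     * While still holding the read lock on the index page, remember where
     * the page was in the WAL stream and which block it is.
     */
    so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
    so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);

    /* ... lock and pin released; the scan works from its local copies ... */

    /*
     * Later, when we want to set LP_DEAD hints without having kept the pin:
     * re-read the page and only hint if nothing can have moved under us.
     */
    buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
    page = BufferGetPage(buf);
    if (PageGetLSN(page) != so->currPos.lsn)
    {
        /*
         * The page changed while we held neither lock nor pin; an old TID
         * could now point at a different tuple, so skip the hinting.
         */
        _bt_relbuf(rel, buf);
        return;
    }
    /* ... safe to flag the remembered items as LP_DEAD ... */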
On 02/15/2015 02:19 AM, Kevin Grittner wrote: > Interestingly, the btree README points out that using the old TID > with a new tuple poses no hazard for a scan using an MVCC snapshot, > because the new tuple would not be visible to a snapshot created > that long ago. The first question is: Do we really need that interlock for the non-MVCC snapshots either? If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does. I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me. Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it. > I found that the LP_DEAD hinting > would be a problem with an old TID, but I figured we could work > around that by storing the page LSN into the scan position structure > when the page contents were read, and only doing hinting if that > matched when we were ready to do the hinting. That wouldn't work > for an index which was not WAL-logged, so the patch still holds > pins for those. Or you could use GetFakeLSNForUnloggedRel(). > Robert pointed out that the visibility information > for an index-only scan wasn't checked while the index page READ > lock was held, so those scans also still hold the pins. Why does an index-only scan need to hold the pin? > Finally, there was an "optimization" for marking buffer position > for possible restore that was incompatible with releasing the pin. > I use quotes because the optimization adds overhead to every move > to the next page in order set some variables in a structure when a > mark is requested instead of running two fairly small memcpy() > statements. The two-day benchmark of the customer showed no > performance hit, and looking at the code I would be amazed if the > optimization yielded a measurable benefit. In general, optimization > by adding overhead to moving through a scan to save time in a mark > operation seems dubious. Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up. > At some point we could consider building on this patch to recheck > index conditions for heap access when a non-MVCC snapshot is used, > check the visibility map for referenced heap pages when the TIDs > are read for an index-only scan, and/or skip LP_DEAD hinting for > non-WAL-logged indexes. But all those are speculative future work; > this is a conservative implementation that just didn't modify > pinning where there were any confounding factors. Understood. Still, I'd like to see if we can easily get rid of the pinning altogether. - Heikki
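The GetFakeLSNForUnloggedRel() idea Heikki mentions is the approach GiST already uses for unlogged indexes. Adapted to this case, it would look roughly like the following when a page of a non-WAL-logged index is modified (a sketch only, not code from any posted patch):

    /*
     * Stamp the modified page with a fake LSN when the index is not
     * WAL-logged, so that a later "has this page changed?" LSN comparison
     * still works and the pin could be dropped for unlogged indexes too.
     */
    if (!RelationNeedsWAL(rel))
        PageSetLSN(page, GetFakeLSNForUnloggedRel());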
On Mon, Feb 23, 2015 at 2:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> Robert pointed out that the visibility information >> for an index-only scan wasn't checked while the index page READ >> lock was held, so those scans also still hold the pins. > > Why does an index-only scan need to hold the pin? Suppose there's a dead tuple in the heap someplace, and an index pointer pointing to that dead tuple. An index scan reaches the relevant page and copies all of the index pointers. VACUUM removes the index pointer and marks the heap page all-visible. The scan now uses the index pointers copied to backend-local memory and notes that the heap-page is all-visible, so the scan sees the tuple even though that tuple is totally gone from the heap by that point. Holding the pin prevents this, because we can't reach the second heap pass until the scan is done with the page, and therefore the dead line pointer in the heap page can't be marked unused, and therefore the page can't be marked all-visible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 02/15/2015 02:19 AM, Kevin Grittner wrote:
>> Interestingly, the btree README points out that using the old TID with a new tuple poses no hazard for a scan using an MVCC snapshot, because the new tuple would not be visible to a snapshot created that long ago.
>
> The first question is: Do we really need that interlock for the non-MVCC snapshots either?

We don't need the interlock for non-MVCC snapshots if the conditions we limit or filter on at the index level are rechecked when we get to the heap tuple; otherwise we do.

> If we do: For non-MVCC snapshots, we need to ensure that all index scans that started before the VACUUM did complete before the VACUUM does.

Isn't it enough to be sure that if we're not going to recheck any index tests against the heap tuple that any process-local copies of index entries which exist at the start of the index scan phase of the vacuum are not used after the vacuum enters the phase of freeing line pointers -- that is, any scan which has read a page when the vacuum starts to scan the index moves on to another page before vacuum finishes scanning that index?

> I wonder if we could find a different mechanism to enforce that. Using the pin-interlock for that has always seemed a bit silly to me.

It certainly is abusing the semantics of the pin to treat it as a type of lock on the contents.

> Perhaps grab a new heavy-weight lock on the table whenever a non-MVCC index scan on the table begins, and have VACUUM wait on it.

-1

The problem I'm most interested in fixing based on user feedback is a vacuum blocking when a cursor which is using an index scan is sitting idle. Not only does the vacuum of that table stall, but if it is an autovacuum worker, that worker is prevented from cleaning up any other tables. I have seen all autovacuum workers blocked in exactly this way, leading to massive bloat. What you suggest is just using a less awkward way to keep the problem.

>> I found that the LP_DEAD hinting would be a problem with an old TID, but I figured we could work around that by storing the page LSN into the scan position structure when the page contents were read, and only doing hinting if that matched when we were ready to do the hinting. That wouldn't work for an index which was not WAL-logged, so the patch still holds pins for those.
>
> Or you could use GetFakeLSNForUnloggedRel().

I'm unfamiliar with that, but will take a look. (I will be back at my usual development machine late tomorrow.)

>> Robert pointed out that the visibility information for an index-only scan wasn't checked while the index page READ lock was held, so those scans also still hold the pins.
>
> Why does an index-only scan need to hold the pin?

Robert already answered this one, but there is a possible solution -- provide some way to check the heap's visibility map for the TIDs read from an index page before releasing the read lock on that page.

>> Finally, there was an "optimization" for marking buffer position for possible restore that was incompatible with releasing the pin. I use quotes because the optimization adds overhead to every move to the next page in order set some variables in a structure when a mark is requested instead of running two fairly small memcpy() statements. The two-day benchmark of the customer showed no performance hit, and looking at the code I would be amazed if the optimization yielded a measurable benefit.
>> In general, optimization by adding overhead to moving through a scan to save time in a mark operation seems dubious.
>
> Hmm. Did your test case actually exercise mark/restore? The memcpy()s are not that small. Especially if it's an index-only scan, you're copying a large portion of the page. Some scans call markpos on every tuple, so that could add up.

I have results from the `make world` regression tests and a 48-hour customer test. Unfortunately I don't know how heavily either of those exercise this code. Do you have a suggestion for a way to test whether there is a serious regression for something approaching a "worst case"?

>> At some point we could consider building on this patch to recheck index conditions for heap access when a non-MVCC snapshot is used, check the visibility map for referenced heap pages when the TIDs are read for an index-only scan, and/or skip LP_DEAD hinting for non-WAL-logged indexes. But all those are speculative future work; this is a conservative implementation that just didn't modify pinning where there were any confounding factors.
>
> Understood. Still, I'd like to see if we can easily get rid of the pinning altogether.

That would be great, but I felt that taking care of the easy cases in one patch and following with patches to handle the trickier cases as separate follow-on patches made more sense than trying to do everything at once. Assuming we sort out the mark/restore issues for the initial patch, it provides infrastructure to keep the other cases simpler.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
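Regarding the possible solution mentioned above -- checking the heap's visibility map for the TIDs read from an index page before releasing the read lock -- a rough sketch of that check follows (heapRel and itup are placeholder names; visibilitymap_test() is the API as of the 9.5 development cycle):

    /*
     * While still holding the read lock on the index page, note whether the
     * heap page referenced by this index tuple is all-visible.  Only TIDs
     * whose heap page is not all-visible would need a later visit to the
     * heap, so an index-only scan could then drop the pin for the rest.
     */
    Buffer      vmbuffer = InvalidBuffer;
    BlockNumber heapblk = ItemPointerGetBlockNumber(&itup->t_tid);
    bool        all_visible;

    all_visible = visibilitymap_test(heapRel, heapblk, &vmbuffer);
    if (vmbuffer != InvalidBuffer)
        ReleaseBuffer(vmbuffer);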
Hello, I measured the performance of this patch considering markpos/restorepos. The loss seems to be up to about 10% unfortunately.

At Thu, 26 Feb 2015 17:49:23 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in <440831854.629116.1424972963735.JavaMail.yahoo@mail.yahoo.com>
> Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 02/15/2015 02:19 AM, Kevin Grittner wrote:
> > Hmm. Did your test case actually exercise mark/restore? The memcpy()s
> > are not that small. Especially if it's an index-only scan, you're
> > copying a large portion of the page. Some scans call markpos on every
> > tuple, so that could add up.
>
> I have results from the `make world` regression tests and a 48-hour
> customer test. Unfortunately I don't know how heavily either of those
> exercise this code. Do you have a suggestion for a way to test whether
> there is a serious regression for something approaching a "worst case"?

ammarkpos/amrestrpos are called in merge joins. By the steps shown below, I had 1M times of markpos and no restorepos for 1M result rows, and had 500k times of markpos and the same number of times of restorepos for 2M rows result by a bit different configuration. I suppose we can say that they are the worst case considering markpos/restrpos. The call counts can be taken using the attached patch.

Both binaries were compiled with -O2. shared_buffers=1GB and all shared pages used in the query were hit when measuring. The numbers were taken 12 times for each configuration; I took averages and stddevs of 10 numbers, excluding the best and worst.

Case 1. 500k markpos/restorepos for 2M result rows.

Index scan: The patched version loses about 2% (1.98%).
 master: 6166 ms (std dev = 3.2 ms)
 Patched: 6288 ms (std dev = 3.7 ms)

Index-only scan: The patched version loses about 2% (2.14%).
 master: 5946 ms (std dev = 5.0 ms)
 Patched: 6073 ms (std dev = 3.3 ms)

The patched version is slower by about 2%. Of course not all of it is the effect of the memcpy(), but I don't know the distribution.

Case 2. 1M markpos, no restorepos for 1M result rows.

Index-only scan: The patched version loses about 10%.
 master: 3655 ms (std dev = 2.5 ms)
 Patched: 4038 ms (std dev = 2.6 ms)

The loss is about 10%. The case looks a bit impractical, but unfortunately the number might not be ignorable. The distribution of the loss is unknown, too.
regards,

===========
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
delete from t1;
delete from t2;
-- This causes 1M times of markpos and no restorepos
INSERT INTO t1 (SELECT a FROM generate_series(0, 999999) a);
INSERT INTO t2 (SELECT a FROM generate_series(0, 999999) a);
-- This causes 500k times of markpos and the same number of restorepos
-- INSERT INTO t1 (SELECT a/2 FROM generate_series(0, 999999) a);
-- INSERT INTO t2 (SELECT a/2 FROM generate_series(0, 999999) a);
CREATE INDEX it1 ON t1 (a);
CREATE INDEX it2 ON t2 (a);
ANALYZE t1;
ANALYZE t2;
SET enable_seqscan to false;
SET enable_material to false;
SET enable_hashjoin to false;
SET enable_nestloop to false;
SET enable_indexonlyscan to false; -- omit this to do indexonly scan
EXPLAIN (ANALYZE) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

                                  QUERY PLAN
--------------------------------------------------------------------------
 Merge Join  (cost=2.83..322381.82 rows=3031231 width=8) (actual time=0.013..5193.566 rows=2000000 loops=1)
   Merge Cond: (t1.a = t2.a)
   ->  Index Scan using it1 on t1  (cost=0.43..65681.87 rows=1666667 width=4) (actual time=0.004..727.557 rows=1000000 loops=1)
   ->  Index Scan using it2 on t2  (cost=0.43..214642.89 rows=3031231 width=4) (actual time=0.004..1478.361 rows=19999 loops=1)
 Planning time: 25.585 ms
 Execution time: 6299.521 ms
(6 rows)

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3af462b..04d6cec 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -543,15 +543,19 @@ btendscan(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+
 /*
  * btmarkpos() -- save current scan position
  */
+int hoge_markpos_count = 0;
+int hoge_restrpos_count = 0;
 Datum
 btmarkpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
+	hoge_markpos_count++;
 	/* we aren't holding any read locks, but gotta drop the pin */
 	if (BTScanPosIsValid(so->markPos))
 	{
@@ -585,7 +589,7 @@ btrestrpos(PG_FUNCTION_ARGS)
 {
 	IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-
+	hoge_restrpos_count++;
 	/* Restore the marked positions of any array keys */
 	if (so->numArrayKeys)
 		_bt_restore_array_keys(scan);
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 33b172b..e7c1b99 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -258,14 +258,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
  *
  * ----------------------------------------------------------------
  */
+extern int hoge_markpos_count;
+extern int hoge_restrpos_count;
 void
 ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, long count)
 {
+	hoge_markpos_count = 0;
+	hoge_restrpos_count = 0;
 	if (ExecutorRun_hook)
 		(*ExecutorRun_hook) (queryDesc, direction, count);
 	else
 		standard_ExecutorRun(queryDesc, direction, count);
+	elog(LOG, "markpos=%d, restrpos=%d", hoge_markpos_count, hoge_restrpos_count);
 }
 
 void
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
 Kyotaro> shown below, I had 1M times of markpos and no restorepos for
 Kyotaro> 1M result rows, and had 500k times of markpos and the same
 Kyotaro> number of times of restorepos for 2M rows result by a bit
 Kyotaro> different configuration. I suppose we can say that they are
 Kyotaro> the worst case considering markpos/restrpos. The call counts
 Kyotaro> can be taken using the attached patch.

You might want to try running the same test, but after patching ExecSupportsMarkRestore to return false for index scans. This will cause the planner to insert a Materialize node to handle the mark/restore.

This way, you can get an idea of how much gain (if any) the direct support of mark/restore in the scan is actually providing.

-- Andrew (irc:RhodiumToad)
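For anyone wanting to reproduce that experiment, the change Andrew describes amounts to something like the fragment below in ExecSupportsMarkRestore() (src/backend/executor/execAmi.c); this is a sketch of the idea, not an exact diff against any particular version:

    switch (plantype)
    {
        /* ... cases for other node types left exactly as they are ... */
        case T_IndexScan:
        case T_IndexOnlyScan:
            return false;   /* experiment: force a Materialize node above the scan */
        /* ... */
    }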
Hi,

At Fri, 27 Feb 2015 05:56:18 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <874mq77vuu.fsf@news-spur.riddles.org.uk>
> >>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>
> Kyotaro> ammarkpos/amrestrpos are called in merge joins. By the steps
> Kyotaro> shown below, I had 1M times of markpos and no restorepos for
> Kyotaro> 1M result rows, and had 500k times of markpos and the same
> Kyotaro> number of times of restorepos for 2M rows result by a bit
> Kyotaro> different configuration. I suppose we can say that they are
> Kyotaro> the worst case considering markpos/restrpos. The call counts
> Kyotaro> can be taken using the attached patch.
>
> You might want to try running the same test, but after patching
> ExecSupportsMarkRestore to return false for index scans. This will cause
> the planner to insert a Materialize node to handle the mark/restore.

Mmm? The patch bt-nopin-v1.patch does not seem to contain a change to ExecSupportsMarkRestore, and that very simple function still appears to return true for T_Index(Only)Scan after the patch is applied. EXPLAIN shows that the plan does not involve a materializing step, and in addition the counters I put into both btmarkpos() and btrestrpos() show that they are actually called during execution, like this, for both unpatched and patched:

| LOG: markpos=1000000, restrpos=0
| STATEMENT: EXPLAIN (ANALYZE,BUFFERS) SELECT t1.a, t2.a FROM t1 JOIN t2 on (t1.a = t2.a);

To make sure, I looked into btmarkpos and btrestrpos to confirm that they really do the memcpy() we're talking about, then recompiled the whole source tree.

> This way, you can get an idea of how much gain (if any) the direct
> support of mark/restore in the scan is actually providing.

Does "the scan" mean the T_Material node? But there is no Material node in the plan, and the counters in bt*pos were in fact incremented, as mentioned above. Could you point out any other possible mistake in my thinking?

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 >> You might want to try running the same test, but after patching
 >> ExecSupportsMarkRestore to return false for index scans. This will
 >> cause the planner to insert a Materialize node to handle the
 >> mark/restore.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
 Kyotaro> change to ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still appears to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

Right. I'm suggesting you change that, in order to determine what performance cost, if any, would result from abandoning the idea of doing mark/restore entirely.

-- Andrew (irc:RhodiumToad)
Just a reminder, but I am not the author of this patch:)

At Fri, 27 Feb 2015 07:26:38 +0000, Andrew Gierth <andrew@tao11.riddles.org.uk> wrote in <87zj7z6ckc.fsf@news-spur.riddles.org.uk>
> >>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
>
> >> You might want to try running the same test, but after patching
> >> ExecSupportsMarkRestore to return false for index scans. This will
> >> cause the planner to insert a Materialize node to handle the
> >> mark/restore.
>
> Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
> Kyotaro> change to ExecSupportsMarkRestore, and that very simple
> Kyotaro> function still appears to return true for T_Index(Only)Scan
> Kyotaro> after the patch is applied.
>
> Right. I'm suggesting you change that, in order to determine what
> performance cost, if any, would result from abandoning the idea of doing
> mark/restore entirely.

I understand that you'd like to see the net drag of performance by the memcpy(), right? That doesn't seem to be possible without breaking (a part of) the patch's function, but, concerning the btmarkpos()-only case, the mark struct can contain garbage, so only changing the refcount would be viable to see the pure loss from the memcpy(). The attached patch is the destruction I did, and the result was like below.

> Case 2. 1M markpos, no restorepos for 1M result rows.
>
> Index-only scan: The patched version loses about 10%.
> master: 3655 ms (std dev = 2.5 ms)
> Patched: 4038 ms (std dev = 2.6 ms)

Patched: 4036 ms (std dev = 3.5 ms)
Broken:  3718 ms (std dev = 1.6 ms)

The "Patched" just above is the retaken number. It's a little under a 9% loss from the "broken" version. That is the pure drag of the memcpy(). This seems quite big, as Heikki said. It's an extreme case, though. The remaining 1.7% should be the share of all the other stuff.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 65daf27..4973b63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -547,12 +547,17 @@ btmarkpos(PG_FUNCTION_ARGS)
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
 	hoge_markpos_count++;
-	memcpy(&so->markPos, &so->currPos,
-		   offsetof(BTScanPosData, items[1]) +
-		   so->currPos.lastItem * sizeof(BTScanPosItem));
-	if (so->markTuples)
-		memcpy(so->markTuples, so->currTuples,
-			   so->currPos.nextTupleOffset);
+//	memcpy(&so->markPos, &so->currPos,
+//		   offsetof(BTScanPosData, items[1]) +
+//		   so->currPos.lastItem * sizeof(BTScanPosItem));
+//	if (so->markTuples)
+//		memcpy(so->markTuples, so->currTuples,
+//			   so->currPos.nextTupleOffset);
+	if (BTScanPosIsValid(so->markPos))
+	{
+		ReleaseBuffer(so->markPos.buf);
+		so->markPos.buf = InvalidBuffer;
+	}
 
 	/* We don't take out an extra pin for the mark position. */
 	so->markPos.buf = InvalidBuffer;
@@ -599,12 +604,13 @@ btrestrpos(PG_FUNCTION_ARGS)
 
 	if (BTScanPosIsValid(so->markPos))
 	{
-		memcpy(&so->currPos, &so->markPos,
-			   offsetof(BTScanPosData, items[1]) +
-			   so->markPos.lastItem * sizeof(BTScanPosItem));
-		if (so->currTuples)
-			memcpy(so->currTuples, so->markTuples,
-				   so->markPos.nextTupleOffset);
+//		memcpy(&so->currPos, &so->markPos,
+//			   offsetof(BTScanPosData, items[1]) +
+//			   so->markPos.lastItem * sizeof(BTScanPosItem));
+//		if (so->currTuples)
+//			memcpy(so->currTuples, so->markTuples,
+//				   so->markPos.nextTupleOffset);
+		IncrBufferRefCount(so->markPos.buf);
 	}
 	else
 		BTScanPosInvalidate(so->currPos);
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> Just a reminder, but I am not the author of this patch:)

Yes, I understand that.

 Kyotaro> Mmm? The patch bt-nopin-v1.patch does not seem to contain a
 Kyotaro> change to ExecSupportsMarkRestore, and that very simple
 Kyotaro> function still appears to return true for T_Index(Only)Scan
 Kyotaro> after the patch is applied.

 >> Right. I'm suggesting you change that, in order to determine what
 >> performance cost, if any, would result from abandoning the idea of
 >> doing mark/restore entirely.

 Kyotaro> I understand that you'd like to see the net drag of
 Kyotaro> performance by the memcpy(), right?

No.

What I am suggesting is this: if mark/restore is a performance issue, then it would be useful to know how much gain we're getting (if any) from supporting it _at all_.

Let me try and explain it another way. If you change ExecSupportsMarkRestore to return false for index scans, then btmarkpos/btrestrpos will no longer be called. What is the performance of this case compared to the original and patched versions?

-- Andrew (irc:RhodiumToad)
Hello.

Mmm... We might be focusing on slightly different things..

2015/02/27 17:27 "Andrew Gierth":
> Kyotaro> I understand that you'd like to see the net drag of
> Kyotaro> performance by the memcpy(), right?
>
> No.
>
> What I am suggesting is this: if mark/restore is a performance issue,
> then it would be useful to know how much gain we're getting (if any)
> from supporting it _at all_.

The mark/restore works as before with this patch, except for the stuff for early dropping of buffer pins. As far as I understand it so far, all the stuff in the patch is for that purpose and is not intended to boost btree itself. That is, it reduces the chance of blocking vacuum at the cost of a possible burden on general performance. And I think the current issue in focus is that the burden might be a bit too heavy in specific situations. Therefore I tried to measure how heavy or light the burden is.

Am I correct so far?

> Let me try and explain it another way. If you change
> ExecSupportsMarkRestore to return false for index scans, then
> btmarkpos/btrestrpos will no longer be called. What is the performance
> of this case compared to the original and patched versions?

As you mentioned before, such a change will make the planner turn to, perhaps, a materialize plan or rescan, or some other scan which is of no use for comparison with an index scan concerning the issue in focus: the performance drag when a btree scan does extremely frequent mark/restore, in comparison to the unpatched, smaller-copy version. Your suggestion looks like it intends something different from this.

Anyway I'm sorry but I have left my dev env and cannot have time till next week.

regards,

-- Kyotaro Horiguchi
>>>>> "Kyotaro" == Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:

 Kyotaro> Anyway I'm sorry but I have left my dev env and cannot have
 Kyotaro> time till next week.

The point is moot; rather than try and explain it further I did the test myself, and the performance penalty of disabling mark/restore is substantial, on the order of 30%, so that's a non-starter. (I was a bit surprised that it was so bad...)

-- Andrew (irc:RhodiumToad)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, I measured the performance of this patch considering > markpos/restorepos. The loss seems to be up to about 10% > unfortunately. Thanks for the test case! I took another look at this optimization, and found that it didn't really depend on the pin (as I had first thought), so I put it back (keeping the rest of the patch unchanged). I saw a 1.4% increase in run time with the patch (compared to master) for the mark-only test, and a 1.6% increase in run time for the mark/restore test. I'll look into what might be causing that. I have seen bigger differences just due to changing where executable code crossed cache boundaries, but this is big enough to worry about; it needs to be checked. New patch attached. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:
> [need to check performance more]

It looks like the remaining performance regression was indeed a result of code alignment. I found two "paranoia" assignments I had accidentally failed to put back with the rest of the mark/restore optimization; after that trivial change the patched version is ever-so-slightly faster than master on all tests.

Average of 3 `make check-world` runs:
 master: 336.288 seconds
 patch: 332.237 seconds

Average of 6 `make check` runs:
 master: 25.409 seconds
 patch: 25.270 seconds

The following were all run 12 times, the worst 2 dropped, the rest averaged:

Kyotaro-san's 1m mark "worst case" benchmark:
 master: 962.581 ms, 6.765 stdev
 patch: 947.518 ms, 3.584 stdev

Kyotaro-san's 500k mark, 500k restore "worst case" benchmark:
 master: 1366.639 ms, 5.314 stdev
 patch: 1363.844 ms, 5.896 stdev

pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench
 master: 265.011 tps, 4.952 stdev
 patch: 267.164 tps, 9.094 stdev

How do people feel about the idea of me pushing this for 9.5 (after I clean up all the affected comments and README files)? I know this first appeared in the last CF, but the footprint is fairly small and the only user-visible behavior change is that a btree index scan of a WAL-logged table using an MVCC snapshot no longer blocks a vacuum indefinitely. (If there are objections I will move this to the first CF for the next release.)

 src/backend/access/nbtree/nbtree.c | 50 +++++-------
 src/backend/access/nbtree/nbtsearch.c | 141 +++++++++++++++++++++++++---------
 src/backend/access/nbtree/nbtutils.c | 58 ++++++++----
 src/include/access/nbtree.h | 36 ++++++++-
 4 files changed, 201 insertions(+), 84 deletions(-)

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello,

> It looks like the remaining performance regression was indeed a
> result of code alignment. I found two "paranoia" assignments I had
> accidentally failed to put back with the rest of the mark/restore
> optimization; after that trivial change the patched version is
> ever-so-slightly faster than master on all tests.

The performance which your test shows looks great. The gain might comes from removing of buffer locking on _bt_next. I also would like to see this to come along with 9.5 but it is the matter for committers.

Apart from it, I looked into the patch itself more closely. Please reconcile as you like among contradicting comments below:)

In nbtutils.c, _bt_killitems:

- This is not in charge of this patch, but the comment for _bt_killitems looks to have a trivial typo.

> * _bt_killitems - set LP_DEAD state for items an indexscan caller has
> * told us were killed
> *
> * scan->so contains information about the current page and killed tuples
> * thereon (generally, this should only be called if so->numKilled > 0).

I suppose "scan->so" should be "scan->opaque" and "so->numKilled" be "opaque->numKilled".

- The following comment looks somewhat wrong.

> /* Since the else condition needs page, get it here, too. */

"the else condition needs page" means "the page of the buffer is needed later"? But, I suppose the comment itself is not necessary.

- If (BTScanPosIsPinned(so->currPos)).

As I mention below for nbtsearch.c, the page aquired in the if-then block may be vacuumed so the LSN check done in the if-else block is need also in the if-then block. It will be accomplished only by changing the position of the end of the if-else block.

- _bt_killitems is called without pin when rescanning from before, so I think the previous code has already considered the unpinned case ("if (ItemPointerEquals(&ituple..." does that). Vacuum rearranges line pointers after deleting LP_DEAD items so the previous check seems enough for the purpose. The previous code is more effeicient becuase the mathing items will be processed even after vacuum.

In nbtsearch.c

- _bt_first(): It now releases the share lock on the page before the items to be killed is processed. This allows other backends to insert items into the page simultaneously. It seems to be dangerous alone, but _bt_killitems already considers the case. Anyway I think It'll be better to add a comment mentioning unlocking is not dangerous.

... Sorry, time's up for now.

regards,

> Average of 3 `make check-world` runs:
> master: 336.288 seconds
> patch: 332.237 seconds
>
> Average of 6 `make check` runs:
> master: 25.409 seconds
> patch: 25.270 seconds
>
> The following were all run 12 times, the worst 2 dropped, the rest
> averaged:
>
> Kyotaro-san's 1m mark "worst case" benchmark:
> master: 962.581 ms, 6.765 stdev
> patch: 947.518 ms, 3.584 stdev
>
> Kyotaro-san's 500k mark, 500k restore "worst case" benchmark:
> master: 1366.639 ms, 5.314 stdev
> patch: 1363.844 ms, 5.896 stdev
>
> pgbench -i -s 16 pgbench / pgbench -c 16 -j 4 -T 500 pgbench
> master: 265.011 tps, 4.952 stdev
> patch: 267.164 tps, 9.094 stdev
>
> How do people feel about the idea of me pushing this for 9.5 (after
> I clean up all the affected comments and README files)? I know
> this first appeared in the last CF, but the footprint is fairly
> small and the only user-visible behavior change is that a btree
> index scan of a WAL-logged table using an MVCC snapshot no longer
> blocks a vacuum indefinitely. (If there are objections I will move
> this to the first CF for the next release.)
>
> src/backend/access/nbtree/nbtree.c | 50 +++++-------
> src/backend/access/nbtree/nbtsearch.c | 141 +++++++++++++++++++++++++---------
> src/backend/access/nbtree/nbtutils.c | 58 ++++++++----
> src/include/access/nbtree.h | 36 ++++++++-
> 4 files changed, 201 insertions(+), 84 deletions(-)

-- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > The performance which your test shows looks great. The gain might > comes from removing of buffer locking on _bt_next. Yeah, I had been hoping that removing some buffer pins and locks from the common path of scanning forward from one page to the next might have some direct performance benefit; it's possible that this will show up more dramatically at high concurrency levels. The main goal, of course, was to improve performance more indirectly, by not having autovacuum worker processes blocked for extended periods in some situations where that can now happen. > In nbtutils.c, _bt_killitems: > > - This is not in charge of this patch, but the comment for > _bt_killitems looks to have a trivial typo. As you say, that problem was already there, but no reason not to fix it in the patch as long as we're here. Reworded. > - The following comment looks somewhat wrong. > > > /* Since the else condition needs page, get it here, too. */ > > "the else condition needs page" means "the page of the buffer > is needed later"? But, I suppose the comment itself is not > necessary. I guess that comment isn't really worth the space it takes up; what it says is pretty obvious from the code. Removed. > - If (BTScanPosIsPinned(so->currPos)). > > As I mention below for nbtsearch.c, the page aquired in the > if-then block may be vacuumed so the LSN check done in the > if-else block is need also in the if-then block. It will be > accomplished only by changing the position of the end of the > if-else block. I'm not sure I agree on this. If the page is pinned it should have been pinned continuously since we initially read it, so the line pointers we have could not have been removed by any vacuum process. The tuples may have been pruned away in the heap, but that doesn't matter. Index entries may have been added and the index page may have been split, but that was true before this patch and _bt_killitems will deal with those things the same as it always has. I don't see any benefit to doing the LSN compare in this case; if we've paid the costs of holding the pin through to this point, we might as well flag any dead entries we can. > - _bt_killitems is called without pin when rescanning from > before, so I think the previous code has already considered the > unpinned case ("if (ItemPointerEquals(&ituple..." does > that). Vacuum rearranges line pointers after deleting LP_DEAD > items so the previous check seems enough for the purpose. The > previous code is more effeicient becuase the mathing items will > be processed even after vacuum. I'm not following you on this one; could you rephrase it? > In nbtsearch.c > > - _bt_first(): It now releases the share lock on the page before > the items to be killed is processed. This allows other backends > to insert items into the page simultaneously. It seems to be > dangerous alone, but _bt_killitems already considers the > case. Anyway I think It'll be better to add a comment > mentioning unlocking is not dangerous. Well, _bt_killitems does acquire a shared (read) lock to flag index tuples as LP_DEAD, and it has always been OK for the index page to accept inserts and allow page splits before calling _bt_killitems. I don't really see anything new with the patch in this regard. I do agree, though, that a lot of comments and at least two README files refer to the pins blocking vacuum, and these must be found and changed. 
It doesn't seem worth posting to the list for the small changes since the last version; I'll wait until I update the comments and README files. If you want review or test the latest, you can peek at: https://github.com/kgrittn/postgres/tree/btnopin -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > It doesn't seem worth posting to the list for the small changes > since the last version; I'll wait until I update the comments and > README files. If you want review or test the latest, you can > peek at: > > https://github.com/kgrittn/postgres/tree/btnopin Here is v3, with the promised README and code comment changes. In the process of documenting the mark/restore area I noticed a subtlety that I had missed (in the case that there was a mark, advance to the next page, restore, advance within the page, and restore). I fixed that, and in the process gave the patched code an additional direct performance edge over unpatched code. For the 1000k marks, average timings are now: master: 970.999 ms, stdev: 4.043 patched: 940.460 ms, stdev: 4.874 So, in the micro-benchmark showing the biggest benefit the direct improvement is now just over 3%. It remains the case that the primary motivation for the patch is to reduce blocking of vacuum processes; but that's a nice side-benefit. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, I found no other problem including the performance issue in the patch attached to the last mail as far as I can understand. So I'll mark this as ready for commit after a few days with no objection after this comments is addressed.

> > - If (BTScanPosIsPinned(so->currPos)).
> >
> > As I mention below for nbtsearch.c, the page aquired in the
> > if-then block may be vacuumed so the LSN check done in the
> > if-else block is need also in the if-then block. It will be
> > accomplished only by changing the position of the end of the
> > if-else block.
>
> I'm not sure I agree on this.

Sorry, it is largely because of my poor composition.

> If the page is pinned it should have
> been pinned continuously since we initially read it, so the line
> pointers we have could not have been removed by any vacuum process.
> The tuples may have been pruned away in the heap, but that doesn't
> matter. Index entries may have been added and the index page may
> have been split, but that was true before this patch and
> _bt_killitems will deal with those things the same as it always
> has.

Yes. I think so.

> I don't see any benefit to doing the LSN compare in this
> case; if we've paid the costs of holding the pin through to this
> point, we might as well flag any dead entries we can.

I thought of the case that the buffer has been pinned by another backend after this backend unpinned it. I looked again the definition of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and understood that the pin should be mine if BTScanPosIsPinned.

Would you mind rewriting the comment there like this?

- /* The buffer is still pinned, but not locked. Lock it now. */
+ /* I still hold the pin on the buffer, but not locked. Lock it now. */
| LockBuffer(so->currPos.buf, BT_READ);

Or would you mind renaming the macro as "BTScanPosIsPinnedByMe" or something like, or anyway to notify the reader that the pin should be mine there?

> > - _bt_killitems is called without pin when rescanning from
> > before, so I think the previous code has already considered the
> > unpinned case ("if (ItemPointerEquals(&ituple..." does
> > that). Vacuum rearranges line pointers after deleting LP_DEAD
> > items so the previous check seems enough for the purpose. The
> > previous code is more effeicient becuase the mathing items will
> > be processed even after vacuum.
>
> I'm not following you on this one; could you rephrase it?

Sorry, I read btrescan incorrectly that it calls _bt_killitems() *after* releasing the buffer. Please forget it.

Finally, I'd like to timidly comment on this..

+ To handle the cases where it is safe to release the pin after
+ reading the index page, the LSN of the index page is read along
+ with the index entries before the shared lock is released, and we
+ do not attempt to flag index tuples as dead if the page is not
+ pinned and the LSN does not match.

I suppose that the sentence like following is more directly describing about the mechanism and easier to find the correnponsing code, if it is correct.

> To handle the cases where a index page has unpinned when
> trying to mark the unused index tuples on the page as dead,
> the LSN of the index page is remembered when reading the index
> page for index tuples, and we do not attempt to flag index
> tuples as dead if the page is not pinned and the LSN does not
> match.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > I found no other problem including the performance issue in the > patch attached to the last mail as far as I can understand. So > I'll mark this as ready for commit after a few days with no > objection after this comments is addressed. Thanks for the reviews! >> I don't see any benefit to doing the LSN compare in this >> case; if we've paid the costs of holding the pin through to this >> point, we might as well flag any dead entries we can. > > I thought of the case that the buffer has been pinned by another > backend after this backend unpinned it. I looked again the > definition of BTScanPosIsPinned and BTScanPosUnpinIfPinned, and > understood that the pin should be mine if BTScanPosIsPinned. > > Would you mind rewriting the comment there like this? > > - /* The buffer is still pinned, but not locked. Lock it now. */ > + /* I still hold the pin on the buffer, but not locked. Lock it now. */ > | LockBuffer(so->currPos.buf, BT_READ); > > Or would you mind renaming the macro as "BTScanPosIsPinnedByMe" > or something like, or anyway to notify the reader that the pin > should be mine there? I see your point, although those first person singular pronouns used like that make me a little uncomfortable; I'll change the comment and/or macro name, but I'll work on the name some more. > Finally, I'd like to timidly comment on this.. > > + To handle the cases where it is safe to release the pin after > + reading the index page, the LSN of the index page is read along > + with the index entries before the shared lock is released, and we > + do not attempt to flag index tuples as dead if the page is not > + pinned and the LSN does not match. > > I suppose that the sentence like following is more directly > describing about the mechanism and easier to find the > correnponsing code, if it is correct. > >> To handle the cases where a index page has unpinned when >> trying to mark the unused index tuples on the page as dead, >> the LSN of the index page is remembered when reading the index >> page for index tuples, and we do not attempt to flag index >> tuples as dead if the page is not pinned and the LSN does not >> match. Will reword that part to try to make it clearer. Thanks! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Would you mind rewriting the comment there like this?
>>
>> - /* The buffer is still pinned, but not locked. Lock it now. */
>> + /* I still hold the pin on the buffer, but not locked. Lock it now. */
>>
>> Or would you mind renaming the macro as "BTScanPosIsPinnedByMe"
>> or something like, or anyway to notify the reader that the pin
>> should be mine there?
>
> I see your point, although those first person singular pronouns
> used like that make me a little uncomfortable; I'll change the
> comment and/or macro name, but I'll work on the name some more.

After thinking it over, I think that the "BTScanPos" part of the macro name is enough of a hint that it is concerned with the actions of this scan; it is the comment that needs the change. I went with:

/*
 * We have held the pin on this page since we read the index tuples,
 * so all we need to do is lock it. The pin will have prevented
 * re-use of any TID on the page, so there is no need to check the
 * LSN.
 */

>> + To handle the cases where it is safe to release the pin after
>> + reading the index page, the LSN of the index page is read along
>> + with the index entries before the shared lock is released, and we
>> + do not attempt to flag index tuples as dead if the page is not
>> + pinned and the LSN does not match.
>>
>> I suppose that the sentence like following is more directly
>> describing about the mechanism and easier to find the
>> correnponsing code, if it is correct.
>>
>>> To handle the cases where a index page has unpinned when
>>> trying to mark the unused index tuples on the page as dead,
>>> the LSN of the index page is remembered when reading the index
>>> page for index tuples, and we do not attempt to flag index
>>> tuples as dead if the page is not pinned and the LSN does not
>>> match.
>
> Will reword that part to try to make it clearer.

That sentence was full of "passive voice", which didn't help any. I changed it to:

| So that we can handle the cases where we attempt LP_DEAD flagging
| for a page after we have released its pin, we remember the LSN of
| the index page when we read the index tuples from it; we do not
| attempt to flag index tuples as dead if we didn't hold the
| pin the entire time and the LSN has changed.

Do these changes seem clear?

Because these changes are just to a comment and a README file, I'm posting a patch-on-patch v3a (to be applied on top of v3).

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > Because these changes are just to a comment and a README file, > I'm posting a patch-on-patch v3a (to be applied on top of v3). Here is some additional comment work that I hope will make things easier to understand for whoever next visits this code. There is also a minor reorganization of some code that should not have any impact except to skip the restore memcpy() calls where they are not needed in a corner case that I'm not sure we can currently even reach. That reorganization is intended more for code clarity than for the marginal performance it could provide. I'm posting this as v3b to be applied on top of v3 and v3a, so that these changes can be reviewed and commented on separately. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > At some point we could consider building on this patch to recheck > index conditions for heap access when a non-MVCC snapshot is used, > check the visibility map for referenced heap pages when the TIDs > are read for an index-only scan, and/or skip LP_DEAD hinting for > non-WAL-logged indexes. But all those are speculative future work; > this is a conservative implementation that just didn't modify > pinning where there were any confounding factors. I don't have the bandwidth for a full review, but I took a quick look at this. I think you should call out those "confounding factors" in the README. It's not hard to piece them together from _bt_drop_lock_and_maybe_pin(), but I think you should anyway. Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree LockBuffer() callers do. You're inconsistent about that in V3. -- Peter Geoghegan
Hello,

At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote in <CAM3SWZQ5nwTB-y4ZOj=5ckMLce5GAEUnjKJ2=M1vMHfX_aYmCg@mail.gmail.com>
> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> > At some point we could consider building on this patch to recheck
> > index conditions for heap access when a non-MVCC snapshot is used,
> > check the visibility map for referenced heap pages when the TIDs
> > are read for an index-only scan, and/or skip LP_DEAD hinting for
> > non-WAL-logged indexes. But all those are speculative future work;
> > this is a conservative implementation that just didn't modify
> > pinning where there were any confounding factors.
>
> I don't have the bandwidth for a full review, but I took a quick look at this.
>
> I think you should call out those "confounding factors" in the README.
> It's not hard to piece them together from
> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.
>
> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing nbtree
> LockBuffer() callers do. You're inconsistent about that in V3.

I agree with you. It looks like the only point where it is used.

In addition to that, the comment just above the point mentioned above puzzles me.

> /* XXX: Can walking left be lighter on the locking and pins? */
> if (BTScanPosIsPinned(so->currPos))
>     LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
> else
>     so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);

I would be happier if I could read the meaning of the comment more clearly. I understand it to say that you want to remove the locking (and pinning), but can't do so now because simultaneous splitting of the left page would break something. I'd like to see it stated more clearly, whether or not my reading is correct.

regards,

-- Kyotaro Horiguchi NTT Open Source Software Center
On 4 March 2015 at 03:16, Kevin Grittner <kgrittn@ymail.com> wrote:
> How do people feel about the idea of me pushing this for 9.5 (after
> I clean up all the affected comments and README files)? I know
> this first appeared in the last CF, but the footprint is fairly
> small and the only user-visible behavior change is that a btree
> index scan of a WAL-logged table using an MVCC snapshot no longer
> blocks a vacuum indefinitely. (If there are objections I will move
> this to the first CF for the next release.)

It helps Hot Standby also, BTW. I proposed this previously, but it was shot down, so I'm glad to see it happening.

I'm not sure why you have proposed only half the solution though? Hopefully we aren't just submitting the difficult half that you needed feedback on? Let's see the whole "snapshot too old" patch as well, since that affects core PostgreSQL also.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote:
> What they wanted was what happened in the other database product --
> if a snapshot got sufficiently old that cleaning up the MVCC data
> was a problem *and* the snapshot was used again *and* it read a
> page which had been modified far enough back that it was not
> possible to return correct data, then they wanted to receive a
> "snapshot too old" error. I wrote a patch to do that...

So, please let's see the patch. It seems useful for core Postgres.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >> What they wanted was what happened in the other database product -- >> if a snapshot got sufficiently old that cleaning up the MVCC data >> was a problem *and* the snapshot was used again *and* it read a >> page which had been modified far enough back that it was not >> possible to return correct data, then they wanted to receive a >> "snapshot too old" error. I wrote a patch to do that... > > So, please lets see the patch. It seems useful for core Postgres. It was submitted here: http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote: >> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >>> At some point we could consider building on this patch to >>> recheck index conditions for heap access when a non-MVCC >>> snapshot is used, check the visibility map for referenced heap >>> pages when the TIDs are read for an index-only scan, and/or >>> skip LP_DEAD hinting for non-WAL-logged indexes. But all those >>> are speculative future work; this is a conservative >>> implementation that just didn't modify pinning where there were >>> any confounding factors. >> I think you should call out those "confounding factors" in the >> README. It's not hard to piece them together from >> _bt_drop_lock_and_maybe_pin(), but I think you should anyway. OK: https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e >> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing >> nbtree LockBuffer() callers do. You're inconsistent about that >> in V3. > > I agree with you. It looks the only point where it is used. OK: https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea > Addition to that, the commnet just above the point methioned > above quizes me. > >> /* XXX: Can walking left be lighter on the locking and pins? */ >> if (BTScanPosIsPinned(so->currPos)) >> LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE); >> else >> so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); > > I'm happy if I could read the meaming of the comment more > clearly. I understand that it says that you want to remove the > locking (and pinning), but can't now because the simultaneous > splitting of the left page would break something. I'd like to see > it clearer even for me either I am correct or not.. Does this clear it up?: https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b Since there are no changes that would affect the compiled code here, I'm not posting a new patch yet. I'll do that once things seem to have settled down a bit more. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >>> What they wanted was what happened in the other database >>> product -- if a snapshot got sufficiently old that cleaning up >>> the MVCC data was a problem *and* the snapshot was used again >>> *and* it read a page which had been modified far enough back >>> that it was not possible to return correct data, then they >>> wanted to receive a "snapshot too old" error. I wrote a patch >>> to do that... >> >> So, please lets see the patch. It seems useful for core >> Postgres. > > It was submitted here: > > http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com The feedback was generally fairly positive except for the fact that snapshot "age" (for purposes of being too old) was measured in transaction IDs assigned. There seemed to be a pretty universal feeling that this needed to be changed to a time-based setting. I've been working on that, although my initial efforts proved to be a dead-end. I've had some off-list discussions with Andrew Dunstan and we have agreed on something that looks like it should work well, but it's still being coded. The other thing that is needed is to sprinkle the other index AMs with TestForOldSnapshot() calls. In the btree code the right spots for those were all following BufferGetPage() calls, so the work was in seeing where each of those could be called from, to determine whether it was a spot where the check was needed. Of course, docs and comment changes are also needed, but that's probably only a day or two's effort. Further discussion of the "snapshot too old" patch should probably go on its thread. The spillover is largely my fault for referencing one from the other. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
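For readers who want to picture what that "sprinkling" looks like: in the btree code each check sits immediately after a BufferGetPage() call, before any data from the page is used. Below is a minimal sketch of one plausible shape for such a call site. The body of TestForOldSnapshot(), the snapshot fields it reads, the SnapshotIsPastThreshold() helper, and the error code are assumptions made here purely for illustration; the authoritative definition is whatever the "snapshot too old" patch on its own thread ends up with.

    /*
     * Illustrative sketch only -- not the submitted patch.  It lives in
     * backend code (postgres.h, storage/bufpage.h, utils/rel.h and
     * utils/tqual.h assumed included) and assumes the snapshot remembers
     * the insert LSN, plus some notion of age, from when it was taken.
     */
    static inline void
    TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page)
    {
        if (IsMVCCSnapshot(snapshot) &&
            RelationNeedsWAL(rel) &&
            PageGetLSN(page) > snapshot->lsn &&      /* assumed field */
            SnapshotIsPastThreshold(snapshot))       /* hypothetical age test */
            ereport(ERROR,
                    (errcode(ERRCODE_SNAPSHOT_TOO_OLD),  /* assumed error code */
                     errmsg("snapshot too old")));
    }

    /* The pattern at each page read in an index AM would then be: */
    page = BufferGetPage(so->currPos.buf);
    TestForOldSnapshot(scan->xs_snapshot, scan->indexRelation, page);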
On 13 March 2015 at 13:17, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 13, 2015 at 8:52 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 15 February 2015 at 00:19, Kevin Grittner <kgrittn@ymail.com> wrote: >>> What they wanted was what happened in the other database product -- >>> if a snapshot got sufficiently old that cleaning up the MVCC data >>> was a problem *and* the snapshot was used again *and* it read a >>> page which had been modified far enough back that it was not >>> possible to return correct data, then they wanted to receive a >>> "snapshot too old" error. I wrote a patch to do that... >> >> So, please lets see the patch. It seems useful for core Postgres. > > It was submitted here: > > http://www.postgresql.org/message-id/136937748.3364317.1423964815320.JavaMail.yahoo@mail.yahoo.com Thanks. I have +1'd that patch. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: > The feedback was generally fairly positive except for the fact that > snapshot "age" (for purposes of being too old) was measured in > transaction IDs assigned. There seemed to be a pretty universal > feeling that this needed to be changed to a time-based setting. -1 for a time based setting. After years of consideration, bloat is now controllable by altering the size of the undo tablespace. I think PostgreSQL needs something size-based also. It would need some estimation to get it to work like that, true, but it is actually the size of the bloat we care about, not the time. So we should be thinking in terms of limits that we actually care about. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Thank you for rewriting. I am satisfied with the current patch (the head of the btnopin branch in your github repo). I have no further comments on the functions. Peter might have a comment on the README description, but I suppose I can pass this on to the committers. I have attached your latest patch to this mail as bt-nopin-v4.patch for now. Please check that there's no problem with it.
- With this patch, an index scan releases buffer pins while fetching index tuples from a page, so it should reduce the chance of a long-running index scan blocking vacuum, because vacuum can now easily overtake the current position of an index scan. I didn't actually measure how effective it is, though.
- It looks to work correctly for scans in both directions with simultaneous page splitting and vacuum.
- It causes no performance deterioration; on the contrary, it accelerates index scans. That seems to be because of the removal of the lock and unlock surrounding _bt_steppage in _bt_next.
- It applies cleanly on the current head of the master branch.
- It has sufficiently intelligible comments and README descriptions.
- An unrelated typo fix is made in storage/buffer/README by this patch, but it doesn't seem necessary to split it out.
regards, At Fri, 13 Mar 2015 15:08:25 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in <848652045.4044965.1426259305233.JavaMail.yahoo@mail.yahoo.com> > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote: > >> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote: > >> I think you should call out those "confounding factors" in the > >> README. It's not hard to piece them together from > >> _bt_drop_lock_and_maybe_pin(), but I think you should anyway. > > OK: > > https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e It looks perfect to me. Do you have any further requests on this, Peter? > >> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing > >> nbtree LockBuffer() callers do. You're inconsistent about that > >> in V3. > > > > I agree with you. It looks the only point where it is used. > > OK: > > https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea > > > Addition to that, the commnet just above the point methioned > > above quizes me. > > > >> /* XXX: Can walking left be lighter on the locking and pins? */ > >> if (BTScanPosIsPinned(so->currPos)) > >> LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE); > >> else > >> so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); > > > > I'm happy if I could read the meaming of the comment more > > clearly. I understand that it says that you want to remove the > > locking (and pinning), but can't now because the simultaneous > > splitting of the left page would break something. I'd like to see > > it clearer even for me either I am correct or not.. > > Does this clear it up?: > > https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b Thank you for the effort. It also looks perfect. > Since there are no changes that would affect the compiled code > here, I'm not posting a new patch yet. I'll do that once things > seem to have settled down a bit more.
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README index 213542c..c904d2f 100644 --- a/src/backend/access/nbtree/README +++ b/src/backend/access/nbtree/README @@ -92,17 +92,18 @@ To minimize lock/unlock traffic, an index scan always searches a leaf pageto identify all the matchingitems at once, copying their heap tuple IDsinto backend-local storage. The heap tuple IDs are then processed whilenotholding any page lock within the index. We do continue to hold a pin -on the leaf page, to protect against concurrent deletions (see below). -In this state the scan is effectively stopped "between" pages, either -before or after the page it has pinned. This is safe in the presence of -concurrent insertions and even page splits, because items are never moved -across pre-existing page boundaries --- so the scan cannot miss any items -it should have seen, nor accidentally return the same item twice. The scan -must remember the page's right-link at the time it was scanned, since that -is the page to move right to; if we move right to the current right-link -then we'd re-scan any items moved by a page split. We don't similarly -remember the left-link, since it's best to use the most up-to-date -left-link when trying to move left (see detailed move-left algorithm below). +on the leaf page in some circumstances, to protect against concurrent +deletions (see below). In this state the scan is effectively stopped +"between" pages, either before or after the page it has pinned. This is +safe in the presence of concurrent insertions and even page splits, because +items are never moved across pre-existing page boundaries --- so the scan +cannot miss any items it should have seen, nor accidentally return the same +item twice. The scan must remember the page's right-link at the time it +was scanned, since that is the page to move right to; if we move right to +the current right-link then we'd re-scan any items moved by a page split. +We don't similarly remember the left-link, since it's best to use the most +up-to-date left-link when trying to move left (see detailed move-left +algorithm below).In most cases we release our lock and pin on a page before attemptingto acquire pin and lock on the pagewe are moving to. In a few places @@ -154,25 +155,37 @@ starts. This is not necessary for correctness in terms of the btree indexoperations themselves; asexplained above, index scans logically stop"between" pages and so can't lose their place. The reason we do it is toprovidean interlock between non-full VACUUM and indexscans. Since VACUUM -deletes index entries before deleting tuples, the super-exclusive lock -guarantees that VACUUM can't delete any heap tuple that an indexscanning -process might be about to visit. (This guarantee works only for simple -indexscans that visit the heap in sync with the index scan, not for bitmap -scans. We only need the guarantee when using non-MVCC snapshot rules; in -an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with -an unrelated tuple at the same TID, because the new tuple wouldn't be -visible to our scan anyway.) - -Because a page can be split even while someone holds a pin on it, it is -possible that an indexscan will return items that are no longer stored on -the page it has a pin on, but rather somewhere to the right of that page. 
-To ensure that VACUUM can't prematurely remove such heap tuples, we require -btbulkdelete to obtain super-exclusive lock on every leaf page in the index, -even pages that don't contain any deletable tuples. This guarantees that -the btbulkdelete call cannot return while any indexscan is still holding -a copy of a deleted index tuple. Note that this requirement does not say -that btbulkdelete must visit the pages in any particular order. (See also -on-the-fly deletion, below.) +deletes index entries before reclaiming heap tuple line pointers, the +super-exclusive lock guarantees that VACUUM can't reclaim for re-use a +line pointer that an indexscanning process might be about to visit. This +guarantee works only for simple indexscans that visit the heap in sync +with the index scan, not for bitmap scans. We only need the guarantee +when using non-MVCC snapshot rules; when using an MVCC snapshot, it +doesn't matter if the heap tuple is replaced with an unrelated tuple at +the same TID, because the new tuple won't be visible to our scan anyway. +Therefore, a scan using an MVCC snapshot which has no other confounding +factors will not hold the pin after the page contents are read. The +current reasons for exceptions, where a pin is still needed, are if the +index is not WAL-logged or if the scan is an index-only scan. If later +work allows the pin to be dropped for all cases we will be able to +simplify the vacuum code, since the concept of a super-exclusive lock +for btree indexes will no longer be needed. + +Because a pin is not always held, and a page can be split even while +someone does hold a pin on it, it is possible that an indexscan will +return items that are no longer stored on the page it has a pin on, but +rather somewhere to the right of that page. To ensure that VACUUM can't +prematurely remove such heap tuples, we require btbulkdelete to obtain a +super-exclusive lock on every leaf page in the index, even pages that +don't contain any deletable tuples. Any scan which could yield incorrect +results if the tuple at a TID matching the the scan's range and filter +conditions were replaced by a different tuple while the scan is in +progress must hold the pin on each index page until all index entries read +from the page have been processed. This guarantees that the btbulkdelete +call cannot return while any indexscan is still holding a copy of a +deleted index tuple if the scan could be confused by that. Note that this +requirement does not say that btbulkdelete must visit the pages in any +particular order. (See also on-the-fly deletion, below.)There is no such interlocking for deletion of items in internalpages,since backends keep no lock nor pin on a page they have descended past. @@ -396,8 +409,12 @@ that this breaks the interlock between VACUUM and indexscans, but that isnot so: as long as an indexscanningprocess has a pin on the page wherethe index item used to be, VACUUM cannot complete its btbulkdelete scanandso cannot remove the heap tuple. This is another reason why -btbulkdelete has to get super-exclusive lock on every leaf page, not only -the ones where it actually sees items to delete. +btbulkdelete has to get a super-exclusive lock on every leaf page, not +only the ones where it actually sees items to delete. 
So that we can +handle the cases where we attempt LP_DEAD flagging for a page after we +have released its pin, we remember the LSN of the index page when we read +the index tuples from it; we do not attempt to flag index tuples as dead +if the we didn't hold the pin the entire time and the LSN has changed.WAL Considerations------------------ @@ -462,7 +479,7 @@ metapage update (of the "fast root" link) is performed and logged as partof the insertion into the parentlevel. When splitting the root page, themetapage update is handled as part of the "new root" action. -Each step in page deletion are logged as separate WAL entries: marking the +Each step in page deletion is logged as a separate WAL entry: marking theleaf as half-dead and removing the downlink isone record, and unlinking apage is a second record. If vacuum is interrupted for some reason, or thesystem crashes, thetree is consistent for searches and insertions. The diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 932c6f7..ef68a71 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -498,9 +498,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel, * If there's not enough roomin the space, we try to make room by * removing any LP_DEAD tuples. * - * On entry, *buf and *offsetptr point to the first legal position + * On entry, *bufptr and *offsetptr point to the first legal position * where the new tuple could be inserted. The caller should hold an - * exclusive lock on *buf. *offsetptr can also be set to + * exclusive lock on *bufptr. *offsetptr can also be set to * InvalidOffsetNumber, in which case the functionwill search for the * right location within the page if needed. On exit, they point to the * choseninsert location. If _bt_findinsertloc decides to move right, diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 3af462b..9a6dcdd 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -404,7 +404,8 @@ btbeginscan(PG_FUNCTION_ARGS) /* allocate private workspace */ so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); - so->currPos.buf = so->markPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); + BTScanPosInvalidate(so->markPos); if (scan->numberOfKeys > 0) so->keyData = (ScanKey) palloc(scan->numberOfKeys* sizeof(ScanKeyData)); else @@ -424,8 +425,6 @@ btbeginscan(PG_FUNCTION_ARGS) * scan->xs_itupdesc whether we'll need it or not, since that's so cheap. 
*/ so->currTuples = so->markTuples = NULL; - so->currPos.nextTupleOffset = 0; - so->markPos.nextTupleOffset = 0; scan->xs_itupdesc = RelationGetDescr(rel); @@ -451,17 +450,14 @@ btrescan(PG_FUNCTION_ARGS) { /* Before leaving current page, deal with any killed items */ if (so->numKilled > 0) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); + BTScanPosInvalidate(so->currPos); } - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } so->markItemIndex = -1; + BTScanPosUnpinIfPinned(so->markPos); + BTScanPosInvalidate(so->markPos); /* * Allocate tuple workspace arrays, if needed for an index-only scan and @@ -515,17 +511,14 @@ btendscan(PG_FUNCTION_ARGS) { /* Before leaving current page, deal with any killed items*/ if (so->numKilled > 0) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); } - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } so->markItemIndex = -1; + BTScanPosUnpinIfPinned(so->markPos); + + /* No need to invalidate positions, the RAM is about to be freed. */ /* Release storage */ if (so->keyData !=NULL) @@ -552,12 +545,8 @@ btmarkpos(PG_FUNCTION_ARGS) IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0); BTScanOpaqueso = (BTScanOpaque) scan->opaque; - /* we aren't holding any read locks, but gotta drop the pin */ - if (BTScanPosIsValid(so->markPos)) - { - ReleaseBuffer(so->markPos.buf); - so->markPos.buf = InvalidBuffer; - } + /* There may be an old mark with a pin (but no lock). */ + BTScanPosUnpinIfPinned(so->markPos); /* * Just record the current itemIndex. If we later step to next page @@ -568,7 +557,10 @@ btmarkpos(PG_FUNCTION_ARGS) if (BTScanPosIsValid(so->currPos)) so->markItemIndex = so->currPos.itemIndex; else + { + BTScanPosInvalidate(so->markPos); so->markItemIndex = -1; + } /* Also record the current positions of any array keys */ if (so->numArrayKeys) @@ -593,28 +585,54 @@ btrestrpos(PG_FUNCTION_ARGS) if (so->markItemIndex >= 0) { /* - * The mark position is on the same page we are currently on. Just + * The scan has never moved to a new page since the last mark. Just * restore the itemIndex. + * + * NB: In this case we can't count on anything in so->markPos to be + * accurate. */ so->currPos.itemIndex = so->markItemIndex; } + else if (so->currPos.currPage == so->markPos.currPage) + { + /* + * so->markItemIndex < 0 but mark and current positions are on the + * same page. This would be an unusual case, where the scan moved to + * a new index page after the mark, restored, and later restored again + * without moving off the marked page. It is not clear that this code + * can currently be reached, but it seems better to make this function + * robust for this case than to Assert() or elog() that it can't + * happen. + * + * We neither want to set so->markItemIndex >= 0 (because that could + * cause a later move to a new page to redo the memcpy() executions) + * nor re-execute the memcpy() functions for a restore within the same + * page. The previous restore to this page already set everything + * except markPos as it should be. 
+ */ + so->currPos.itemIndex = so->markPos.itemIndex; + } else { - /* we aren't holding any read locks, but gotta drop the pin */ + /* + * The scan moved to a new page after last mark or restore, and we are + * now restoring to the marked page. We aren't holding any read + * locks, but if we're still holding the pin for the current position, + * we must drop it. + */ if (BTScanPosIsValid(so->currPos)) { /* Before leaving current page, deal with anykilled items */ - if (so->numKilled > 0 && - so->currPos.buf != so->markPos.buf) - _bt_killitems(scan, false); - ReleaseBuffer(so->currPos.buf); - so->currPos.buf = InvalidBuffer; + if (so->numKilled > 0) + _bt_killitems(scan); + BTScanPosUnpinIfPinned(so->currPos); } if (BTScanPosIsValid(so->markPos)) { /* bump pin on mark buffer for assignment to current buffer */ - IncrBufferRefCount(so->markPos.buf); + if (BTScanPosIsPinned(so->markPos)) + IncrBufferRefCount(so->markPos.buf); memcpy(&so->currPos, &so->markPos, offsetof(BTScanPosData,items[1]) + so->markPos.lastItem * sizeof(BTScanPosItem)); @@ -622,6 +640,8 @@ btrestrpos(PG_FUNCTION_ARGS) memcpy(so->currTuples, so->markTuples, so->markPos.nextTupleOffset); } + else + BTScanPosInvalidate(so->currPos); } PG_RETURN_VOID(); diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c index a681d6b..af4201a 100644 --- a/src/backend/access/nbtree/nbtsearch.c +++ b/src/backend/access/nbtree/nbtsearch.c @@ -22,6 +22,7 @@#include "storage/predicate.h"#include "utils/lsyscache.h"#include "utils/rel.h" +#include "utils/tqual.h"static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir, @@ -31,6 +32,36 @@ static void _bt_saveitem(BTScanOpaque so, int itemIndex,static bool _bt_steppage(IndexScanDesc scan, ScanDirectiondir);static Buffer _bt_walk_left(Relation rel, Buffer buf);static bool _bt_endpoint(IndexScanDesc scan, ScanDirectiondir); +static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp); + + +/* + * _bt_drop_lock_and_maybe_pin() + * + * Unlock the buffer; and if it is safe to release the pin, do that, too. It + * is safe if the scan is using an MVCC snapshot and the index is WAL-logged. + * This will prevent vacuum from stalling in a blocked state trying to read a + * page when a cursor is sitting on it -- at least in many important cases. + * + * Set the buffer to invalid if the pin is released, since the buffer may be + * re-used. If we need to go back to this block (for example, to apply + * LP_DEAD hints) we must get a fresh reference to the buffer. Hopefully it + * will remain in shared memory for as long as it takes to scan the index + * buffer page. + */ +static void +_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp) +{ + LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK); + + if (IsMVCCSnapshot(scan->xs_snapshot) && + RelationNeedsWAL(scan->indexRelation) && + !scan->xs_want_itup) + { + ReleaseBuffer(sp->buf); + sp->buf = InvalidBuffer; + } +}/* @@ -506,6 +537,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) StrategyNumber strat_total; BTScanPosItem *currItem; + Assert(!BTScanPosIsValid(so->currPos)); + pgstat_count_index_scan(rel); /* @@ -942,9 +975,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) /* don't need to keep the stack around... 
*/ _bt_freestack(stack); - /* remember which buffer we have pinned, if any */ - so->currPos.buf = buf; - if (!BufferIsValid(buf)) { /* @@ -1023,6 +1053,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) break; } + /* remember which buffer we have pinned, if any */ + Assert(!BTScanPosIsValid(so->currPos)); + so->currPos.buf = buf; + /* * Now load data from the first page of the scan. */ @@ -1032,12 +1066,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir) * There's no actually-matching data on thispage. Try to advance to * the next page. Return false if there's no matching data at all. */ + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); if (!_bt_steppage(scan, dir)) return false; } - - /* Drop the lock, but not pin, on the current page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + else + { + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + } /* OK, itemIndex says what to return */ currItem = &so->currPos.items[so->currPos.itemIndex]; @@ -1051,8 +1088,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)/* * _bt_next() -- Get the next item in a scan.* - * On entry, so->currPos describes the current page, which is pinned - * but not locked, and so->currPos.itemIndex identifies which item was + * On entry, so->currPos describes the current page, which may be pinned + * but is not locked, and so->currPos.itemIndex identifies which item was * previously returned. * * On successful exit, scan->xs_ctup.t_self is set to the TID of the @@ -1076,26 +1113,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir) { if (++so->currPos.itemIndex > so->currPos.lastItem) { - /* We must acquire lock before applying _bt_steppage */ - Assert(BufferIsValid(so->currPos.buf)); - LockBuffer(so->currPos.buf, BT_READ); if (!_bt_steppage(scan, dir)) return false; - /* Drop the lock, but not pin, on the new page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); } } else { if (--so->currPos.itemIndex< so->currPos.firstItem) { - /* We must acquire lock before applying _bt_steppage */ - Assert(BufferIsValid(so->currPos.buf)); - LockBuffer(so->currPos.buf, BT_READ); if (!_bt_steppage(scan, dir)) return false; - /* Drop the lock, but not pin, on the new page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); } } @@ -1135,7 +1162,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) IndexTuple itup; bool continuescan; - /* we must have the buffer pinned and locked */ + /* + * We must have the buffer pinned and locked, but the usual macro can't be + * used here; this function is what makes it good for currPos. + */ Assert(BufferIsValid(so->currPos.buf)); page = BufferGetPage(so->currPos.buf); @@ -1144,6 +1174,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) maxoff = PageGetMaxOffsetNumber(page); /* + * We note the buffer's block number so that we can release the pin later. + * This allows us to re-read the buffer if it is needed again for hinting. + */ + so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf); + + /* + * We save the LSN of the page as we read it, so that we know whether it + * safe to apply LP_DEAD hints to the page later. This allows us to drop + * the pin for MVCC scans, which allows vacuum to avoid blocking. + */ + so->currPos.lsn = PageGetLSN(page); + + /* * we must save the page's right-link while scanning it; this tells us * where to step right to after we'redone with these items. 
There is no * corresponding need for the left-link, since splits always go right. @@ -1153,6 +1196,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) /* initialize tuple workspaceto empty */ so->currPos.nextTupleOffset = 0; + /* + * Now that the current page has been made consistent, the macro should be + * good. + */ + Assert(BTScanPosIsPinned(so->currPos)); + if (ScanDirectionIsForward(dir)) { /* load items[] in ascending order */ @@ -1241,11 +1290,14 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,/* * _bt_steppage() -- Step to next page containingvalid data for scan * - * On entry, so->currPos.buf must be pinned and read-locked. We'll drop - * the lock and pin before moving to next page. + * On entry, if so->currPos.buf is valid the buffer is pinned but not locked; + * if pinned, we'll drop the pin before moving to next page. The buffer is + * not locked on entry. * - * On success exit, we hold pin and read-lock on the next interesting page, - * and so->currPos is updated to contain data from that page. + * On success exit, so->currPos is updated to contain data from the next + * interesting page. For success on a scan using a non-MVCC snapshot we hold + * a pin, but not a read lock, on that page. If we do not hold the pin, we + * set so->currPos.buf to InvalidBuffer. We return TRUE to indicate success. * * If there are no more matching recordsin the given direction, we drop all * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE. @@ -1258,12 +1310,11 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) Page page; BTPageOpaque opaque; - /* we must have the buffer pinned and locked */ - Assert(BufferIsValid(so->currPos.buf)); + Assert(BTScanPosIsValid(so->currPos)); /* Before leaving current page, deal with any killed items */ if (so->numKilled> 0) - _bt_killitems(scan, true); + _bt_killitems(scan); /* * Before we modify currPos, make a copy of the page data if there was a @@ -1272,7 +1323,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) if (so->markItemIndex >= 0) { /* bumppin on current buffer for assignment to mark buffer */ - IncrBufferRefCount(so->currPos.buf); + if (BTScanPosIsPinned(so->currPos)) + IncrBufferRefCount(so->currPos.buf); memcpy(&so->markPos, &so->currPos, offsetof(BTScanPosData,items[1]) + so->currPos.lastItem * sizeof(BTScanPosItem)); @@ -1294,14 +1346,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) /* Remember we left a page with data */ so->currPos.moreLeft = true; + /* release the previous buffer, if pinned */ + BTScanPosUnpinIfPinned(so->currPos); + for (;;) { - /* release the previous buffer */ - _bt_relbuf(rel, so->currPos.buf); - so->currPos.buf = InvalidBuffer; /* if we're at end of scan, give up */ if (blkno == P_NONE|| !so->currPos.moreRight) + { + BTScanPosInvalidate(so->currPos); return false; + } /* check for interrupts while we're not holding any buffer lock */ CHECK_FOR_INTERRUPTS(); /* step right one page */ @@ -1317,8 +1372,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) break; } + /* nope, keep going */ blkno = opaque->btpo_next; + _bt_relbuf(rel, so->currPos.buf); } } else @@ -1332,14 +1389,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) * to our left splits while we are inflight to it, plus the * possibility that the page we were on gets deleted after we leave * it. See nbtree/READMEfor details. 
+ * + * It might be possible to rearrange this code to have less overhead + * in pinning and locking, but that would require capturing the left + * pointer when the page is initially read, and using it here, along + * with big changes to _bt_walk_left() and the code below. It is not + * clear whether this would be a win, since if the page immediately to + * the left splits after we read this page and before we step left, we + * would need to visit more pages than with the current code. */ + if (BTScanPosIsPinned(so->currPos)) + LockBuffer(so->currPos.buf, BT_READ); + else + so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ); + for (;;) { /* Done if we know there are no matching keys to the left */ if (!so->currPos.moreLeft) { _bt_relbuf(rel, so->currPos.buf); - so->currPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); return false; } @@ -1348,7 +1418,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) /* if we're physically at end of index,return failure */ if (so->currPos.buf == InvalidBuffer) + { + BTScanPosInvalidate(so->currPos); return false; + } /* * Okay, we managed to move left to a non-deleted page. Done if @@ -1368,6 +1441,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir) } } + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + return true;} @@ -1604,7 +1680,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) * exists. */ PredicateLockRelation(rel,scan->xs_snapshot); - so->currPos.buf = InvalidBuffer; + BTScanPosInvalidate(so->currPos); return false; } @@ -1658,12 +1734,15 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir) * There's no actually-matching data onthis page. Try to advance to * the next page. Return false if there's no matching data at all. */ + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); if (!_bt_steppage(scan, dir)) return false; } - - /* Drop the lock, but not pin, on the current page */ - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); + else + { + /* Drop the lock, and maybe the pin, on the current page */ + _bt_drop_lock_and_maybe_pin(scan, &so->currPos); + } /* OK, itemIndex says what to return */ currItem = &so->currPos.items[so->currPos.itemIndex]; diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c index 379dac9..d1589f0 100644 --- a/src/backend/access/nbtree/nbtutils.c +++ b/src/backend/access/nbtree/nbtutils.c @@ -1721,27 +1721,35 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc, * _bt_killitems - set LP_DEADstate for items an indexscan caller has * told us were killed * - * scan->so contains information about the current page and killed tuples - * thereon (generally, this should only be called if so->numKilled > 0). + * scan->opaque, referenced locally through so, contains information about the + * current page and killed tuples thereon (generally, this should only be + * called if so->numKilled > 0). * - * The caller must have pin on so->currPos.buf, but may or may not have - * read-lock, as indicated by haveLock. Note that we assume read-lock - * is sufficient for setting LP_DEAD status (which is only a hint). + * The caller does not have a lock on the page and may or may not have the + * page pinned in a buffer. Note that read-lock is sufficient for setting + * LP_DEAD status (which is only a hint). * * We match items by heap TID before assuming they are the right ones to * delete. We cope with cases where items have moved right due to insertions. 
* If an item has moved off the current page dueto a split, we'll fail to * find it and do nothing (this is not an error case --- we assume the item - * will eventually get marked in a future indexscan). Note that because we - * hold pin on the target page continuously from initially reading the items - * until applying this function, VACUUM cannot have deleted any items from - * the page, and so there is no need to search left from the recorded offset. - * (This observation also guarantees that the item is still the right one - * to delete, which might otherwise be questionable since heap TIDs can get - * recycled.) + * will eventually get marked in a future indexscan). + * + * Note that if we hold a pin on the target page continuously from initially + * reading the items until applying this function, VACUUM cannot have deleted + * any items from the page, and so there is no need to search left from the + * recorded offset. (This observation also guarantees that the item is still + * the right one to delete, which might otherwise be questionable since heap + * TIDs can get recycled.) This holds true even if the page has been modified + * by inserts and page splits, so there is no need to consult the LSN. + * + * If the pin was released after reading the page, then we re-read it. If it + * has been modified since we read it (as determined by the LSN), we dare not + * flag any entries because it is possible that the old entry was vacuumed + * away and the TID was re-used by a completely different heap tuple. */void -_bt_killitems(IndexScanDesc scan, bool haveLock) +_bt_killitems(IndexScanDesc scan){ BTScanOpaque so = (BTScanOpaque) scan->opaque; Page page; @@ -1749,19 +1757,56 @@ _bt_killitems(IndexScanDesc scan, bool haveLock) OffsetNumber minoff; OffsetNumber maxoff; int i; + int numKilled = so->numKilled; bool killedsomething = false; - Assert(BufferIsValid(so->currPos.buf)); + Assert(BTScanPosIsValid(so->currPos)); + + /* + * Always reset the scan state, so we don't look for same items on other + * pages. + */ + so->numKilled = 0; - if (!haveLock) + if (BTScanPosIsPinned(so->currPos)) + { + /* + * We have held the pin on this page since we read the index tuples, + * so all we need to do is lock it. The pin will have prevented + * re-use of any TID on the page, so there is no need to check the + * LSN. + */ LockBuffer(so->currPos.buf, BT_READ); - page = BufferGetPage(so->currPos.buf); + page = BufferGetPage(so->currPos.buf); + } + else + { + Buffer buf; + + /* Attempt to re-read the buffer, getting pin and lock. */ + buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ); + + /* It might not exist anymore; in which case we can't hint it. */ + if (!BufferIsValid(buf)) + return; + + page = BufferGetPage(buf); + if (PageGetLSN(page) == so->currPos.lsn) + so->currPos.buf = buf; + else + { + /* Modified while not pinned means hinting is not safe. */ + _bt_relbuf(scan->indexRelation, buf); + return; + } + } + opaque = (BTPageOpaque) PageGetSpecialPointer(page); minoff = P_FIRSTDATAKEY(opaque); maxoff = PageGetMaxOffsetNumber(page); - for (i = 0; i < so->numKilled; i++) + for (i = 0; i < numKilled; i++) { int itemIndex = so->killedItems[i]; BTScanPosItem *kitem= &so->currPos.items[itemIndex]; @@ -1799,14 +1844,7 @@ _bt_killitems(IndexScanDesc scan, bool haveLock) MarkBufferDirtyHint(so->currPos.buf, true); } - if (!haveLock) - LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK); - - /* - * Always reset the scan state, so we don't look for same items on other - * pages. 
- */ - so->numKilled = 0; + LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);} diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c index e2ace5c..2f85867 100644 --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -395,7 +395,8 @@ btree_xlog_vacuum(XLogReaderState *record) * unpinned between the lastBlockVacuumed and the currentblock, if there * are any. This prevents replay of the VACUUM from reaching the stage of * removing heaptuples while there could still be indexscans "in flight" - * to those particular tuples (see nbtree/README). + * to those particular tuples for those scans which could be confused by + * finding new tuples at the old TID locations (see nbtree/README). * * It might be worth checking if thereare actually any backends running; * if not, we could just skip this. diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README index a4ebbcc..45c5c83 100644 --- a/src/backend/storage/buffer/README +++ b/src/backend/storage/buffer/README @@ -18,8 +18,8 @@ backend to pin a page more than once concurrently; the buffer managerhandles this efficiently. It is consideredOK to hold a pin for longintervals --- for example, sequential scans hold a pin on the current pageuntil done processingall the tuples on the page, which could be quite a -while if the scan is the outer scan of a join. Similarly, btree index -scans hold a pin on the current index page. This is OK because normal +while if the scan is the outer scan of a join. Similarly, a btree index +scan may hold a pin on the current index page. This is OK because normaloperations never wait for a page's pin count todrop to zero. (Anythingthat might need to do such a wait is instead handled by waiting to obtainthe relation-level lock,which is why you'd better hold one first.) 
Pins diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h index 990dab3..2ef349b 100644 --- a/src/include/access/nbtree.h +++ b/src/include/access/nbtree.h @@ -518,6 +518,8 @@ typedef struct BTScanPosData{ Buffer buf; /* if valid, the buffer is pinned */ + XLogRecPtr lsn; /* pos in the WAL stream when page was read */ + BlockNumber currPage; /* page we've referencd by items array */ BlockNumber nextPage; /* page's rightlink when we scanned it */ /* @@ -551,7 +553,37 @@ typedef struct BTScanPosDatatypedef BTScanPosData *BTScanPos; -#define BTScanPosIsValid(scanpos) BufferIsValid((scanpos).buf) +#define BTScanPosIsPinned(scanpos) \ +( \ + AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ + !BufferIsValid((scanpos).buf)), \ + BufferIsValid((scanpos).buf) \ +) +#define BTScanPosUnpin(scanpos) \ + do { \ + ReleaseBuffer((scanpos).buf); \ + (scanpos).buf = InvalidBuffer; \ + } while (0) +#define BTScanPosUnpinIfPinned(scanpos) \ + do { \ + if (BTScanPosIsPinned(scanpos)) \ + BTScanPosUnpin(scanpos); \ + } while (0) + +#define BTScanPosIsValid(scanpos) \ +( \ + AssertMacro(BlockNumberIsValid((scanpos).currPage) || \ + !BufferIsValid((scanpos).buf)), \ + BlockNumberIsValid((scanpos).currPage) \ +) +#define BTScanPosInvalidate(scanpos) \ + do { \ + (scanpos).currPage = InvalidBlockNumber; \ + (scanpos).nextPage = InvalidBlockNumber; \ + (scanpos).buf = InvalidBuffer; \ + (scanpos).lsn = InvalidXLogRecPtr; \ + (scanpos).nextTupleOffset = 0; \ + } while (0);/* We need one of these for each equality-type SK_SEARCHARRAY scan key */typedef struct BTArrayKeyInfo @@ -697,7 +729,7 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);extern IndexTuple _bt_checkkeys(IndexScanDesc scan, Page page, OffsetNumber offnum, ScanDirection dir, bool *continuescan); -extern void _bt_killitems(IndexScanDesc scan, bool haveLock); +extern void _bt_killitems(IndexScanDesc scan);extern BTCycleId _bt_vacuum_cycleid(Relation rel);extern BTCycleId _bt_start_vacuum(Relationrel);extern void _bt_end_vacuum(Relation rel);
Simon Riggs <simon@2ndQuadrant.com> wrote: > On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: > >> The feedback was generally fairly positive except for the fact that >> snapshot "age" (for purposes of being too old) was measured in >> transaction IDs assigned. There seemed to be a pretty universal >> feeling that this needed to be changed to a time-based setting. > > -1 for a time based setting. > > After years of consideration, bloat is now controllable by altering > the size of the undo tablespace. > > I think PostgreSQL needs something size-based also. It would need some > estimation to get it to work like that, true, but it is actually the > size of the bloat we care about, not the time. So we should be > thinking in terms of limits that we actually care about. Are you thinking, then, that WAL volume generated (as determined by LSN) would be the appropriate unit of measure for this? (We would still need to map that back to transaction IDs for vacuuming, of course.) If we did that we could allow the "size" units of measure, like '5GB' and similar. Or are you thinking of something else? Given that there seems to be disagreement on what is the more useful metric, do we want to consider allowing more than one? If so, would it be when *all* conditions are met or when *any* conditions are met? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
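Whichever unit of measure wins, the "map that back to transaction IDs" step mentioned above can be cheap in any of them: keep a small ring of samples pairing recent progress in the chosen unit (minutes, MB of WAL, ...) with the newest xmin seen at that point, and let vacuum look up the slot that lies the configured threshold behind the present. The sketch below only illustrates that shape; every name in it is hypothetical and none of it comes from either patch, and a real version would live in shared memory with locking and wraparound handling.

    #include "postgres.h"
    #include "access/transam.h"

    /* Hypothetical sketch: map an age threshold back to an xid horizon. */
    #define AGE_MAP_BUCKETS 100             /* history kept, in threshold units */

    typedef struct AgeToXidMap
    {
        uint64          head_unit;          /* latest unit recorded */
        TransactionId   xid_by_unit[AGE_MAP_BUCKETS];
    } AgeToXidMap;

    /* Record progress once per unit (per minute, per N MB of WAL, etc.). */
    static void
    age_map_advance(AgeToXidMap *map, uint64 now_unit, TransactionId latest_xmin)
    {
        map->xid_by_unit[now_unit % AGE_MAP_BUCKETS] = latest_xmin;
        map->head_unit = now_unit;
    }

    /* Vacuum asks: which xid corresponds to "threshold units ago"? */
    static TransactionId
    age_map_horizon(const AgeToXidMap *map, uint64 threshold_units)
    {
        uint64      unit;

        /* A real version would clamp the threshold to the ring size. */
        if (threshold_units >= map->head_unit ||
            threshold_units >= AGE_MAP_BUCKETS)
            return InvalidTransactionId;    /* not enough history yet */

        unit = map->head_unit - threshold_units;
        return map->xid_by_unit[unit % AGE_MAP_BUCKETS];
    }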
On Mon, Mar 16, 2015 at 8:48 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > Given that there seems to be disagreement on what is the more > useful metric, do we want to consider allowing more than one? If > so, would it be when *all* conditions are met or when *any* > conditions are met? Oh, gosh. Maybe we could invent a whole policy language for this. Or, if we want to do something less painful, we could fire nail-guns into our eyeballs. I'm not sure what the value of basing this on WAL volume is, and I think we should talk about that a bit more first. The value of doing this by rate of XID consumption is easy to understand: it's simple to implement. The value of doing this by time is also simple to understand: if you set the setting to 20 minutes, then you can be sure that queries will not get cancelled unless they run for more than 20 minutes. This clearly makes the system easier to reason about. I don't see what the benefit of doing this by WAL volume would be. What Simon actually proposed was by *bloat*; that is, if the overall amount of bloat in the system exceeds some metric, start whacking old queries in order to control it. The appeal of that is obvious, but I think it would be very hard to make work, because canceling queries won't get rid of the bloat you've already introduced, and so you'd tend to cancel nothing until you hit some threshold, and then start canceling everything. If we want this feature, let's try to keep the spec simple enough that we actually get it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Thank you for rewriting. Thank *you* for pointing out where more work was needed in the comments and README! > I attached the your latest patch to this mail as > bt-nopin-v4.patch for now. Please check that there's no problem > in it. I checked out master, applied the patch and checked it against my latest code (merged with master), and it matched. So, looks good. > - By this patch, index scan becomes to release buffer pins while > fetching index tuples in a page, so it should reduce the chance > of index scans with long duration to block vacuum, because > vacuums now can easily overtake the current position of an > index scan. I didn't actually measured how effective it is, > though. That part got pretty thorough testing on end-user software. The whole reason for writing the patch was that after applying the snapshot-too-old PoC patch they still saw massive bloat because all autovacuum workers blocked behind cursors which were left idle. The initial version of this patch fixed that, preventing (in combination with the other patch) uncontrolled bloat in a 48-hour test. > - It makes no performance deterioration, on the contrary it > accelerates index scans. It seems to be because of removal of > lock and unlock surrounding _bt_steppage in bt_next. I fear that the performance improvement may only show up on forward scans -- because the whole L&Y algorithm depends on splits only moving right, walking right is almost trivial to perform correctly with minimal locking and pinning. Our left pointers help with scanning backward, but there are a lot of conditions that can complicate that, leaving us with the choice between keeping some of the locking or potentially scanning more pages than we now do on a backward scan. Since it wasn't clear how many cases would benefit from a change and how many would lose, I pretty much left the backward scan locking alone in this patch. Changing the code to work the other way would not be outrageously difficult, but benchmarking to determine whether the alternative was a net win would be pretty tricky and very time-consuming. If that is to be done, I strongly feel it should be a separate patch. Because this patch makes a forward scan faster without having much effect on a backward scan, the performance difference between them (which has always existed) will get a little wider. I wonder whether this difference should perhaps be reflected in plan costing. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 16 March 2015 at 12:48, Kevin Grittner <kgrittn@ymail.com> wrote: > Simon Riggs <simon@2ndQuadrant.com> wrote: >> On 13 March 2015 at 15:41, Kevin Grittner <kgrittn@ymail.com> wrote: >> >>> The feedback was generally fairly positive except for the fact that >>> snapshot "age" (for purposes of being too old) was measured in >>> transaction IDs assigned. There seemed to be a pretty universal >>> feeling that this needed to be changed to a time-based setting. >> >> -1 for a time based setting. >> >> After years of consideration, bloat is now controllable by altering >> the size of the undo tablespace. >> >> I think PostgreSQL needs something size-based also. It would need some >> estimation to get it to work like that, true, but it is actually the >> size of the bloat we care about, not the time. So we should be >> thinking in terms of limits that we actually care about. > > Are you thinking, then, that WAL volume generated (as determined by > LSN) would be the appropriate unit of measure for this? (We would > still need to map that back to transaction IDs for vacuuming, of > course.) If we did that we could allow the "size" units of > measure, like '5GB' and similar. Or are you thinking of something > else? It's probably the closest and easiest measure, and the most meaningful. We can easily accumulate that in a data structure in clog, like async commit LSN. For the next release though, since it will take a little bit of thought to interpret that. With commit timestamp enabled in 9.5, we can easily judge a time limit, but it is less useful because it's not a measure of bloat. As I've said, I'd be happy with just an xid limit for 9.5, if that was the only thing we had. But I think timestamp is just as easy. > Given that there seems to be disagreement on what is the more > useful metric, do we want to consider allowing more than one? If > so, would it be when *all* conditions are met or when *any* > conditions are met? Yours was the first reply to my idea, so I think it's too early to describe that as disagreement. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 16 March 2015 at 13:02, Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure what the value of basing this on WAL volume is, and I > think we should talk about that a bit more first. The value of doing > this by rate of XID consumption is easy to understand: it's simple to > implement. The value of doing this by time is also simple to > understand: if you set the setting to 20 minutes, then you can be sure > that queries will not get cancelled unless they run for more than 20 > minutes. This clearly makes the system easier to reason about. I > don't see what the benefit of doing this by WAL volume would be. I think we can assess that more clearly as the amount of now-dead space left by an action. So we can include Updates and Deletes, or in the case of aborted transactions, the total WAL written. I assumed that was what Kevin meant. > What Simon actually proposed was by *bloat*; that is, if the overall > amount of bloat in the system exceeds some metric, start whacking old > queries in order to control it. The appeal of that is obvious, Agreed > but I > think it would be very hard to make work, because canceling queries > won't get rid of the bloat you've already introduced, and so you'd > tend to cancel nothing until you hit some threshold, and then start > canceling everything. That would just be an unhelpful way of using that info. There are many possible designs that wouldn't need to work that way. We have many cases where we begin a cleanup action before we run out of space for other server resources. VACUUM is itself a good example, but there are many others. The main point is that if we blindly decide things based upon xid age or time then it will be wrong in many cases, since apps with uniform write rates are rare. We need to work out how to limit bloat itself. If we can't do that easily at a macro level, then we'll have to do that more precisely at the table level, for example having VACUUM decide where to put the limit if we find too much unremovable/recently dead data. > If we want this feature, let's try to keep the spec simple enough that > we actually get it. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Mon, Mar 16, 2015 at 12:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> What Simon actually proposed was by *bloat*; that is, if the overall >> amount of bloat in the system exceeds some metric, start whacking old >> queries in order to control it. The appeal of that is obvious, > > Agreed > >> but I >> think it would be very hard to make work, because canceling queries >> won't get rid of the bloat you've already introduced, and so you'd >> tend to cancel nothing until you hit some threshold, and then start >> canceling everything. > > That would just be an unhelpful way of using that info. There are many > possible designs that wouldn't need to work that way. > > We have many cases where we begin a cleanup action before we run out > of space for other server resources. VACUUM is itself a good example, > but there are many others. > > The main point is that if we blindly decide things based upon xid age > or time then it will be wrong in many cases, since apps with uniform > write rates are rare. We need to work out how to limit bloat itself. > If we can't do that easily at a macro level, then we'll have to do > that more precisely at the table level, for example having VACUUM > decide where to put the limit if we find too much unremovable/recently > dead data. I am sure there are more sophisticated things to be done here, but I guess my feeling is that time is a good way to go here for a first cut - lots of people have suggested it, and there's clearly a use case for it. If the setting turns out to be popular, we can fine-tune it more once we've got some experience with it. But I'm nervous about trying to design something more complicated than that right now, especially so late in the release cycle. We know that this setting, with time-based units, will meet the needs of the customer for whom it was developed, and that is a good-enough reason to think that time is a reasonable metric for this, even if we eventually have others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 3/16/15 11:47 AM, Robert Haas wrote: > I am sure there are more sophisticated things to be done here, but I > guess my feeling is that time is a good way to go here for a first cut > - lots of people have suggested it, and there's clearly a use case for > it. If the setting turns out to be popular, we can fine-tune it more > once we've got some experience with it. But I'm nervous about trying > to to design something more complicated than that right now, > especially so late in the release cycle. We know that this setting, > with time-based units, will meet the needs of the customer for whom it > was developed, and that is a good-enough reason to think that time is > a reasonable metric for this, even if we eventually have others. +1. As you said, time is easy for people to understand, and I think it will handle a large chunk of the use cases. I used to have this quote (or something close to it) on my whiteboard... I think it's appropriate here ;) "The perfect is the enemy of the good. -Simon Riggs" -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 18 March 2015 at 03:01, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 3/16/15 11:47 AM, Robert Haas wrote: >> >> I am sure there are more sophisticated things to be done here, but I >> guess my feeling is that time is a good way to go here for a first cut >> - lots of people have suggested it, and there's clearly a use case for >> it. If the setting turns out to be popular, we can fine-tune it more >> once we've got some experience with it. But I'm nervous about trying >> to to design something more complicated than that right now, >> especially so late in the release cycle. We know that this setting, >> with time-based units, will meet the needs of the customer for whom it >> was developed, and that is a good-enough reason to think that time is >> a reasonable metric for this, even if we eventually have others. > > > +1. As you said, time is easy for people to understand, and I think it will > handle a large chunk of the use cases. If it wasn't clear, I had already said that my idea was for next release, before Robert said that. > I used to have this quote (or something close to it) on my whiteboard... I > think it's appropriate here ;) > > "The perfect is the enemy of the good. -Simon Riggs" Interesting to hear I was anybody's pinup ;-) Not sure that is a saying of mine though, sounds more like Robert to me. I do believe in something now, something better later. I'd call that incremental benefit, as opposed to incremental development where the benefit may not be realisable until the last point. Which is why I am backing this patch for 9.5, even though the UI is less useful than I think is eventually necessary. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Sat, Mar 14, 2015 at 4:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> -1 for a time based setting.
> After years of consideration, bloat is now controllable by altering
> the size of the undo tablespace.
Hm. Well, fwiw the situation is rather more complicated than that. You're correct that you can create a fixed-size undo tablespace, in which case Oracle will tell you approximately how much retention you can rely on given that size. But alternatively you can set the undo tablespace to be autoextend and Oracle will tune it to ensure there's enough retention to cover the longest-running query in the system. You can also set a time-based parameter (undo_retention) to ensure there's always at least enough data retained to cover a fixed amount of time for flashback (Oracle's equivalent to our time travel feature in ancient versions of Postgres).
greg
Kevin Grittner <kgrittn@ymail.com> wrote: > Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> I attached the your latest patch to this mail as >> bt-nopin-v4.patch for now. Please check that there's no problem >> in it. > > I checked out master, applied the patch and checked it against my > latest code (merged with master), and it matched. So, looks > good. Pushed with the addition of one more paragraph of comments, regarding something to watch out for in any follow-on patch to eliminate the pin for a scan using a non-MVCC snapshot. Thanks again to Kyotaro-san and Heikki for the reviews! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company