Thread: In progress INSERT wrecks plans on table
Recently we encountered the following unhappy sequence of events:

1/ system running happily
2/ batch load into table begins
3/ very quickly (some) preexisting queries on said table go orders of magnitude slower
4/ database instance becomes unresponsive
5/ application outage

After looking down a few false leads, we've isolated the cause to the following: the accumulating in-progress row changes are such that previously optimal plans are optimal no longer. Now this situation will fix itself when the next autoanalyze happens (and a new plan will be chosen) - however that cannot occur until the batch load is completed and committed (approx 70 seconds). During that time there is enough of a performance degradation for queries still using the old plan to cripple the server.

Now that we know what is happening we can work around it. But I'm wondering - is there any way (or if not, should there be one) to let postgres handle this automatically? I experimented with a quick hack to src/backend/commands/analyze.c (attached) that lets another session's ANALYZE see in-progress rows - which works, but a) may cause other problems and b) does not help autoanalyze, which has to wait for COMMIT + stats message.

I've attached a (synthetic) test case that shows the issue; I'll reproduce the output below to hopefully make the point obvious:

              Table "public.plan"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 id     | integer                     | not null
 typ    | integer                     | not null
 dat    | timestamp without time zone |
 val    | text                        | not null
Indexes:
    "plan_id" UNIQUE, btree (id)
    "plan_dat" btree (dat)
    "plan_typ" btree (typ)

[Session 1]

EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;

                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..265.47 rows=55 width=117) (actual time=0.130..4.409 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 4.440 ms
(5 rows)

[Session 2]

BEGIN;
INSERT INTO plan
SELECT id + 2000001, typ, current_date + id * '1 seconds'::interval, val
FROM plan;

[Session 1]

EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..551.35 rows=91 width=117) (actual time=0.131..202.699 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 202.729 ms
(5 rows)

[Session 2]

COMMIT;

[Session 1...wait for autoanalyze to finish, then]

EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;

                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on plan  (cost=407.87..44991.95 rows=10116 width=117) (actual time=2.692..6.582 rows=75 loops=1)
   Recheck Cond: (typ = 3)
   Filter: (dat IS NOT NULL)
   Rows Removed by Filter: 19925
   ->  Bitmap Index Scan on plan_typ  (cost=0.00..405.34 rows=20346 width=0) (actual time=2.573..2.573 rows=20000 loops=1)
         Index Cond: (typ = 3)
 Total runtime: 6.615 ms

Regards

Mark
Attachment
On 26/04/13 14:56, Gavin Flower wrote:
> On 26/04/13 14:33, Mark Kirkwood wrote:
>> [...]
>
> Hmm...
>
> You need to specify:
>
> 1. version of Postgres
> 2. Operating system
> 3. changes to postgresql.conf
> 4. CPU/RAM etc
> 5. anything else that might be relevant

While in general you are quite correct - in the above case (particularly as I've supplied a test case) it should be pretty obvious that any moderately modern version of postgres on any supported platform will exhibit this.

I produced the above test case on Postgres 9.2.4, Ubuntu 13.04, with no changes to the default postgresql.conf.

Now our actual production server is a 32 CPU box with 512GB RAM and 16 SAS SSD, running Postgres 9.2.4 on Ubuntu 12.04. And yes, there are quite a few changes from the defaults there - and I wasted quite a lot of time chasing issues with high CPU and RAM, and changing various configs to see if they helped, before identifying that the issue was in-progress row changes and planner statistics.

Also, in the "real" case with much bigger datasets the difference between the plan being optimal and it *not* being optimal is a factor of 2000x elapsed time instead of a mere 50x!

regards

Mark
On 26/04/13 15:34, Gavin Flower wrote:
> On 26/04/13 15:19, Mark Kirkwood wrote:
>> While in general you are quite correct - in the above case
>> (particularly as I've supplied a test case) it should be pretty
>> obvious that any moderately modern version of postgres on any
>> supported platform will exhibit this.
>
> While I admit that I did not look closely at your test case - I am aware
> that several times changes to Postgres from one minor version to
> another can have drastic unintended side effects (which might, or might
> not, be relevant to your situation). Besides, it helps set the scene,
> and is one less thing that needs to be deduced.

Indeed - however, my perhaps slightly grumpy reply to your email was based on an impression of over-keenness to dismiss my message without reading it (!) and a - dare I say it - one-size-fits-all presentation of "here are the hoops to jump through". Now I spent a reasonable amount of time preparing the message and its attendant test case - and a comment such as yours based on *not reading it* ...errrm... well, let's say I think we can/should do better.

I am concerned that the deafening lack of any replies to my original message is a result of folk glancing at your original quick reply and thinking... incomplete problem spec... ignore... when that is not the case - yes, I should have muttered "9.2" in the original email, but we have covered that now.

Regards

Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> I am concerned that the deafening lack of any replies to my original
> message is a result of folk glancing at your original quick reply and
> thinking... incomplete problem spec... ignore... when that is not the
> case - yes I should have muttered "9.2" in the original email, but we
> have covered that now.

No, I think it's more that we're trying to get to beta, and so anything that looks like new development is getting shuffled to folks' "to look at later" queues. The proposed patch is IMO a complete nonstarter anyway; but I'm not sure what a less bogus solution would look like.

            regards, tom lane
On 02/05/13 02:06, Tom Lane wrote:
> Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
>> I am concerned that the deafening lack of any replies to my original
>> message is a result of folk glancing at your original quick reply [...]
>
> No, I think it's more that we're trying to get to beta, and so anything
> that looks like new development is getting shuffled to folks' "to
> look at later" queues. The proposed patch is IMO a complete nonstarter
> anyway; but I'm not sure what a less bogus solution would look like.

Yeah, I did think that beta might be consuming everyone's attention (of course immediately *after* sending the email)!

And yes, the patch was merely to illustrate the problem rather than any serious attempt at a solution.

Regards

Mark
On 2 May 2013 01:49, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 02/05/13 02:06, Tom Lane wrote:
>> No, I think it's more that we're trying to get to beta, and so anything
>> that looks like new development is getting shuffled to folks' "to
>> look at later" queues. The proposed patch is IMO a complete nonstarter
>> anyway; but I'm not sure what a less bogus solution would look like.
>
> Yeah, I did think that beta might be consuming everyone's attention (of
> course immediately *after* sending the email)!
>
> And yes, the patch was merely to illustrate the problem rather than any
> serious attempt at a solution.

I think we need a problem statement before we attempt a solution, which is what Tom is alluding to.

ISTM that you've got a case where the plan is very sensitive to a table load. Which is a pretty common situation and one that can be solved in various ways. I don't see much that Postgres can do because it can't know ahead of time you're about to load rows. We could imagine an optimizer that set thresholds on plans that caused the whole plan to be recalculated half way thru a run, but that would be a lot of work to design and implement and even harder to test. Having static plans at least allows us to discuss what it does after the fact with some ease.

The plan is set using stats that are set when there are very few non-NULL rows, and those increase massively on load. The way to cope is to run the ANALYZE immediately after the load and then don't allow auto-ANALYZE to reset them later.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
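(For illustration only, a minimal sketch of the "ANALYZE after the load, then stop autoanalyze resetting it" approach Simon describes, using the test table from the original post - the per-table settings are standard storage parameters, and the threshold value here is just an arbitrarily large placeholder:)

-- run once the load has committed
ANALYZE plan;

-- then keep autoanalyze from re-sampling this table behind our back:
-- an absurdly high per-table analyze threshold effectively disables it
ALTER TABLE plan SET (autovacuum_analyze_threshold = 2000000000,
                      autovacuum_analyze_scale_factor = 1.0);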
> On 2 May 2013 01:49, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>
> I think we need a problem statement before we attempt a solution,
> which is what Tom is alluding to.

Actually no - I think Tom (quite correctly) was saying that the patch was not a viable solution. With which I agree. I believe the title of this thread is the problem statement.

> ISTM that you've got a case where the plan is very sensitive to a
> table load. Which is a pretty common situation and one that can be
> solved in various ways. I don't see much that Postgres can do because
> it can't know ahead of time you're about to load rows. We could
> imagine an optimizer that set thresholds on plans that caused the
> whole plan to be recalculated half way thru a run, but that would be a
> lot of work to design and implement and even harder to test. Having
> static plans at least allows us to discuss what it does after the fact
> with some ease.
>
> The plan is set using stats that are set when there are very few
> non-NULL rows, and those increase massively on load. The way to cope
> is to run the ANALYZE immediately after the load and then don't allow
> auto-ANALYZE to reset them later.

No. We do run analyze immediately after the load. The surprise was that this was not sufficient - the (small) amount of time where non-optimal plans were being used due to the in-progress row activity was enough to cripple the system - that is the problem. The analysis of why not led to the test case included in the original email. And sure, it is deliberately crafted to display the issue, and is therefore open to criticism for being artificial. However it was purely meant to make it easy to see what I was talking about.

Currently we are working around this by coercing one of the predicates in the query to discourage the attractive-looking but dangerous index.

I think the idea of telling postgres that we are doing a load is probably the wrong way to go about this. We have a framework that tries to automatically figure out the best plans... I think some more thought about how to make that understand some of the more subtle triggers for a time-to-do-new-plans moment is the way to go. I understand this is probably hard - and may imply some radical surgery to how the stats collector and planner interact.

Regards

Mark
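(The actual production workaround isn't shown in the thread; as a purely hypothetical sketch of that kind of predicate coercion against the test query, wrapping the indexed column in an expression stops the planner from matching the "dat IS NOT NULL" condition to the plan_dat index while returning the same rows:)

SELECT * FROM plan
WHERE typ = 3
  AND COALESCE(dat, dat) IS NOT NULL;  -- same result as "dat IS NOT NULL",
                                       -- but no longer an index condition on plan_dat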
mark.kirkwood@catalyst.net.nz wrote on 03.05.2013 00:19:
> I think the idea of telling postgres that we are doing a load is probably
> the wrong way to go about this. We have a framework that tries to
> automatically figure out the best plans... I think some more thought about
> how to make that understand some of the more subtle triggers for a
> time-to-do-new-plans moment is the way to go. I understand this is
> probably hard - and may imply some radical surgery to how the stats
> collector and planner interact.

I wonder if "freezing" (analyze, then disable autovacuum) the statistics for the large number of rows would work.
> mark.kirkwood@catalyst.net.nz wrote on 03.05.2013 00:19:
>> [...]
>
> I wonder if "freezing" (analyze, then disable autovacuum) the statistics
> for the large number of rows would work.

I'm thinking that the issue is actually the opposite - it is that a new plan is needed because the new (uncommitted) rows are changing the data distribution. So we want more plan instability rather than plan stability :-)

Cheers

Mark
On 2 May 2013 23:19, <mark.kirkwood@catalyst.net.nz> wrote:
> No. We do run analyze immediately after the load. The surprise was that
> this was not sufficient - the (small) amount of time where non optimal
> plans were being used due to the in progress row activity was enough to
> cripple the system - that is the problem. [...]

I had another look at this and see that I read the second explain incorrectly. The amount of data examined and returned is identical in both plans. The only difference is the number of in-progress rows seen by the second query.

Looking at the numbers some more, it looks like 6000 in-progress rows are examined in addition to the data. It might be worth an EXPLAIN patch to put instrumentation in to show that, but it's not that interesting. It would be useful to force the indexscan into a bitmapscan to check that the cost isn't attributable to the plan but to other overheads.

What appears to be happening is we're spending a lot of time in TransactionIdIsInProgress() so we can set hints, and then when we find it is still in progress we then spend more time in XidIsInSnapshot() while we check that it is still invisible to us. Even if the transaction we see repeatedly ends, we will still pay the cost in XidIsInSnapshot repeatedly as we execute.

Given that code path, I would expect it to suck worse on a live system with many sessions, and even worse with many subtransactions.

(1) A proposed fix is attached, but it's only a partial one and barely tested.

Deeper fixes might be

(2) to sort the xid array if we call XidIsInSnapshot too many times in a transaction. I don't think that is worth it, because a long running snapshot may be examined many times, but is unlikely to see multiple in-progress xids repeatedly. Whereas your case seems reasonably common.

(3) to make the check on TransactionIdIsInProgress() into a heuristic, since we don't *need* to check that, so if we keep checking the same xid repeatedly we can reduce the number of checks or avoid xids that seem to be long running. That's slightly more coding than my quick hack here but seems worth it.

I think we need both (1) and (3) but the attached patch does just (1).

This is a similar optimisation to the one I introduced for TransactionIdIsKnownCompleted(), except this applies to repeated checking of as yet-incomplete xids, and to bulk concurrent transactions.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
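(For anyone wanting to try the "force the indexscan into a bitmapscan" check Simon mentions, a minimal sketch using the test query from the original post - just session-local planner settings, nothing from the attached patch:)

SET enable_indexscan = off;          -- planner now falls back to the bitmap scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
RESET enable_indexscan;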
On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
> since we don't *need* to check that, so if we keep checking the same
> xid repeatedly we can reduce the number of checks or avoid xids that
> seem to be long running. That's slightly more coding than my quick
> hack here but seems worth it.
>
> I think we need both (1) and (3) but the attached patch does just (1).
>
> This is a similar optimisation to the one I introduced for
> TransactionIdIsKnownCompleted(), except this applies to repeated
> checking of as yet-incomplete xids, and to bulk concurrent
> transactions.

ISTM we can improve performance of TransactionIdIsInProgress() by caching the procno of our last xid.

Mark, could you retest with both these patches? Thanks.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 05/05/13 00:49, Simon Riggs wrote:
> On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>> [...]
>
> ISTM we can improve performance of TransactionIdIsInProgress() by
> caching the procno of our last xid.
>
> Mark, could you retest with both these patches? Thanks.

Thanks Simon, will do and report back.
On 6 May 2013 02:51, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 05/05/13 00:49, Simon Riggs wrote:
>> ISTM we can improve performance of TransactionIdIsInProgress() by
>> caching the procno of our last xid.
>>
>> Mark, could you retest with both these patches? Thanks.
>
> Thanks Simon, will do and report back.

OK, here's an easily reproducible test...

Prep:

DROP TABLE IF EXISTS plan;
CREATE TABLE plan
(
  id INTEGER NOT NULL,
  typ INTEGER NOT NULL,
  dat TIMESTAMP,
  val TEXT NOT NULL
);
insert into plan select generate_series(1,100000), 0, current_timestamp, 'some texts';
CREATE UNIQUE INDEX plan_id ON plan(id);
CREATE INDEX plan_dat ON plan(dat);

testcase.pgb:

select count(*) from plan where dat is null and typ = 3;

Session 1:

pgbench -n -f testcase.pgb -t 100

Session 2:

BEGIN;
insert into plan select 1000000 + generate_series(1, 100000), 3, NULL, 'b';

Transaction rate in Session 1 (in tps):

(a) before we run Session 2:
    Current: 5600tps
    Patched: 5600tps

(b) after Session 2 has run, yet before transaction end:
    Current: 56tps
    Patched: 65tps

(c) after Session 2 has aborted:
    Current/Patched: 836, 1028, 5400tps
    VACUUM improves timing again

New version of patch attached which fixes a few bugs.

Patch works and improves things, but we're still swamped by the block accesses via the index. Which brings me back to Mark's original point, which is that we are x100 times slower in this case and it *is* because the choice of IndexScan is a bad one for this situation.

After some thought on this, I do think we need to do something about it directly, rather than by tuning infrastructure (as I just attempted). The root cause here is that IndexScan plans are sensitive to mistakes in data distribution, much more so than other plan types.

The two options, broadly, are to either

1. avoid IndexScans in the planner unless they have a *significantly* better cost. At the moment we use IndexScans if cost is lowest, even if that is only by a whisker.

2. make IndexScans adaptive so that they switch to other plan types mid-way through execution.

(2) seems fairly hard generically, since we'd have to keep track of the tids returned from the IndexScan to allow us to switch to a different plan and avoid re-issuing rows that we've already returned. But maybe if we adapted the IndexScan plan type so that it adopted a more page-oriented approach internally, it could act like a bitmapscan. Anyway, that would need some proof that it would work and sounds like a fair task.

(1) sounds more easily possible and plausible. At the moment we have enable_indexscan = off. If we had something like plan_cost_weight_indexscan = N, we could selectively increase the cost of index scans so that they would be less likely to be selected, i.e. plan_cost_weight_indexscan = 2 would mean an indexscan would need to be half the cost of any other plan before it was selected. (Parameter name selected so it could apply to all plan types.) The reason to apply this weighting would be to calculate "risk adjusted cost", not just estimated cost.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
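(plan_cost_weight_indexscan is only a proposal here and does not exist; as a rough sketch, the closest one can get today is to lean on the existing per-session knobs for just the query that is at risk - a blunt instrument, not the risk-adjusted costing Simon is describing:)

BEGIN;
SET LOCAL enable_indexscan = off;    -- or, less drastically: SET LOCAL random_page_cost = 8;
SELECT count(*) FROM plan WHERE dat IS NULL AND typ = 3;
COMMIT;                              -- SET LOCAL reverts at transaction end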
On 03/05/13 00:27, Simon Riggs wrote:
> On 2 May 2013 01:49, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>> [...]
>
> I think we need a problem statement before we attempt a solution,
> which is what Tom is alluding to.
>
> ISTM that you've got a case where the plan is very sensitive to a
> table load. Which is a pretty common situation and one that can be
> solved in various ways. I don't see much that Postgres can do because
> it can't know ahead of time you're about to load rows. [...]

Would it be practicable to have a facility for telling Postgres in advance what you intend to do, so it can create plans accordingly?
I won't try and invent syntax, but it would be good to tell the system that you intend to:
insert a million rows sequentially
or
insert ten million rows randomly
or
update two million rows in the primary key range 'AA00000' to 'PP88877'
etc.
Though sometimes it may be useful to give a little more detail, especially if you have a good estimate of the distribution of primary keys, e.g.:
AA000000
        20%
AA000456
         0%
KE700999
        30%
NN400005
        35%
PA000001
        15%
PP808877
I figure that if the planner had more information about what one intends to do, then it could combine that with the statistics it knows, to come up with a more realistic plan.
Cheers,
Gavin
> (2) seems fairly hard generically, since we'd have to keep track of
> the tids returned from the IndexScan to allow us to switch to a
> different plan and avoid re-issuing rows that we've already returned.
> But maybe if we adapted the IndexScan plan type so that it adopted a
> more page oriented approach internally, it could act like a
> bitmapscan. Anyway, that would need some proof that it would work and
> sounds like a fair task.
>
> (1) sounds more easily possible and plausible. At the moment we have
> enable_indexscan = off. If we had something like
> plan_cost_weight_indexscan = N, we could selectively increase the cost
> of index scans so that they would be less likely to be selected. i.e.
> plan_cost_weight_indexscan = 2 would mean an indexscan would need to
> be half the cost of any other plan before it was selected. (parameter
> name selected so it could apply to all parameter types). The reason to
> apply this weighting would be to calculate "risk adjusted cost" not
> just estimated cost.
>
> --
> Simon Riggs                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services

Another option would be for the bulk insert/update/delete to track the distribution stats as the operation progresses, and if it detects that it is changing the distribution of data beyond a certain threshold it would update the pg stats accordingly.

--
Matt Clarkson
Catalyst.Net Limited
> Simon Riggs wrote:
>
> Patch works and improves things, but we're still swamped by the block
> accesses via the index.

Which *might* be enough to stop it making the server go unresponsive - we'll look at the effect of this in the next few days. Nice work!

> Which brings me back to Mark's original point, which is that we are
> x100 times slower in this case and it *is* because the choice of
> IndexScan is a bad one for this situation.
>
> After some thought on this, I do think we need to do something about
> it directly, rather than by tuning infrastructure (as I just
> attempted). The root cause here is that IndexScan plans are sensitive
> to mistakes in data distribution, much more so than other plan types.
>
> The two options, broadly, are to either
>
> 1. avoid IndexScans in the planner unless they have a *significantly*
> better cost. At the moment we use IndexScans if cost is lowest, even
> if that is only by a whisker.
>
> 2. make IndexScans adaptive so that they switch to other plan types
> mid-way through execution.
>
> [...]

I'm thinking that a variant of (2) might be simpler to implement:

(I think Matt C essentially beat me to this suggestion - he originally discovered this issue.) It is probably good enough for only *new* plans to react to the increased/increasing number of in-progress rows. So this would require backends doing significant numbers of row changes to either directly update pg_statistic or report their in-progress numbers to the stats collector. The key change here is the partial execution numbers would need to be sent. Clearly one would need to avoid doing this too often (!) - possibly only when number of changed rows > autovacuum_analyze_scale_factor proportion of the relation concerned, or similar.

regards

Mark
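(As a rough illustration of the trigger point Mark is alluding to - the number of changed rows at which autovacuum itself would consider an ANALYZE due for the test table, under the current settings:)

SELECT relname,
       reltuples,
       current_setting('autovacuum_analyze_threshold')::int
         + current_setting('autovacuum_analyze_scale_factor')::float8 * reltuples
         AS rows_changed_before_autoanalyze
FROM pg_class
WHERE relname = 'plan';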
On 7 May 2013 01:23, <mark.kirkwood@catalyst.net.nz> wrote:
> I'm thinking that a variant of (2) might be simpler to implement:
>
> (I think Matt C essentially beat me to this suggestion - he originally
> discovered this issue.) It is probably good enough for only *new* plans to
> react to the increased/increasing number of in progress rows. So this
> would require backends doing significant numbers of row changes to either
> directly update pg_statistic or report their in progress numbers to the
> stats collector. The key change here is the partial execution numbers
> would need to be sent. Clearly one would need to avoid doing this too
> often (!) - possibly only when number of changed rows >
> autovacuum_analyze_scale_factor proportion of the relation concerned or
> similar.

Are you loading using COPY? Why not break down the load into chunks?

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/05/13 18:10, Simon Riggs wrote:
> On 7 May 2013 01:23, <mark.kirkwood@catalyst.net.nz> wrote:
>> I'm thinking that a variant of (2) might be simpler to implement:
>> [...]
>
> Are you loading using COPY? Why not break down the load into chunks?

INSERT - but we could maybe work around this by chunking the INSERT. However that *really* breaks the idea that in SQL you just say what you want, not how the database engine should do it! And more practically it means that the most obvious and clear way to add your new data has nasty side effects, and you have to tiptoe around muttering secret incantations to make things work well :-)

I'm still thinking that making postgres smarter about having current stats for getting the actual optimal plan is the best solution.

Cheers

Mark
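(For reference, the chunked-INSERT workaround being discussed would look something like the following - each statement commits on its own under autocommit, so ANALYZE and the planner can see the rows as the load progresses; the chunk size and values here are made up:)

INSERT INTO plan SELECT 1000000 + generate_series(1, 25000), 3, NULL, 'b';
ANALYZE plan;
INSERT INTO plan SELECT 1025000 + generate_series(1, 25000), 3, NULL, 'b';
ANALYZE plan;
-- ... and so on for the remaining chunks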
On Tue, 2013-05-07 at 18:32 +1200, Mark Kirkwood wrote:
> On 07/05/13 18:10, Simon Riggs wrote:
>> Are you loading using COPY? Why not break down the load into chunks?
>
> INSERT - but we could maybe work around this by chunking the INSERT.
> However that *really* breaks the idea that in SQL you just say what you
> want, not how the database engine should do it! And more practically it
> means that the most obvious and clear way to add your new data has nasty
> side effects, and you have to tiptoe around muttering secret incantations
> to make things work well :-)

We also had the same problem with an UPDATE altering the data distribution in such a way that trivial but frequently executed queries caused massive server load until auto analyze sorted out the stats.

--
Matt Clarkson
Catalyst.Net Limited
On 7 May 2013 07:32, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
> On 07/05/13 18:10, Simon Riggs wrote:
>> Are you loading using COPY? Why not break down the load into chunks?
>
> INSERT - but we could maybe work around this by chunking the INSERT.
> However that *really* breaks the idea that in SQL you just say what you
> want, not how the database engine should do it! [...]

Yes, we'd need to break up SQL statements into pieces and use external transaction snapshots to do that.

> I'm still thinking that making postgres smarter about having current stats
> for getting the actual optimal plan is the best solution.

I agree.

The challenge now is to come up with something that actually works; most of the ideas have been very vague and ignore the many downsides. The hard bit is the analysis and balanced thinking, not the developing.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 07/05/13 19:33, Simon Riggs wrote:
> On 7 May 2013 07:32, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>> I'm still thinking that making postgres smarter about having current stats
>> for getting the actual optimal plan is the best solution.
>
> I agree.
>
> The challenge now is to come up with something that actually works;
> most of the ideas have been very vague and ignore the many downsides.
> The hard bit is the analysis and balanced thinking, not the
> developing.

Yeah - seeing likely downsides can be a bit tricky too. I'll have a play with some prototyping ideas, since this is actually an area of postgres (analyze/stats collector) that I've fiddled with before :-)

Cheers

Mark
Well, could you write a trigger that would do what you need? AFAIR analyze data is stored regardless of transaction boundaries. You could store some counters in session vars and issue an explicit analyze when enough rows were added.

On 7 May 2013 08:33, "Mark Kirkwood" <mark.kirkwood@catalyst.net.nz> wrote:
> On 07/05/13 18:10, Simon Riggs wrote:
>> On 7 May 2013 01:23, <mark.kirkwood@catalyst.net.nz> wrote:
>>> I'm thinking that a variant of (2) might be simpler to implement:
>>> [...]
>>
>> Are you loading using COPY? Why not break down the load into chunks?
>
> INSERT - but we could maybe work around this by chunking the INSERT.
> However that *really* breaks the idea that in SQL you just say what you
> want, not how the database engine should do it! And more practically it
> means that the most obvious and clear way to add your new data has nasty
> side effects, and you have to tiptoe around muttering secret incantations
> to make things work well :-)
>
> I'm still thinking that making postgres smarter about having current stats
> for getting the actual optimal plan is the best solution.
>
> Cheers
>
> Mark
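(A minimal sketch of the kind of trigger Vitalii is suggesting - simplified to skip the session-variable counter and just analyze after each bulk INSERT statement; the names are made up, and as the follow-up below explains, an ANALYZE run this way still cannot make another session's in-progress rows count:)

CREATE OR REPLACE FUNCTION plan_analyze_after_load() RETURNS trigger AS $$
BEGIN
    EXECUTE 'ANALYZE plan';  -- sees committed rows plus this transaction's own
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER plan_analyze_after_load
    AFTER INSERT ON plan
    FOR EACH STATEMENT EXECUTE PROCEDURE plan_analyze_after_load();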
(See below for the reply)

On 10/05/13 22:48, Vitalii Tymchyshyn wrote:
> Well, could you write a trigger that would do what you need? AFAIR
> analyze data is stored regardless of transaction boundaries. You could
> store some counters in session vars and issue an explicit analyze when
> enough rows were added.
>
> On 7 May 2013 08:33, "Mark Kirkwood" <mark.kirkwood@catalyst.net.nz> wrote:
>
>     On 07/05/13 18:10, Simon Riggs wrote:
>         Are you loading using COPY? Why not break down the load into chunks?
>
>     INSERT - but we could maybe work around this by chunking the INSERT. [...]
>
>     I'm still thinking that making postgres smarter about having current
>     stats for getting the actual optimal plan is the best solution.

Unfortunately a trigger will not really do the job - analyze ignores in-progress rows (unless they were added by the current transaction), and then the changes made by analyze are not seen by any other sessions. So no changes to plans until the entire INSERT is complete and COMMIT happens (which could be a while - too long in our case).

Figuring out how to improve on this situation is tricky.

Cheers

Mark
Mark Kirkwood <mark.kirkwood@catalyst.net.nz> writes:
> Unfortunately a trigger will not really do the job - analyze ignores in
> progress rows (unless they were added by the current transaction), and
> then the changes made by analyze are not seen by any other sessions. So
> no changes to plans until the entire INSERT is complete and COMMIT
> happens (which could be a while - too long in our case).

I'm not sure I believe the thesis that plans won't change at all. The planner will notice that the physical size of the table is growing. That may not be enough, if the table-contents statistics are missing or completely unreflective of reality, but it's something.

It is true that *already cached* plans won't change until after an ANALYZE is done (the key point there being that ANALYZE sends out a shared-inval message to force replanning of plans for the table). Conceivably you could issue concurrent ANALYZEs occasionally while the INSERT is running, not so much to update the stats --- because they wouldn't --- as to force cached-plan invalidation.

            regards, tom lane
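(Concretely, the "concurrent ANALYZE to force replanning" idea is just this, run every so often from a separate session while the load transaction is still open - the column statistics themselves won't move, as discussed, but cached plans for the table get invalidated:)

-- in another session, repeated every few seconds during the load
ANALYZE plan;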
On 11/05/13 01:30, Tom Lane wrote:
> I'm not sure I believe the thesis that plans won't change at all.
> The planner will notice that the physical size of the table is growing.
> That may not be enough, if the table-contents statistics are missing
> or completely unreflective of reality, but it's something.
>
> It is true that *already cached* plans won't change until after an
> ANALYZE is done (the key point there being that ANALYZE sends out a
> shared-inval message to force replanning of plans for the table).
> Conceivably you could issue concurrent ANALYZEs occasionally while
> the INSERT is running, not so much to update the stats --- because
> they wouldn't --- as to force cached-plan invalidation.

Yeah - true, I was focusing on the particular type of query illustrated in the test case - pretty much entirely needing updated selectivity stats for a column, which wouldn't change unfortunately.

Cheers

Mark
On 03.05.2013 15:41, Simon Riggs wrote:
> What appears to be happening is we're spending a lot of time in
> TransactionIdIsInProgress() so we can set hints and then when we find
> it is still in progress we then spend more time in XidIsInSnapshot()
> while we check that it is still invisible to us. Even if the
> transaction we see repeatedly ends, we will still pay the cost in
> XidIsInSnapshot repeatedly as we execute.
>
> Given that code path, I would expect it to suck worse on a live system
> with many sessions, and even worse with many subtransactions.
>
> (1) A proposed fix is attached, but its only a partial one and barely tested.
>
> Deeper fixes might be
>
> (2) to sort the xid array if we call XidIsInSnapshot too many times
> in a transaction. I don't think that is worth it, because a long
> running snapshot may be examined many times, but is unlikely to see
> multiple in-progress xids repeatedly. Whereas your case seems
> reasonably common.

Yeah, sorting would be a waste of time most of the time.

Instead of adding a new cache field, how about just swapping the matched XID to the beginning of the array?

Did you have some simple performance test script for this?

- Heikki
On 06.05.2013 04:51, Mark Kirkwood wrote:
> On 05/05/13 00:49, Simon Riggs wrote:
>> On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>> since we don't *need* to check that, so if we keep checking the same
>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>> seem to be long running. That's slightly more coding than my quick
>>> hack here but seems worth it.
>>>
>>> I think we need both (1) and (3) but the attached patch does just (1).
>>
>> ISTM we can improve performance of TransactionIdIsInProgress() by
>> caching the procno of our last xid.
>>
>> Mark, could you retest with both these patches? Thanks.
>
> Thanks Simon, will do and report back.

Did anyone ever try (3)?

I'm not sure if this is the same idea as (3) above, but ISTM that HeapTupleSatisfiesMVCC doesn't actually need to call TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The comment at the top of tqual.c says:

> * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
> * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
> * pg_clog). Otherwise we have a race condition: we might decide that a
> * just-committed transaction crashed, because none of the tests succeed.
> * xact.c is careful to record commit/abort in pg_clog before it unsets
> * MyPgXact->xid in PGXACT array. That fixes that problem, but it also
> * means there is a window where TransactionIdIsInProgress and
> * TransactionIdDidCommit will both return true. If we check only
> * TransactionIdDidCommit, we could consider a tuple committed when a
> * later GetSnapshotData call will still think the originating transaction
> * is in progress, which leads to application-level inconsistency. The
> * upshot is that we gotta check TransactionIdIsInProgress first in all
> * code paths, except for a few cases where we are looking at
> * subtransactions of our own main transaction and so there can't be any
> * race condition.

If TransactionIdIsInProgress() returns true for a given XID, then surely it was also running when the snapshot was taken (or had not even begun yet). In which case the XidInMVCCSnapshot() call will also return true. Am I missing something?

There's one little problem: we currently only set the hint bits when TransactionIdIsInProgress() returns false. If we do that earlier, then even though HeapTupleSatisfiesMVCC works correctly thanks to the XidInMVCCSnapshot call, other HeapTupleSatisfies* functions that don't call XidInMVCCSnapshot might see the tuple as committed or aborted too early, if they see the hint bit as set while the transaction is still in-progress according to the proc array. Would have to check all the callers of those other HeapTupleSatisfies* functions to verify if that's OK.

- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> I'm not sure if this is the same idea as (3) above, but ISTM that
> HeapTupleSatisfiesMVCC doesn't actually need to call
> TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
> comment at the top of tqual.c says:
>> [...]
>
> If TransactionIdIsInProgress() returns true for a given XID, then surely
> it was also running when the snapshot was taken (or had not even begun
> yet). In which case the XidInMVCCSnapshot() call will also return true.
> Am I missing something?

Yes, you're failing to understand the nature of the race condition. What we're concerned about is that if tqual says a tuple is committed, its transaction must be committed (not still in progress) according to any *subsequently taken* snapshot. This is not about the contents of the snapshot we're currently consulting; it's about not wanting a tuple to be thought committed if anyone could possibly later decide its transaction is still in progress.

It's possible that this issue would be moot if the only use of transaction-in-progress data were in tqual.c (so that we could assume all later tests use the same logic you propose here), but I doubt that that's true.

            regards, tom lane
On 16 June 2013 16:04, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 03.05.2013 15:41, Simon Riggs wrote:
>> [...]
>> (2) to sort the xid array if we call XidIsInSnapshot too many times
>> in a transaction. I don't think that is worth it, because a long
>> running snapshot may be examined many times, but is unlikely to see
>> multiple in-progress xids repeatedly. Whereas your case seems
>> reasonably common.
>
> Yeah, sorting would be a waste of time most of the time.
>
> Instead of adding a new cache field, how about just swapping the matched XID
> to the beginning of the array?

Do you think that is significantly different from what I've done?

> Did you have some simple performance test script for this?

Files attached to set up and tear down the test. Needs max_prepared_transactions = 100

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 16 June 2013 16:23, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> On 06.05.2013 04:51, Mark Kirkwood wrote:
>>
>> On 05/05/13 00:49, Simon Riggs wrote:
>>>
>>> On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>>> since we don't *need* to check that, so if we keep checking the same
>>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>>> seem to be long running. That's slightly more coding than my quick
>>>> hack here but seems worth it.
>>>>
>>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>>
>>>> This is a similar optimisation to the one I introduced for
>>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>>> checking of as yet-incomplete xids, and to bulk concurrent
>>>> transactions.
>>>
>>> ISTM we can improve performance of TransactionIdIsInProgress() by
>>> caching the procno of our last xid.
>>>
>>> Mark, could you retest with both these patches? Thanks.
>>
>> Thanks Simon, will do and report back.
>
> Did anyone ever try (3) ?

No, because my other patch meant I didn't need to. In other words, my
other patch speeded up repeated access enough that I didn't care about
(3) anymore.

> I'm not sure if this the same idea as (3) above, but ISTM that
> HeapTupleSatisfiesMVCC doesn't actually need to call
> TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
> comment at the top of tqual.c says:
>
>> * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
>> * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
>> * pg_clog). Otherwise we have a race condition: we might decide that a
>> * just-committed transaction crashed, because none of the tests succeed.
>> * xact.c is careful to record commit/abort in pg_clog before it unsets
>> * MyPgXact->xid in PGXACT array. That fixes that problem, but it also
>> * means there is a window where TransactionIdIsInProgress and
>> * TransactionIdDidCommit will both return true. If we check only
>> * TransactionIdDidCommit, we could consider a tuple committed when a
>> * later GetSnapshotData call will still think the originating transaction
>> * is in progress, which leads to application-level inconsistency. The
>> * upshot is that we gotta check TransactionIdIsInProgress first in all
>> * code paths, except for a few cases where we are looking at
>> * subtransactions of our own main transaction and so there can't be any
>> * race condition.
>
> If TransactionIdIsInProgress() returns true for a given XID, then surely it
> was also running when the snapshot was taken (or had not even began yet). In
> which case the XidInMVCCSnapshot() call will also return true. Am I missing
> something?
>
> There's one little problem: we currently only set the hint bits when
> TransactionIdIsInProgress() returns false. If we do that earlier, then even
> though HeapTupleSatisfiesMVCC works correctly thanks to the
> XidInMVCCSnapshot call, other HeapTupleSatisfies* functions that don't call
> XIdInMVCCSnapshot might see the tuple as committed or aborted too early, if
> they see the hint bit as set while the transaction is still in-progress
> according to the proc array. Would have to check all the callers of those
> other HeapTupleSatisfies* functions to verify if that's OK.

Well, I looked at that and it's too complex and fiddly to be worth it, IMHO.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
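As a rough illustration of the kind of caching being discussed above, here is a toy model only, not Simon's actual patch; the small running_xids table merely stands in for the shared proc array, and every name here is invented.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Placeholder for the expensive walk of shared state; here it just consults
 * a small local table of "currently running" XIDs. */
static const TransactionId running_xids[] = {100, 101, 102};

static bool
lookup_in_shared_proc_array(TransactionId xid)
{
    for (int i = 0; i < (int) (sizeof(running_xids) / sizeof(running_xids[0])); i++)
        if (running_xids[i] == xid)
            return true;
    return false;
}

/* One-entry cache: the XID most recently seen to be in progress. */
static TransactionId cachedInProgressXid = InvalidTransactionId;

/*
 * During a bulk load every scan of the table keeps asking about the same
 * still-running inserting XID.  Remembering the last XID found in progress
 * turns those repeat probes into a single comparison instead of a locked
 * walk of shared memory.
 */
static bool
xid_is_in_progress_cached(TransactionId xid)
{
    if (xid == cachedInProgressXid)
        return true;

    if (lookup_in_shared_proc_array(xid))
    {
        cachedInProgressXid = xid;
        return true;
    }
    return false;
}

/*
 * The cached answer goes stale the moment that transaction ends, so any
 * real version needs an invalidation hook like this one, called when the
 * cached transaction is observed to have completed.
 */
static void
forget_cached_in_progress_xid(void)
{
    cachedInProgressXid = InvalidTransactionId;
}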
On Mon, May 6, 2013 at 1:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 6 May 2013 02:51, Mark Kirkwood <mark.kirkwood@catalyst.net.nz> wrote:
>> On 05/05/13 00:49, Simon Riggs wrote:
>>>
>>> On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>>> since we don't *need* to check that, so if we keep checking the same
>>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>>> seem to be long running. That's slightly more coding than my quick
>>>> hack here but seems worth it.
>>>>
>>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>>
>>>> This is a similar optimisation to the one I introduced for
>>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>>> checking of as yet-incomplete xids, and to bulk concurrent
>>>> transactions.
>>>
>>> ISTM we can improve performance of TransactionIdIsInProgress() by
>>> caching the procno of our last xid.
>>>
>>> Mark, could you retest with both these patches? Thanks.
>>
>> Thanks Simon, will do and report back.
>
> OK, here's a easily reproducible test...
>
> Prep:
> DROP TABLE IF EXISTS plan;
> CREATE TABLE plan
> (
>   id  INTEGER NOT NULL,
>   typ INTEGER NOT NULL,
>   dat TIMESTAMP,
>   val TEXT NOT NULL
> );
> insert into plan select generate_series(1,100000), 0,
> current_timestamp, 'some texts';
> CREATE UNIQUE INDEX plan_id ON plan(id);
> CREATE INDEX plan_dat ON plan(dat);
>
> testcase.pgb
> select count(*) from plan where dat is null and typ = 3;
>
> Session 1:
> pgbench -n -f testcase.pgb -t 100
>
> Session 2:
> BEGIN; insert into plan select 1000000 + generate_series(1, 100000),
> 3, NULL, 'b';
>
> Transaction rate in Session 1: (in tps)
> (a) before we run Session 2:
> Current: 5600tps
> Patched: 5600tps
>
> (b) after Session 2 has run, yet before transaction end
> Current: 56tps
> Patched: 65tps

When I run this test case in single-client mode, I don't see nearly
that much speedup, it just goes from 38.99 TPS to 40.12 TPS.  But
still, it is a speedup, and very reproducible (t-test p-val < 1e-40,
n of 21 for both)

But I also tried it with 4 pgbench clients, and ran into a collapse of
the performance, TPS dropping down to ~8 TPS.  It is too variable to
figure out how reliable the speed-up with this patch is, so far.
Apparently they are all fighting over the spinlock on the
ProcArrayLock.

This is a single quad core, "Intel(R) Xeon(R) CPU X3210 @ 2.13GHz"

So I agree with (3) above, about not checking
TransactionIdIsInProgress repeatedly.  Or could we change the order of
operations so that TransactionIdIsInProgress is checked only after
XidInMVCCSnapshot?

Or perhaps the comment "XXX Can we test without the lock first?" could
be implemented and save the day here?

Looking at the code, there is something that bothers me about this part:

    pxid = cxid = InvalidTransactionId;
    return false;

If it is safe to return false at this point (as determined by the
stale values of pxid and cxid) then why would we clear out the stale
values so they can't be used in the future to also short circuit
things?  On the other hand, if the stale values need to be cleared so
they are not misleading to future invocations, why is it safe for this
invocation to have made a decision based on them?

Maybe with more thought I will see why this is so.

....
>
> Which brings me back to Mark's original point, which is that we are
> x100 times slower in this case and it *is* because the choice of
> IndexScan is a bad one for this situation.
>
> After some thought on this, I do think we need to do something about
> it directly, rather than by tuning infrastructire (as I just
> attempted). The root cause here is that IndexScan plans are sensitive
> to mistakes in data distribution, much more so than other plan types.
>
> The two options, broadly, are to either
>
> 1. avoid IndexScans in the planner unless they have a *significantly*
> better cost. At the moment we use IndexScans if cost is lowest, even
> if that is only by a whisker.

This wouldn't work well in Mark's specific case, because the problem
is that it is using the wrong index, not that it is using an index at
all.  There are two candidate indexes, and one looks slightly better
but then gets ruined by the in-progress insert, while the other looks
slightly worse but would be resistant to the in-progress insert.
Switching from the bad index to the seq scan is not going to fix
things.

I don't think there is any heuristic solution here other than to keep
track of the invisible data distribution as well as the visible data
distribution.

The more I've thought about it, the more I see the charm of Mark's
original proposal.  Why not build the statistics assuming that the
in-progress insert will commit?  It is not a complete solution,
because autoanalyze will not get triggered until the transaction
completes.  But why not let the manual ANALYZE get the benefit of
seeing them?

The statistics serve two masters.  One is to estimate how many rows
will actually be returned.  The other is to estimate how much work it
will take to return them (including the work of groveling through a
list of in-process tuples).  Right now those are implicitly considered
the same thing--we could change that and keep separate sets of
statistics, but I think we could improve things some without doing
that.

For the first case of estimating actual rows returned, I think
counting in-progress rows is a wash.  It seems just about as likely
that the bulk transaction which was in progress at the time of the
last ANALYZE has already committed at the time of the planning (but
not yet completed the autoanalyze) as it is that the bulk transaction
is still in progress at the time of the planning; which means counting
the rows as if they committed is sometimes right and sometimes wrong
but in about equal measure.  But for the second master, counting the
in progress rows seems like a win all the time.  Either we actually
see them and do the work, or we refuse to see them but still have to
do the work to figure that out.

Cheers,

Jeff
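A small worked illustration of the "two masters" point; the numbers are purely hypothetical, and only the choice of inputs for each estimate matters.

#include <stdio.h>

int
main(void)
{
    /* Hypothetical figures for one query while the bulk load is open. */
    double visible_in_range    = 6000;      /* committed rows the index range scan visits */
    double inprogress_in_range = 2000000;   /* uncommitted rows it must also visit and test */
    double returned_if_commit  = 150;       /* rows returned if the bulk insert commits first */
    double returned_if_not     = 75;        /* rows returned if it is still open or rolls back */

    /* Master #1: the row-count estimate.  Counting the in-progress rows is
     * right in one outcome and wrong in the other, roughly a wash. */
    printf("rows returned: %.0f or %.0f depending on commit timing\n",
           returned_if_not, returned_if_commit);

    /* Master #2: the work estimate.  The scan visits every tuple in the
     * index range and pays the visibility check on each one, committed or
     * not, so the in-progress rows always belong in the cost. */
    printf("tuples visited either way: %.0f\n",
           visible_in_range + inprogress_in_range);
    return 0;
}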
On Sat, Jul 13, 2013 at 1:47 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But I also tried it with 4 pgbench clients, and ran into a collapse of
> the performance, TPS dropping down to ~8 TPS.  It is too variable to
> figure out how reliable the speed-up with this patch is, so far.
> Apparently they are all fighting over the spinlock on the
> ProcArrayLock.
>
> This is a single quad core, "Intel(R) Xeon(R) CPU X3210 @ 2.13GHz"
>
> So I agree with (3) above, about not checking
> TransactionIdIsInProgress repeatedly.  Or could we change the order of
> operations so that TransactionIdIsInProgress is checked only after
> XidInMVCCSnapshot?

I haven't checked the patch in detail, but it sounds like my proposal
for CSN based snapshots[1] could help here. Using it
TransactionIdIsInProgress can be done completely lock-free. It would
include a direct dense array lookup, read barrier and a check of the
dense/sparse horizon, and if necessary a binary search in the sparse
array and another read barrier and check for sparse array version
counter.

I plan to start working on the patch next week. I hope to have a first
cut available for CF2.

[1] http://www.postgresql.org/message-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=+P9SnoC+Wy3pKdA@mail.gmail.com

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
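A rough self-contained sketch of how such a dense/sparse lookup might look, based only on the description above: the read barriers and version-counter revalidation of the actual proposal are omitted, and every structure and name here is invented for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy model: recent XIDs live in a dense array indexed directly by
 * (xid - horizon); older still-running XIDs sit in a small sorted
 * "sparse" array that is binary searched. */
typedef struct
{
    TransactionId  dense_horizon;   /* first XID covered by the dense array */
    int            dense_len;
    const uint8_t *dense_running;   /* 1 = still in progress */

    const TransactionId *sparse;    /* sorted ascending */
    int                  sparse_len;
} ToyCsnState;

static bool
toy_xid_in_progress(const ToyCsnState *s, TransactionId xid)
{
    /* Fast path: a direct array lookup, no lock and no scan. */
    if (xid >= s->dense_horizon &&
        xid < s->dense_horizon + (TransactionId) s->dense_len)
        return s->dense_running[xid - s->dense_horizon] != 0;

    /* Below the dense horizon: binary search the sparse array. */
    int lo = 0, hi = s->sparse_len - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (s->sparse[mid] == xid)
            return true;
        if (s->sparse[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}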
On Sunday, June 16, 2013, Heikki Linnakangas wrote:
> On 06.05.2013 04:51, Mark Kirkwood wrote:
>> On 05/05/13 00:49, Simon Riggs wrote:
>>> On 3 May 2013 13:41, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>>> since we don't *need* to check that, so if we keep checking the same
>>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>>> seem to be long running. That's slightly more coding than my quick
>>>> hack here but seems worth it.
>>>>
>>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>>
>>>> This is a similar optimisation to the one I introduced for
>>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>>> checking of as yet-incomplete xids, and to bulk concurrent
>>>> transactions.
>>>
>>> ISTM we can improve performance of TransactionIdIsInProgress() by
>>> caching the procno of our last xid.
>>>
>>> Mark, could you retest with both these patches? Thanks.
>>
>> Thanks Simon, will do and report back.
>
> Did anyone ever try (3) ?
>
> I'm not sure if this the same idea as (3) above, but ISTM that
> HeapTupleSatisfiesMVCC doesn't actually need to call
> TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
> comment at the top of tqual.c says:

Or at least, it doesn't need to call TransactionIdIsInProgress() if
XidInMVCCSnapshot() returned true.

>> * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
>> * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
>> * pg_clog). Otherwise we have a race condition: we might decide that a
>> * just-committed transaction crashed, because none of the tests succeed.
>> * xact.c is careful to record commit/abort in pg_clog before it unsets
>> * MyPgXact->xid in PGXACT array. That fixes that problem, but it also
>> * means there is a window where TransactionIdIsInProgress and
>> * TransactionIdDidCommit will both return true. If we check only
>> * TransactionIdDidCommit, we could consider a tuple committed when a
>> * later GetSnapshotData call will still think the originating transaction
>> * is in progress, which leads to application-level inconsistency. The
>> * upshot is that we gotta check TransactionIdIsInProgress first in all
>> * code paths, except for a few cases where we are looking at
>> * subtransactions of our own main transaction and so there can't be any
>> * race condition.
>
> If TransactionIdIsInProgress() returns true for a given XID, then surely
> it was also running when the snapshot was taken (or had not even began
> yet). In which case the XidInMVCCSnapshot() call will also return true.
> Am I missing something?
>
> There's one little problem: we currently only set the hint bits when
> TransactionIdIsInProgress() returns false. If we do that earlier,
But why would we do that earlier? If we never bother to call TransactionIdIsInProgress(), then we just don't set the hint bits, because we don't know what to set them to. It can't matter what order we call TransactionIdIsInProgress and TransactionIdDidCommit in if we don't call either of them at all during this invocation.
Cheers,
Jeff
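To make that ordering concrete, here is a minimal self-contained sketch of the control flow under discussion. It is not the real HeapTupleSatisfiesMVCC; the toy arrays stand in for the snapshot, the proc array and pg_clog, and all names are invented.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Tiny toy state standing in for the snapshot, the proc array and the clog. */
static const TransactionId snapshot_xip[] = {200};   /* XIDs our snapshot saw as running */
static const TransactionId proc_array[]   = {200};   /* XIDs running right now */
static const TransactionId committed[]    = {150};   /* XIDs recorded as committed */

static bool
in_list(const TransactionId *list, int len, TransactionId xid)
{
    for (int i = 0; i < len; i++)
        if (list[i] == xid)
            return true;
    return false;
}

#define IN(list, xid) in_list((list), (int) (sizeof(list) / sizeof((list)[0])), (xid))

/*
 * Sketch of the reordering being discussed: consult our own snapshot first.
 * If the inserting XID is in our snapshot the tuple is invisible to us no
 * matter what has happened since, so we return without touching the shared
 * proc array at all and, crucially, without setting any hint bits, because
 * we never learned whether the transaction finished.
 */
static bool
inserted_by_xact_visible_to_us(TransactionId xmin, bool *may_set_hint_bits)
{
    *may_set_hint_bits = false;

    if (IN(snapshot_xip, xmin))
        return false;                 /* invisible; proc array never consulted */

    /* Keep the documented order for the remaining checks: in-progress before
     * committed, to avoid mistaking a just-committed transaction for a
     * crashed one. */
    if (IN(proc_array, xmin))
        return false;

    *may_set_hint_bits = true;        /* the transaction has definitely ended */
    return IN(committed, xmin);
}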