Thread: COUNT(*) and index-only scans
I talked to Robert Haas and he said that index-only scans do not optimize COUNT(*). Is this something we can do for PG 9.2? Is anyone working on this? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 10 October 2011 18:23, Bruce Momjian <bruce@momjian.us> wrote: > I talked to Robert Haas and he said that index-only scans do not > optimize COUNT(*). Is this something we can do for PG 9.2? Is anyone > working on this? Yes it does, provided that there is an appropriate WHERE clause. But yes, I think we definitely want this if it's relatively easy. In addition to this, it's not always easy to get it to use an index-only scan even if it's going to be significantly faster. I'm assuming some supporting planner work needs to be added too. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 10, 2011 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote: > I talked to Robert Haas and he said that index-only scans do not > optimize COUNT(*). Is this something we can do for PG 9.2? Is anyone > working on this? People usually conflate multiple problems when they talk about "count(*)". The usual case people are concerned about would require materialized views, not index-only scans. -- greg
Bruce Momjian <bruce@momjian.us> wrote: > I talked to Robert Haas and he said that index-only scans do not > optimize COUNT(*). Is this something we can do for PG 9.2? Is > anyone working on this? Well, it's not that it doesn't optimize COUNT(*) -- it's that it doesn't yet cost the index scan as cheaper than a table scan when you're accessing every row.

create table t (id int not null primary key);
insert into t select generate_series(1, 1000000);
vacuum freeze analyze;
explain analyze select count(*) from t
where id between 500000 and 500010;

That gives you an index-only scan; but without the WHERE clause it uses a seq scan. I think it's mainly a matter of doing enough benchmarks to figure out how best to model the costs of the index scan so that it can be picked for that case. -Kevin
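For reference, the whole-table form Kevin is contrasting with is just the same count without the range predicate; as he notes, this one currently gets a seq scan:

explain analyze select count(*) from t;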
On Mon, Oct 10, 2011 at 1:36 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Bruce Momjian <bruce@momjian.us> wrote: >> I talked to Robert Haas and he said that index-only scans do not >> optimize COUNT(*). Is this something we can do for PG 9.2? Is >> anyone working on this? > > Well, it's not that it doesn't optimize COUNT(*) -- it's that it > doesn't yet cost the index scan as cheaper than a table scan when > you're accessing every row. > > create table t (id int not null primary key); > insert into t select generate_series(1, 1000000); > vacuum freeze analyze; > explain analyze select count(*) from t > where id between 500000 and 500010; > > That gives you an index-only scan; but without the WHERE clause it > uses a seq scan. I think it's mainly a matter of doing enough > benchmarks to figure out how best to model the costs of the index > scan so that it can be picked for that case. Right now, our costing model for index-only scans is pretty dumb. It assumes that using an index-only scan will avoid 10% of the heap fetches. That could easily be low, and on an insert-only table or one where only the recently-updated rows are routinely accessed, it could also be high. To use an index-only scan for a full-table COUNT(*), we're going to have to be significantly smarter, because odds are good that skipping 10% of the heap fetches won't be sufficient inducement to the planner to go that route; we are going to need a real number. This isn't just an exercise in costing, though: right now, we don't even generate a plan to use an index for a full-table scan, because we assume that it can never be cheaper. This is actually not quite true even in previous releases (suppose the table is severely bloated but the index is not) and it's going to be less true now that we have index-only scans. So that's going to need some adjustment, too. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Bruce Momjian <bruce@momjian.us> wrote: >> I talked to Robert Haas and he said that index-only scans do not >> optimize COUNT(*). Is this something we can do for PG 9.2? Is >> anyone working on this? > Well, it's not that it doesn't optimize COUNT(*) -- it's that it > doesn't yet cost the index scan as cheaper than a table scan when > you're accessing every row. I think what Robert is complaining about is that we won't currently consider an index that matches neither any WHERE clauses nor ORDER BY, ie, count(*) over the whole table won't get considered for an index-only scan, regardless of cost estimates. regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote: > Right now, our costing model for index-only scans is pretty dumb. > It assumes that using an index-only scan will avoid 10% of the > heap fetches. That could easily be low, and on an insert-only > table or one where only the recently-updated rows are routinely > accessed, it could also be high. As a reality check, I just ran this query on a table in a statewide copy of our data:

select count(*),
  sum(case when xmin = '2'::xid then 0 else 1 end) as read_heap
  from "CaseHist";

and got:

   count   | read_heap
-----------+-----------
 205765311 |   3934924

So on our real-world database, it would skip something on the order of 98% of the heap reads, right? > This isn't just an exercise in costing, though: right now, we > don't even generate a plan to use an index for a full-table scan, > because we assume that it can never be cheaper. This is actually > not quite true even in previous releases (suppose the table is > severely bloated but the index is not) and it's going to be less > true now that we have index-only scans. So that's going to need > some adjustment, too. OK. Thanks for clarifying. -Kevin
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think what Robert is complaining about is that we won't > currently consider an index that matches neither any WHERE clauses > nor ORDER BY, ie, count(*) over the whole table won't get > considered for an index-only scan, regardless of cost estimates. I guess the trick would be to get it to consider such plans only under some conditions, to avoid explosive growth in planning time for some types of queries. Some statistics bucket for the number of non-frozen tuples in the relation, maybe? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think what Robert is complaining about is that we won't >> currently consider an index that matches neither any WHERE clauses >> nor ORDER BY, ie, count(*) over the whole table won't get >> considered for an index-only scan, regardless of cost estimates. > I guess the trick would be to get it to consider such plans only > under some conditions, to avoid explosive growth in planning time > for some types of queries. Some statistics bucket for the number of > non-frozen tuples in the relation, maybe? My intention was to allow it to consider any covering index. You're thinking about the cost estimate, which is really entirely different. regards, tom lane
On Mon, Oct 10, 2011 at 10:36 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Bruce Momjian <bruce@momjian.us> wrote: >> I talked to Robert Haas and he said that index-only scans do not >> optimize COUNT(*). Is this something we can do for PG 9.2? Is >> anyone working on this? > > Well, it's not that it doesn't optimize COUNT(*) -- it's that it > doesn't yet cost the index scan as cheaper than a table scan when > you're accessing every row. > > create table t (id int not null primary key); > insert into t select generate_series(1, 1000000); > vacuum freeze analyze; > explain analyze select count(*) from t > where id between 500000 and 500010; > > That gives you an index-only scan; but without the WHERE clause it > uses a seq scan. If you convert the where clause to "where id is not null" it uses the index only scan again, but only if you nudge it too with enable_seqscan=off. I'm not sure why it needs the nudge in one case but not the other. Cheers, Jeff
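A minimal sketch of the nudge Jeff describes, against Kevin's test table t from upthread (enable_seqscan is the standard planner setting; this is a test-harness trick, not a recommendation for production queries):

set enable_seqscan = off;  -- push the planner off the seq scan
explain analyze
select count(*) from t where id is not null;  -- degenerate qual that still matches the index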
On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > My intention was to allow it to consider any covering index. You're > thinking about the cost estimate, which is really entirely different. > Is there any reason to consider more than one? I would have expected the narrowest one to be the best choice. There's something to be said for using the same index consistently but we already have that problem and make no attempt to do that. And partial indexes might be better but then we would already be considering them if their constraints are satisfied. -- greg
> Jeff Janes wrote: > Kevin Grittner wrote: >> create table t (id int not null primary key); >> insert into t select generate_series(1, 1000000); >> vacuum freeze analyze; >> explain analyze select count(*) from t >> where id between 500000 and 500010; >> >> That gives you an index-only scan; but without the WHERE clause it >> uses a seq scan. > > If you convert the where clause to "where id is not null" it uses > the index only scan again, but only if you nudge it too with > enable_seqscan=off. Clever way to get a full-table test. It turns out that for the above, with your trick to use the index only scan, it comes out 12% faster to do a seqscan, even when the table and index are fully cached (based on the average time of ten runs each way). There's very little overlap, so the difference looks real. But that's on a very narrow record, having just the one column used in the index. I added one wide column like this:

alter table t add column x text;
update t set x = (repeat(random()::text, (random() * 100)::int));
cluster t USING t_pkey;
vacuum freeze analyze;

With that change the index-only scan time remained unchanged, while the seqscan time grew to about 2.6 times the index only scan time. That was mildly surprising for me, considering it was all still cached. -Kevin
>> Jeff Janes wrote: >> Kevin Grittner wrote: > >>> create table t (id int not null primary key); >>> insert into t select generate_series(1, 1000000); >>> vacuum freeze analyze; >>> explain analyze select count(*) from t >>> where id between 500000 and 500010; >>> >>> That gives you an index-only scan; but without the WHERE clause it >>> uses a seq scan. >> >> If you convert the where clause to "where id is not null" it uses >> the index only scan again, but only if you nudge it too with >> enable_seqscan=off. > > Clever way to get a full-table test. > > It turns out that for the above, with your trick to use the index > only scan, it comes out 12% faster to do a seqscan, even when the > table and index are fully cached (based on the average time of ten > runs each way). There's very little overlap, so the difference looks > real. But that's on a very narrow record, having just the one column > used in the index. I added one wide column like this: > > alter table t add column x text; > update t set x = (repeat(random()::text, (random() * 100)::int)); > cluster t USING t_pkey; > vacuum freeze analyze; > > With that change the index-only scan time remained unchanged, while > the seqscan time grew to about 2.6 times the index only scan time. > That was mildly surprising for me, considering it was all still > cached. Moving data around in memory isn't that cheap, and you have probably made the row size 5 times larger with the change above. However, if you added 4000 bytes instead of 100, you should be back to the previous results, since the data then ends up (most likely; otherwise add a bit more) in a toast table. Jesper
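A sketch of the variant Jesper suggests, reusing Kevin's table t. The repeat count of 250 is an assumption (random()::text is typically 17-19 characters, so this yields roughly 4-5 kB per value), and the SET STORAGE step is an extra precaution not in Jesper's note: repeated text compresses extremely well, so without it the values might stay inline in compressed form rather than moving to the toast table:

alter table t alter column x set storage external;  -- force out-of-line storage
update t set x = repeat(random()::text, 250);       -- roughly 4-5 kB per value
cluster t using t_pkey;
vacuum freeze analyze;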
On Mon, Oct 10, 2011 at 3:15 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Robert Haas <robertmhaas@gmail.com> wrote: > >> Right now, our costing model for index-only scans is pretty dumb. >> It assumes that using an index-only scan will avoid 10% of the >> heap fetches. That could easily be low, and on an insert-only >> table or one where only the recently-updated rows are routinely >> accessed, it could also be high. > > As a reality check, I just ran this query on a table in a statewide > copy of our data: > > select count(*), > sum(case when xmin = '2'::xid then 0 else 1 end) as read_heap > from "CaseHist"; > > and got: > > count | read_heap > -----------+----------- > 205765311 | 3934924 > > So on our real-world database, it would skip something on the order > of 98% of the heap reads, right? Yeah, if it's scanning the whole table. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Stark wrote: > On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > My intention was to allow it to consider any covering index. You're > > thinking about the cost estimate, which is really entirely different. > > > > Is there any reason to consider more than one? I would have expected > the narrowest one to be the best choice. There's something to be said > for using the same index consistently but we already have that problem > and make no attempt to do that. And partial indexes might be better > but then we would already be considering them if their constraints are > satisfied. Actually, I think the smallest non-partial one on disk might be the best --- that is very easy to find out. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Tue, Oct 11, 2011 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote: > Greg Stark wrote: >> On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> > My intention was to allow it to consider any covering index. You're >> > thinking about the cost estimate, which is really entirely different. >> > >> >> Is there any reason to consider more than one? I would have expected >> the narrowest one to be the best choice. There's something to be said >> for using the same index consistently but we already have that problem >> and make no attempt to do that. And partial indexes might be better >> but then we would already be considering them if their constraints are >> satisfied. > > Actually, I think the smallest non-partial one on disk might be the best > --- that is very easy to find out. I doubt there is any need to write special-purpose code to decide which index ought to be used for a full table scan. We can just throw all of the otherwise-useless indexes into the costing machinery with empty pathkeys, and let them duke it out. All but the best one will be instantly discarded, and the best one will either beat or lose to a sequential scan. All of this will happen before we start trying to build join paths, so there's no combinatorial explosion in planning time - it'll just be a straightforward cost comparison between plans with identical pathkeys, and should be quite fast. The real issue is that the costing estimates need to be accurate, and that's where the rubber hits the road. Otherwise, even if we pick the right way to scan the table, we may do silly things up the line when we go to start constructing the join order. I think we need to beef up ANALYZE to gather statistics on the fraction of the pages that are marked all-visible, or maybe VACUUM should gather that information. The trouble is that if we VACUUM and then ANALYZE, we'll often get back a value very close to 100%, but then the real value may diminish quite a bit before the next auto-analyze fires. I think if we can figure out what to do about that problem we'll be well on our way... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Oct 11, 2011 at 5:45 AM, Greg Stark <stark@mit.edu> wrote: > On Mon, Oct 10, 2011 at 9:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> My intention was to allow it to consider any covering index. You're >> thinking about the cost estimate, which is really entirely different. >> > > Is there any reason to consider more than one? I would have expected > the narrowest one to be the best choice. There's something to be said > for using the same index consistently but we already have that problem > and make no attempt to do that. And partial indexes might be better > but then we would already be considering them if their constraints are > satisfied. You raise a fantastic idea: use an index's frequency of use as a factor in the cost of optimising a query. We have previously discussed the idea of using the RAM residency of an index to control the cost. That is difficult to judge. Using the long-term prevalence of usage as a weighting factor makes a great deal of sense for queries that could potentially utilise multiple indexes. That information is readily available and directly applicable. The prevalence of use directly drives RAM residency, so it makes sense to use the causal factor as input to the cost. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
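The "readily available" usage information Simon refers to is presumably the cumulative per-index counters in the standard statistics views; for example, for the test table t from upthread:

select indexrelname, idx_scan, idx_tup_read
from pg_stat_user_indexes
where relname = 't'
order by idx_scan desc;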
> The trouble is that if we VACUUM and then ANALYZE, we'll often get > back a value very close to 100%, but then the real value may diminish > quite a bit before the next auto-analyze fires. I think if we can > figure out what to do about that problem we'll be well on our way... It's not so much an issue of when the last auto-analyze was as an issue of the number of rows in write transactions against that table in the last X minutes. This is where it really hurts us that pg_stat_user_tables is not time-based. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
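For context, the counters Josh is alluding to are cumulative since the last statistics reset rather than windowed over the last X minutes; a sketch of what is available today for the test table t (standard pg_stat_user_tables columns):

select relname, n_tup_ins, n_tup_upd, n_tup_del,
       last_autovacuum, last_autoanalyze
from pg_stat_user_tables
where relname = 't';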
On Tue, Oct 11, 2011 at 3:18 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> The trouble is that if we VACUUM and then ANALYZE, we'll often get >> back a value very close to 100%, but then the real value may diminish >> quite a bit before the next auto-analyze fires. I think if we can >> figure out what to do about that problem we'll be well on our way... > > It's not so much an issue of when the last auto-analyze was as an issue > of the number of rows in write transactions against that table in the > last X minutes. This is where it really hurts us that > pg_stat_user_tables is not time-based. The number of write transactions in the last X minutes seems pretty much irrelevant. What matters is the number of previously-all-visible pages written since the last vacuum. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote: > The real issue is that the costing estimates need to be accurate, and > that's where the rubber hits the road. Otherwise, even if we pick the > right way to scan the table, we may do silly things up the line when > we go to start constructing the join order. I think we need to beef > up ANALYZE to gather statistics on the fraction of the pages that are > marked all-visible, or maybe VACUUM should gather that information. > The trouble is that if we VACUUM and then ANALYZE, we'll often get > back a value very close to 100%, but then the real value may diminish > quite a bit before the next auto-analyze fires. I think if we can > figure out what to do about that problem we'll be well on our way... Can you send stats messages to keep track of when you unset a bit in the VM? That might allow it to be more up-to-date. Regards, Jeff Davis
On Wed, Oct 12, 2011 at 2:50 AM, Jeff Davis <pgsql@j-davis.com> wrote: > On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote: >> The real issue is that the costing estimates need to be accurate, and >> that's where the rubber hits the road. Otherwise, even if we pick the >> right way to scan the table, we may do silly things up the line when >> we go to start constructing the join order. I think we need to beef >> up ANALYZE to gather statistics on the fraction of the pages that are >> marked all-visible, or maybe VACUUM should gather that information. >> The trouble is that if we VACUUM and then ANALYZE, we'll often get >> back a value very close to 100%, but then the real value may diminish >> quite a bit before the next auto-analyze fires. I think if we can >> figure out what to do about that problem we'll be well on our way... > > Can you send stats messages to keep track when you unset a bit in the > VM? That might allow it to be more up-to-date. In theory, that seems like it would work, although I'm a little worried about the overhead. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 12, 2011 at 2:50 AM, Jeff Davis <pgsql@j-davis.com> wrote: >> On Tue, 2011-10-11 at 13:22 -0400, Robert Haas wrote: >>> The real issue is that the costing estimates need to be accurate, and >>> that's where the rubber hits the road. >> Can you send stats messages to keep track when you unset a bit in the >> VM? That might allow it to be more up-to-date. > In theory, that seems like it would work, although I'm a little > worried about the overhead. I think it's overkill, and possibly unpleasantly unstable as well. For the initial attack on this we should just have VACUUM and ANALYZE count the number of all-visible blocks and store that in pg_class along with the tuple-count statistics. There's no reason to think that this will be any worse than the way we deal with dead tuple counts, for instance. What bothers me considerably more is the issue about how specific queries might see an all-visible fraction that's very substantially different from the table's overall ratio, especially in examples such as historical-data tables where most of the update and query activity has to do with recently-added rows. I don't see any practical way to attack that problem with statistics; we're just going to have to adopt some estimation rule. What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction of the table expected to be scanned gets smaller. That is, if the query fetches nearly all of the table, take the stored visibility ratio at face value; if it fetches only one block, never believe that that will be an all-visible block; and in general if we're expecting to read a fraction f of the pages, multiply the whole-table visibility ratio by f before using it in the cost estimate. This amounts to assuming that the historical-data case is the usual case, but I'm not sure that's unfair. regards, tom lane
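A sketch of what inspecting such a statistic could look like, assuming the all-visible block count is stored per relation in pg_class as Tom suggests; the column name relallvisible is the one that eventually appeared and is used here purely for illustration. Tom's first-cut rule would then multiply allvisible_fraction by the fraction f of the pages the scan is expected to read before feeding it into the cost estimate:

select relpages,
       relallvisible,
       relallvisible::float8 / greatest(relpages, 1) as allvisible_fraction
from pg_class
where relname = 't';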
On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think it's overkill, and possibly unpleasantly unstable as well. > For the initial attack on this we should just have VACUUM and ANALYZE > count the number of all-visible blocks and store that in pg_class along > with the tuple-count statistics. There's no reason to think that this > will be any worse than the way we deal with dead tuple counts, for > instance. So I have a theory. Assuming you're in a steady-state situation the amount of all-visible blocks will fluctuate from a high just after vacuum to a low just before the next vacuum. There are other ways a block can be marked all-visible but for the most part I would expect the fraction to go steadily down until vacuum comes along and cleans things up. So if vacuum tracked the fraction of blocks marked all-visible *before* it processed them and the fraction it marked all-visible after processing we have an upper and lower bound. If we knew how long it's been since vacuum we could interpolate between those, or we could just take the mean, or we could take the lower bound as a conservative estimate. > What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction > of the table expected to be scanned gets smaller. I think there's a statistically more rigorous way of accomplishing the same thing. If you treat the pages we estimate we're going to read as a random sample of the population of pages then your expected value is the fraction of the overall population that is all-visible but your 95th percentile confidence interval will be, uh, a simple formula we can compute but I don't recall off-hand. This gets back to a discussion long-ago of what estimates the planner should be using. It currently uses all expected values but in many cases it would be valuable if the planner knew what the standard deviation of those estimates was. It might sometimes be better to pick a plan that's slightly worse on average but is less likely to be much worse. -- greg
On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > What bothers me considerably more is the issue about how specific > queries might see an all-visible fraction that's very substantially > different from the table's overall ratio, especially in examples such as > historical-data tables where most of the update and query activity has > to do with recently-added rows. I don't see any practical way to attack > that problem with statistics; we're just going to have to adopt some > estimation rule. What I suggest as a first cut for that is: simply > derate the visibility fraction as the fraction of the table expected to > be scanned gets smaller. That is, if the query fetches nearly all of > the table, take the stored visibility ratio at face value; if it fetches > only one block, never believe that that will be an all-visible block; > and in general if we're expecting to read a fraction f of the pages, > multiply the whole-table visibility ratio by f before using it in the > cost estimate. This amounts to assuming that the historical-data case > is the usual case, but I'm not sure that's unfair. I don't think that's an unfair assumption -- in fact I think it's exactly the right assumption -- but I'm worried about how the math works out with that specific proposal.

- Suppose VACUUM processes the table and makes it all-visible. Then, somebody comes along and updates one tuple on every page, making them all not-all-visible, but not triggering VACUUM because we're nowhere close to the 20% threshold. Now COUNT(*) will think it should use an index-scan, but really... not so much. In fact, even if it's only that a tuple has been updated on 25% of the pages, we're probably in trouble.

- Suppose the table has a million rows and we're going to read 100 of them, or 0.01%. Now it might appear that a covering index has a negligible advantage over a non-covering index, but in fact I think we still want to err on the side of trying to use the covering index. In fact, even if we're only reading a single row, we probably still generally want to pick up the covering index, to cater to the case where someone is doing primary key fetches against a gigantic table and hoping that index-only scans will save them from random I/O hell. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
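A rough sketch of the first pattern Robert describes, against the 1M-row test table t from upthread; the modulus is only a guess at "about one row per heap page" for that table and would need adjusting for other row widths:

vacuum t;                                 -- sets the all-visible bits
update t set id = id where id % 225 = 0;  -- ~0.4% of rows changed: far below the 20%
                                          -- autovacuum threshold, yet it clears the
                                          -- all-visible bit on a large fraction of the pages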
Greg Stark <stark@mit.edu> writes: > Assuming you're in a steady-state situation the amount of all-visible > blocks will fluctuate from a high just after vacuum to a low just > before the next vacuum. There are other ways a block can be marked > all-visible but for the most part I would expect the fraction to go > steadily down until vacuum comes along and cleans things up. > So if vacuum tracked the fraction of blocks marked all-visible > *before* it processed them and the fraction it marked all-visible > after processing we have an upper and lower bound. If we knew how long > it's been since vacuum we could interpolate between those, or we could > just take the mean, or we could take the lower bound as a conservative > estimate. I thought of that too, but we don't do the comparable thing for dead tuple counts, and I am not convinced that we should do it for visibility. I'd rather have a simple rule that "it's right immediately after VACUUM", so that at least trivial cases like read-only tables work correctly. >> What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction >> of the table expected to be scanned gets smaller. > I think there's a statistically more rigorous way of accomplishing the > same thing. If you treat the pages we estimate we're going to read as > a random sample of the population of pages then your expected value is > the fraction of the overall population that is all-visible but your > 95th percentile confidence interval will be, uh, a simple formula we > can compute but I don't recall off-hand. The problem is precisely that the pages a query is going to read are likely to *not* be a random sample, but to be correlated with recently-dirtied pages. > ... It currently uses all expected values but in many > cases it would be valuable if the planner knew what the standard > deviation of those estimates was. No doubt, but I'm not volunteering to fix that before we can have a non-toy estimate for index-only scans. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What bothers me considerably more is the issue about how specific >> queries might see an all-visible fraction that's very substantially >> different from the table's overall ratio, > - Suppose VACUUM processes the table and makes it all-visible. Then, > somebody comes along and updates one tuple on every page, making them > all not-all-visible, but not trigger VACUUM because we're nowhere > close the 20% threshold. Now COUNT(*) will think it should use an > index-scan, but really... not so much. In fact, even if it's only > that a tuple has been updated on 25% of the pages, we're probably in > trouble. Yeah, but that would be a pretty unlucky pattern, and in any case the fix for it is going to be to make autovacuum more aggressive. > - Suppose the table has a million rows and we're going to read 100 of > them, or 0.01%. Now it might appear that a covering index has a > negligible advantage over a non-covering index, but in fact I think we > still want to err on the side of trying to use the covering index. Given that fact pattern we still will, I think. We'll still prefer an indexscan over a seqscan, for sure. In any case, if you believe the assumption that those 100 rows are more likely to be recently-dirtied than the average row, I'm not sure why you think we should be trying to force an assumption that index-only will succeed here. regards, tom lane
On Wed, Oct 12, 2011 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> - Suppose the table has a million rows and we're going to read 100 of >> them, or 0.01%. Now it might appear that a covering index has a >> negligible advantage over a non-covering index, but in fact I think we >> still want to err on the side of trying to use the covering index. > > Given that fact pattern we still will, I think. We'll still prefer an > indexscan over a seqscan, for sure. In any case, if you believe the > assumption that those 100 rows are more likely to be recently-dirtied > than the average row, I'm not sure why you think we should be trying to > force an assumption that index-only will succeed here. The elephant in the room is that the index-only-scan really doesn't save a *whole* lot if the heap pages are already in shared buffers. But it matters a *lot* when the heap pages are not in shared buffers (both ways, saving IO, or causing lots of random IO). Can we hope that if pages are not in shared buffers, they are not recently modified, so hopefully both all visible, and have the VM bit set? Or does the table-based nature of vacuum mean there is no value there? -- Aidan Van Dyk Create like a god, aidan@highrise.ca command like a king, http://www.highrise.ca/ work like a slave.
On Wed, Oct 12, 2011 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction >>> of the table expected to be scanned gets smaller. > >> I think there's a statistically more rigorous way of accomplishing the >> same thing. If you treat the pages we estimate we're going to read as >> a random sample of the population of pages then your expected value is >> the fraction of the overall population that is all-visible but your >> 95th percentile confidence interval will be, uh, a simple formula we >> can compute but I don't recall off-hand. > > The problem is precisely that the pages a query is going to read are > likely to *not* be a random sample, but to be correlated with > recently-dirtied pages. Sure, but I was suggesting aiming for the nth percentile rather than a linear factor which I don't know has any concrete meaning. -- greg
Greg Stark <stark@mit.edu> writes: > On Wed, Oct 12, 2011 at 3:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The problem is precisely that the pages a query is going to read are >> likely to *not* be a random sample, but to be correlated with >> recently-dirtied pages. > Sure, but I was suggesting aiming for the nth percentile rather than a > linear factor which I don't know has any concrete meaning. Well, I have no problem with using a more complicated estimation equation, but it might be nice to get some field experience with the thing before we start complicating matters. regards, tom lane
Aidan Van Dyk <aidan@highrise.ca> wrote: > The elephant in the room is that the index-only-scan really > doesn't save a *whole* lot if the heap pages are already in shared > buffers. It's not hard to create a simple test case where it's about three times slower to go to cached heap pages than to use the values from the index. That was just my first try, so it's not likely to be a real "worst case", although I was using the default shared_buffers size, so a lot of the heap pages probably came from the OS cache, rather than being in shared memory. > But it matters a *lot* when they heap pages are not in shared > buffers Yeah, obviously it matters more if you actually need to add a random disk read. -Kevin
On Wed, Oct 12, 2011 at 4:26 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> But it matters a *lot* when they heap pages are not in shared >> buffers > > Yeah, obviously it matters more if you actually need to add a random > disk read. To be fair the indexes are also random I/O. So the case that really matters is when the index fits in RAM but the heap does not. -- greg
Aidan Van Dyk <aidan@highrise.ca> writes: > The elephant in the room is that the index-only-scan really doesn't > save a *whole* lot if the heap pages are already in shared buffers. > But it matters a *lot* when they heap pages are not in shared buffers > (both ways, saving IO, or causing lots of random IO) > Can we hope that if pages are not in shared buffers, they are not > recently modified, so hopefully both all visible, and have the VM > bit?set? Or does the table-based nature of vacuum mean there is no > value there? Hmm, that's an interesting point. If you suppose that recently-modified pages are likely to still be in memory then it could well be that an index-only scan is relatively cheap (i.e., not many actual disk reads) no matter whether it hits recently-modified pages or not. So maybe the first cut should just be to measure the overall visibility fraction and use that at face value. regards, tom lane
On Wed, Oct 12, 2011 at 10:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Oct 12, 2011 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> What bothers me considerably more is the issue about how specific >>> queries might see an all-visible fraction that's very substantially >>> different from the table's overall ratio, > >> - Suppose VACUUM processes the table and makes it all-visible. Then, >> somebody comes along and updates one tuple on every page, making them >> all not-all-visible, but not trigger VACUUM because we're nowhere >> close the 20% threshold. Now COUNT(*) will think it should use an >> index-scan, but really... not so much. In fact, even if it's only >> that a tuple has been updated on 25% of the pages, we're probably in >> trouble. > > Yeah, but that would be a pretty unlucky pattern, and in any case the > fix for it is going to be to make autovacuum more aggressive. Hmm, maybe. >> - Suppose the table has a million rows and we're going to read 100 of >> them, or 0.01%. Now it might appear that a covering index has a >> negligible advantage over a non-covering index, but in fact I think we >> still want to err on the side of trying to use the covering index. > > Given that fact pattern we still will, I think. We'll still prefer an > indexscan over a seqscan, for sure. In any case, if you believe the > assumption that those 100 rows are more likely to be recently-dirtied > than the average row, I'm not sure why you think we should be trying to > force an assumption that index-only will succeed here. I'm not concerned about an index scan vs. a sequential scan here. I'm concerned about it being impossible for the DBA to get an index-only scan when s/he wants it very badly. The current (stupid) formula handles this case just about perfectly - it will prefer a smaller index over a larger one, except when a covering index is available, in which case it will prefer the smallest covering index. That sounds exactly right to me. We get that behavior because the 10% of heap fetches that we're assuming we'll get to skip is larger than the penalty for using a bigger index. If we take out 10% and replace it by all_visible_percentage * fraction_of_tuples_fetched, then that 10% is going to drop to some infinitesimally small value on single row fetches from giant tables. But that's exactly one of the cases for which people want index-only scans in the first place. It's no better to be overly pessimistic here than it is to be overly optimistic. If the table is 90% all-visible, the probability of our finding an all-visible row is probably not 90%. But it's probably not 0.01% or 0.0001% either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I'm not concerned about an index scan vs. a sequential scan here. I'm > concerned about it being impossible for the DBA to get an index-only > scan when s/he wants it very badly. The current (stupid) formula > handles this case just about perfectly - it will prefer a smaller > index over a larger one, except when a covering index is available, in > which case it will prefer the smallest covering index. That sounds > exactly right to me. We get that behavior because the 10% of heap > fetches that we're assuming we'll get to skip is larger than the > penalty for using a bigger index. If we take out 10% and replace it > by all_visible_percentage * fraction_of_tuples_fetched, then that 10% > is going to drop to some infinitesmally small value on single row > fetches from giant tables. But that's exactly one of the cases for > which people want index-only scans in the first place. It's no better > to be overly pessimistic here than it is to be overly optimistic. If > the table is 90% all-visible, the probability of our finding an > all-visible row is probably not 90%. But it's probably not 0.01% or > 0.0001% either. I think you're overstating the size of the problem. Given that fact pattern, the thing will choose an indexscan anyway, because it's the cheapest alternative even under traditional costing rules. And it will choose to use an index-only scan if the index is covering. It doesn't matter what the exact cost estimate is. The place where the decision is actually somewhat hard, IMO, is where you're pulling a small part of the table but significantly more than one row, and the traditional best choice would be a bitmap scan. Now we have to guess whether possibly avoiding heap fetches is better than batching the fetches, and it doesn't seem open-and-shut to me. But having said that, I see some credibility in Aidan's suggestion that pages that actually have to be fetched from disk are disproportionately likely to be all-visible. That would counteract the history-table effect of correlation between current reads and recent changes, probably not perfectly, but to a considerable extent. So right at the moment I'm inclined to just apply the most-recently-measured visibility fraction at face value. We shouldn't complicate matters more than that until we have more field experience to tell us what really happens. regards, tom lane
On Wed, Oct 12, 2011 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I'm not concerned about an index scan vs. a sequential scan here. I'm >> concerned about it being impossible for the DBA to get an index-only >> scan when s/he wants it very badly. The current (stupid) formula >> handles this case just about perfectly - it will prefer a smaller >> index over a larger one, except when a covering index is available, in >> which case it will prefer the smallest covering index. That sounds >> exactly right to me. We get that behavior because the 10% of heap >> fetches that we're assuming we'll get to skip is larger than the >> penalty for using a bigger index. If we take out 10% and replace it >> by all_visible_percentage * fraction_of_tuples_fetched, then that 10% >> is going to drop to some infinitesmally small value on single row >> fetches from giant tables. But that's exactly one of the cases for >> which people want index-only scans in the first place. It's no better >> to be overly pessimistic here than it is to be overly optimistic. If >> the table is 90% all-visible, the probability of our finding an >> all-visible row is probably not 90%. But it's probably not 0.01% or >> 0.0001% either. > > I think you're overstating the size of the problem. Given that fact > pattern, the thing will choose an indexscan anyway, because it's the > cheapest alternative even under traditional costing rules. And it will > choose to use an index-only scan if the index is covering. It doesn't > matter what the exact cost estimate is. Maybe I'm not being clear. The case I'm worried about is when you have a table with an id column (an integer) and a name column (text). You have a unique index on (id), and an index on (id, name). You then do SELECT name FROM foo WHERE id = ?. I understand it's going to pick *an* index-scan, but which index is it going to pick? The unique index, because it's smaller, or the other one, because it's covering? I think if the table is large your proposal will lead to ignoring the covering index in favor of the smaller one, and I submit that's not what we want, because, on the average, the index-only approach is going to succeed often enough to be worth it. > The place where the decision is actually somewhat hard, IMO, is where > you're pulling a small part of the table but significantly more than one > row, and the traditional best choice would be a bitmap scan. Now we > have to guess whether possibly avoiding heap fetches is better than > batching the fetches, and it doesn't seem open-and-shut to me. Yes, I agree. I was actually wondering if there is some way we could make index-only scans work for bitmap index scans. Something like this: If the index is covering, then as we retrieve each tuple, we check whether the page is all-visible. If so, we return the data from the index tuple. If not, we save the TID for later. Then, when we get done scanning the index, we go back and fetch all the pages containing saved TIDs in ascending block number order. The trouble is that I think you could get in some trouble if you use a TID bitmap here, because if you lossify the bitmap then you have to make sure you can't return a tuple that you already handled with the index-only approach (imagine that the visibility map bit gets cleared partway through the scan). All in all this seems pretty complicated...
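A concrete sketch of the two-index case Robert describes above; the table and query come from his example, while the covering index name is an illustrative addition:

create table foo (id integer primary key, name text);
create index foo_id_name_idx on foo (id, name);  -- the covering index
-- Which index does the planner pick here: the smaller unique one, or the
-- covering one that keeps an index-only scan possible?
explain select name from foo where id = 42;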
> But having said that, I see some credibility in Aidan's suggestion that > pages that actually have to be fetched from disk are disproportionately > likely to be all-visible. That would counteract the history-table > effect of correlation between current reads and recent changes, > probably not perfectly, but to a considerable extent. So right at the > moment I'm inclined to just apply the most-recently-measured visibility > fraction at face value. We shouldn't complicate matters more than that > until we have more field experience to tell us what really happens. Fetches from disk aren't the only problem; thrashing shared_buffers is pretty expensive, too. But it's an interesting point. I guess we could give it a try and see what happens. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 12, 2011 at 03:16:54PM +0100, Greg Stark wrote: > On Wed, Oct 12, 2011 at 2:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think it's overkill, and possibly unpleasantly unstable as well. > > For the initial attack on this we should just have VACUUM and ANALYZE > > count the number of all-visible blocks and store that in pg_class along > > with the tuple-count statistics. There's no reason to think that this > > will be any worse than the way we deal with dead tuple counts, for > > instance. > > So I have a theory. > > Assuming you're in a steady-state situation the amount of all-visible > blocks will fluctuate from a high just after vacuum to a low just > before the next vacuum. There are other ways a block can be marked > all-visible but for the most part I would expect the fraction to go > steadily down until vacuum comes along and cleans things up. > > So if vacuum tracked the fraction of blocks marked all-visible > *before* it processed them and the fraction it marked all-visible > after processing we have an upper and lower bound. If we knew how long > it's been since vacuum we could interpolate between those, or we could > just take the mean, or we could take the lower bound as a conservative > estimate. > > > What I suggest as a first cut for that is: simply derate the visibility fraction as the fraction > > of the table expected to be scanned gets smaller. > > I think there's a statistically more rigorous way of accomplishing the > same thing. If you treat the pages we estimate we're going to read as > a random sample of the population of pages then your expected value is > the fraction of the overall population that is all-visible but your > 95th percentile confidence interval will be, uh, a simple formula we > can compute but I don't recall off-hand. Incidentally, I had a similar idea at PGCon relating to planning... My idea was to compute not just the cost but the sensitivity of the cost estimate for each plan. The sensitivity is the derivative of the cost. So, if the total cost was n^2 the sensitivity would be 2n. If you picked a tolerance (like 2 standard deviations) of the observed distribution, you could compare the expected cost and the expected 'unlucky cost' for plans. (Basically, this is parametric error propagation.) I know very little about the planner... I don't know how often it would lead to picking a better plan (it might not be worth the cost to compute), but it seemed like an interesting approach to me. Garick
On Mon, Oct 10, 2011 at 9:48 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> Jeff Janes wrote: >> Kevin Grittner wrote: > >>> create table t (id int not null primary key); >>> insert into t select generate_series(1, 1000000); >>> vacuum freeze analyze; >>> explain analyze select count(*) from t >>> where id between 500000 and 500010; >>> >>> That gives you an index-only scan; but without the WHERE clause it >>> uses a seq scan. >> >> If you convert the where clause to "where id is not null" it uses >> the index only scan again, but only if you nudge it too with >> enable_seqscan=off. With a recent commit from (I assume) Tom, the "where id is not null" is no longer needed. > Clever way to get a full-table test. > > It turns out that for the above, with your trick to use the index > only scan, it comes out 12% faster to do a seqscan, even when the > table and index are fully cached (based on the average time of ten > runs each way). There's very little overlap, so the difference looks > real. But that's on a very narrow record, having just the one column > used in the index. I added one wide column like this: > > alter table t add column x text; > update t set x = (repeat(random()::text, (random() * 100)::int)); > cluster t USING t_pkey; > vacuum freeze analyze; > > With that change the index-only scan time remained unchanged, while > the seqscan time grew to about 2.6 times the index only scan time. > That was mildly surprising for me, considering it was all still > cached. I used the pgbench_accounts table from pgbench -i -s 50, where all data fits in shared_buffers, using the -f switch with either set enable_seqscan=off; select count(*) from pgbench_accounts; or set enable_indexonlyscan=off; select count(*) from pgbench_accounts; With just a single client, it was a toss-up. But with 8 concurrent clients on a 8 CPU machine, the index-only scan was 50% faster. So that is a nice win, even if well-designed apps probably shouldn't be endlessly counting rows of an unchanging table using all available CPUs in the first place. Cheers, Jeff
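A sketch of how Jeff's comparison could be driven: each variant goes in its own custom script file (file names are illustrative) and is run with something like pgbench -n -c 8 -j 8 -T 60 -f <file> against the scale-50 database:

-- count_indexonly.sql: force the index-only plan
set enable_seqscan = off;
select count(*) from pgbench_accounts;

-- count_seqscan.sql: force the seq scan plan
set enable_indexonlyscan = off;
select count(*) from pgbench_accounts;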
On 12 October 2011 17:26, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Oct 12, 2011 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The place where the decision is actually somewhat hard, IMO, is where >> you're pulling a small part of the table but significantly more than one >> row, and the traditional best choice would be a bitmap scan. Now we >> have to guess whether possibly avoiding heap fetches is better than >> batching the fetches, and it doesn't seem open-and-shut to me. > > Yes, I agree. > > I was actually wondering if there is some way we could make index-only > scans work for bitmap index scans. Something like this: If the index > is covering, then as we retrieve each tuple, we check whether the page > is all-visible. If so, we return the data from the index tuple. If > not, we save the TID for later. Then, when we get done scanning the > index, we go back and fetch all the pages containing saved TIDs in > ascending block number order. The trouble is that I think you could > get in some trouble if you use a TID bitmap here, because if you > lossify the bitmap then you have to make sure you can't return a tuple > that you already handled with the index-only approach (imagine that > the visibility map bit gets cleared partway through the scan). All in > all this seems pretty complicated... So is there a chance of getting bitmap index-only scans? -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Thom Brown <thom@linux.com> writes: > So is there a chance of getting bitmap index-only scans? Don't hold your breath. It seems like a huge increment of complexity for probably very marginal gains. The point of a bitmap scan (as opposed to regular indexscan) is to reduce heap accesses by combining visits to the same page; but it's not clear how useful that is if you're not making heap accesses at all. Robert's sketch of how this could work, full of don't-know-how-to-do-this as it was, still managed to omit a whole lot of reasons why it wouldn't work. Notably the fact that the index AM API for bitmap scans is to return a bitmap, not index-tuple data; and trying to do the latter would break a lot of the performance advantages that exist now for bitmap scans. regards, tom lane
On 19 November 2011 16:08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thom Brown <thom@linux.com> writes: >> So is there a chance of getting bitmap index-only scans? > > Don't hold your breath. It seems like a huge increment of complexity > for probably very marginal gains. The point of a bitmap scan (as > opposed to regular indexscan) is to reduce heap accesses by combining > visits to the same page; but it's not clear how useful that is if > you're not making heap accesses at all. Well consider:

pgbench=# explain analyse select count(*) from pgbench_accounts where aid between 143243 and 374825 or aid between 523242 and 712111;
                                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=83016.38..83016.39 rows=1 width=0) (actual time=1039.282..1039.282 rows=1 loops=1)
   ->  Bitmap Heap Scan on pgbench_accounts  (cost=7934.54..81984.58 rows=412719 width=0) (actual time=243.012..977.946 rows=420453 loops=1)
         Recheck Cond: (((aid >= 143243) AND (aid <= 374825)) OR ((aid >= 523242) AND (aid <= 712111)))
         ->  BitmapOr  (cost=7934.54..7934.54 rows=423802 width=0) (actual time=228.934..228.934 rows=0 loops=1)
               ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..4299.40 rows=235782 width=0) (actual time=134.410..134.410 rows=231583 loops=1)
                     Index Cond: ((aid >= 143243) AND (aid <= 374825))
               ->  Bitmap Index Scan on pgbench_accounts_pkey  (cost=0.00..3428.78 rows=188020 width=0) (actual time=94.520..94.520 rows=188870 loops=1)
                     Index Cond: ((aid >= 523242) AND (aid <= 712111))
 Total runtime: 1039.598 ms
(9 rows)

Since I can't get this to use an index-only scan, it will always visit the heap. Instead I'd be forced to write:

pgbench=# explain analyse select count(*) from (select aid from pgbench_accounts where aid between 143243 and 374825 union all select aid from pgbench_accounts where aid between 523242 and 712111) x;
                                                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=17263.72..17263.73 rows=1 width=0) (actual time=232.053..232.053 rows=1 loops=1)
   ->  Append  (cost=0.00..16204.22 rows=423802 width=0) (actual time=10.925..195.134 rows=420453 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..9015.04 rows=235782 width=0) (actual time=10.925..90.116 rows=231583 loops=1)
               ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..6657.22 rows=235782 width=4) (actual time=10.924..61.822 rows=231583 loops=1)
                     Index Cond: ((aid >= 143243) AND (aid <= 374825))
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..7189.18 rows=188020 width=0) (actual time=0.062..62.953 rows=188870 loops=1)
               ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..5308.98 rows=188020 width=4) (actual time=0.061..40.343 rows=188870 loops=1)
                     Index Cond: ((aid >= 523242) AND (aid <= 712111))
 Total runtime: 232.291 ms
(9 rows)

These 2 queries are equivalent only because the ranges being checked don't overlap, so if arbitrary values were being substituted, and the ranges did happen to overlap, that last method couldn't work. I'd have to use a UNION (rather than UNION ALL) in that particular case, which adds a lot of overhead due to de-duplication. While I accept that maybe adapting the existing bitmap index scan functionality isn't necessarily desirable, would it be feasible to create a corresponding bitmap index-only scan method?
-- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Nov 19, 2011 at 11:22 AM, Thom Brown <thom@linux.com> wrote: > While I accept that maybe adapting the existing bitmap index scan > functionality isn't necessarily desirable, would it be feasible to > create a corresponding bitmap index-only scan method. I've been thinking about this a bit more; I wonder whether we could create yet another type of index scan - let's call it a Streaming Index Only Scan. A streaming index only scan uses a buffer, which holds index tuples and the corresponding CTIDs, and it has some maximum size, probably based on work_mem. Tuples in the buffer are kept sorted by CTID. The scan happens in two alternating phases: buffer fill, and buffer drain. In buffer fill mode, we scan the index and add matching tuples and their CTIDs to the buffer. When the buffer is full or the index AM reports that there are no more tuples in the scan, we switch to buffer drain mode. In buffer drain mode, we repeatedly select a heap block number and attempt to return all buffered tuples on that page. We maintain a counter, LastBlockNumber, initially zero, which tracks the last heap block number so selected. To select the next block number, we choose the first block number greater than or equal to LastBlockNumber referenced by any CTID in the buffer (it should be possible to design the buffer so that this value can be computed quickly); if there are none, we instead choose the first block number referenced by any CTID in the buffer, period. Having selected the block number, we check whether the page is all-visible. If so, we can return all the index tuples from that page without further ado. Otherwise, we fetch the heap block, check visibility for each tuple, and return only those index tuples for which the corresponding heap tuples are visible to the scan. If there's now enough buffer space available to be certain that the next index tuple will fit, we switch back to buffer fill mode; otherwise, we remain in buffer drain mode. As compared with a bitmap index scan, this doesn't have the advantage of being able to combine multiple indexes effectively; I don't really see any way to make that work with the index-only scan concept in general, except perhaps for the special case of a zero-argument aggregate. It also doesn't have the advantage that each heap page will be guaranteed to be visited only once. But in practice duplicate visits to the same page should be uncommon; they'll be avoided when either work_mem is sufficient to buffer the whole scan, or when there's some degree of table clustering with respect to the index. While I'm building castles in the sky, a further idea would be to try to optimize the case where there's a LIMIT node above the scan. If we could pass down a hint indicating how many rows are thought likely to be needed, we could enter buffer drain mode after about that many tuples, instead of waiting until the buffer was full. If the hint is right, we win (and if it's wrong, we can still go back and fetch some more tuples, at a cost of possible performance loss). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Nov 21, 2011 at 6:43 PM, Robert Haas <robertmhaas@gmail.com> wrote: > In buffer fill mode, we scan the index and add matching tuples and > their CTIDs to the buffer. When the buffer is full or the index AM > reports that there are no more tuples in the scan, we switch to buffer > drain mode. Instead you could do the two phases concurrently the way tuplesort implements the tapesort from Knuth. You keep a heap of ctids with an epoch. You fill the heap then you return the first one. Whenever you return one you read the next one and add it to the heap. If it falls before the last returned value you insert it with the next epoch but if it falls afterwards you can insert it into the heap in its correct position. -- greg
On Wed, Dec 14, 2011 at 6:58 AM, Greg Stark <stark@mit.edu> wrote: > On Mon, Nov 21, 2011 at 6:43 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> In buffer fill mode, we scan the index and add matching tuples and >> their CTIDs to the buffer. When the buffer is full or the index AM >> reports that there are no more tuples in the scan, we switch to buffer >> drain mode. > > Instead you could do the two phases concurrently the way tuplesort > implements the tapesort from Knuth. You keep a heap of ctids with an > epoch. You fill the heap then you return the first one. Whenever you > return one you read the next one and add it to the heap. If it falls > before the last returned value you insert it with the next epoch but > if it falls afterwards you can insert it into the heap in its correct > position. Yeah, that's pretty much what I was imagining, although my explanation of it was slightly muddled. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company