Thread: Improving count(*)

Improving count(*)

From
Simon Riggs
Date:
One of the major complaints is always "Select count(*) is slow".

I have a somewhat broadbrush idea as to how we might do this (for larger
tables). 

Previously, we've said that maintaining a running count of the tuples in
a table is hard, or at least costly. However, maintaining a partial
count may be somewhat easier and offer the performance required. 

Bearing in mind that other RDBMSs' approach is to count the number of rows
in an index, their cost is, very roughly, about one tenth that of scanning
the table's blocks - so the cost is far from zero for them. In order
to get similar performance we would need a technique that involved
scanning a small percentage of the data blocks in a table, typically
less than 10%.

Prelims: As tuples are inserted they go into blocks in a varied
sequence. However, since only VACUUM ever reduces the amount of data
in a block, we notice that blocks eventually fill up. If they are in the
FSM, they are removed from it. At that point, we could begin keeping track
of the number of live tuples in the block.

So we could imagine an on-disk data structure that is an array of this
struct:

  blockid - the block in question
  count   - the number of live tuples in that block
  txnid   - the txnid of the last tuple written

When we run a SELECT COUNT(*) we would read this data structure and use
it to build a bitmap of blocks that still need to be scanned to get the
count(*) result. So this would give us a partially cached result for
count(*), which then would require scanning only the last few blocks of
a table to get at the correct answer.
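To make the shape of that concrete, here is a purely illustrative SQL sketch of the bookkeeping (the real structure would be an internal array in the backend, not a table; all names here are invented):

```sql
-- Hypothetical bookkeeping, one entry per "full" heap block:
-- mirrors the (blockid, count, txnid) struct described above.
CREATE TABLE block_counts (
    blockid    bigint PRIMARY KEY,   -- heap block number
    live_count integer NOT NULL,     -- live tuples when the block filled
    last_txnid bigint NOT NULL       -- txnid of the last tuple written
);

-- count(*) then becomes: cached totals for tracked blocks, plus a real
-- scan of only the blocks that are absent from block_counts, e.g.
--   SELECT (SELECT coalesce(sum(live_count), 0) FROM block_counts)
--        + (tuples counted by scanning the remaining active blocks);
```

On a mostly append-only table almost every block would be tracked, so only the tail of the table would need a physical scan.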

This is beginning to sound very much like the same sort of solution as
the one recently proposed for VACUUM: we keep track of the status of
each block, then use it as a way of speeding things up.

For VACUUM we want to keep a data structure that notes which blocks
require vacuuming. For count(*) we want to keep track of which blocks do
not require vacuuming, or nearly so. There are some blocks in the middle
that wouldn't go on either list, true.

So, why not combine the two purposes into a single solution?

We would be able to manage at least 800 data blocks for each block of
this structure, which would mean 1.25 MB per GB of data. For tables with
a reasonably non-random block write pattern the key parts of that could
reside in memory with reasonable ease even for busy tables. In those
cases, it also seems like this technique might lead to the need to scan
only a small percentage of heap blocks - so this would give roughly
equivalent performance to other RDBMS for SELECT count(*). If it's a data
structure that we were thinking of maintaining for VACUUM anyway, this
improves the value of the cost we would have to pay to maintain such a
cache data structure.

This is thinking only, I'm not proposing this as a fully worked proposal
nor am I pursuing this myself. I can see immediately that there are some
major obstacles in the way, like MVCC, autovacuum, overheads etc but it
seems worth pointing out a possible angle of attack, since this looks
like it might hit two birds with one stone.

I'm not aiming to start "open season" on crazy ideas for this; this idea
may take months to ruminate into a solution.

Best Regards, Simon Riggs




Re: Improving count(*)

From
Martijn van Oosterhout
Date:
On Thu, Nov 17, 2005 at 07:28:10PM +0000, Simon Riggs wrote:
> One of the major complaints is always "Select count(*) is slow".
>
> I have a somewhat broadbrush idea as to how we might do this (for larger
> tables).

It's an interesting idea, but you still run into the issue of
visibility. If two people start a transaction, one of them inserts a
row and then both run a select count(*), they should get different
answers. I just don't see a way that your suggestion could possibly
lead to that result...

There is no unique answer to count(*), it all depends on who is looking
(sounds like relativity :) ). If you can sort that, you're well over
90% of the way.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Improving count(*)

From
"Jonah H. Harris"
Date:
Simon,
 
Nice suggestion, I think it's workable but (like all other methods) has some technical/pseudo-political challenges.
 
I'm still voting for my old, "Much Ado About COUNT(*)" topic; adding visibility to the indexes and counting them like the other RDBMS vendors.  True, it would add storage overhead that several people don't want, but such is the life of the COUNT(*) discussion for PostgreSQL...
 
-Jonah

 



Re: Improving count(*)

From
Rod Taylor
Date:
On Thu, 2005-11-17 at 20:38 +0100, Martijn van Oosterhout wrote:
> On Thu, Nov 17, 2005 at 07:28:10PM +0000, Simon Riggs wrote:
> > One of the major complaints is always "Select count(*) is slow".
> > 
> > I have a somewhat broadbrush idea as to how we might do this (for larger
> > tables). 
> 
> It's an interesting idea, but you still run into the issue of
> visibility. If two people start a transaction, one of them inserts a
> row and then both run a select count(*), they should get different
> answers. I just don't see a way that your suggestion could possibly
> lead to that result...

The instant someone touches a block it would no longer be marked as
frozen (vacuum or analyze or other is not required) and count(*) would
visit the tuples in the block making the correct decision at that time.

-- 



Re: Improving count(*)

From
Martijn van Oosterhout
Date:
On Thu, Nov 17, 2005 at 02:55:09PM -0500, Rod Taylor wrote:
> On Thu, 2005-11-17 at 20:38 +0100, Martijn van Oosterhout wrote:
> > It's an interesting idea, but you still run into the issue of
> > visibility. If two people start a transaction, one of them inserts a
> > row and then both run a select count(*), they should get different
> > answers. I just don't see a way that your suggestion could possibly
> > lead to that result...
>
> The instant someone touches a block it would no longer be marked as
> frozen (vacuum or analyze or other is not required) and count(*) would
> visit the tuples in the block making the correct decision at that time.

Hmm, so the idea would be that if a block no longer contained any
tuples hidden from any active transaction, you could store the count
and skip reading that page. Of course, as soon as someone UPDATEs a
tuple, that block comes into play again because it would be visible
from some but not other transactions. Then again, for count(*) UPDATEs
are irrelevant.

The other way, storing visibility in the index seems awfully expensive,
since any changes to the tuple would require updating the index. Still,
people have thought about this already, I'm sure the issues are
known...

Have a nice day,

Re: Improving count(*)

From
Simon Riggs
Date:
On Thu, 2005-11-17 at 14:46 -0500, Jonah H. Harris wrote:
> Nice suggestion, I think it's workable but (like all other methods)
> has some technical/pseudo-political challenges.
>  
> I'm still voting for my old, "Much Ado About COUNT(*)" topic; adding
> visibiility to the indexes and counting them like the other RDBMS
> vendors.  True, it would add storage overhead that several people
> don't want, but such as the life of the COUNT(*) discussion for
> PostgreSQL...

[When no idea is good, we take the least-worst path. When we only have
one bad idea, nobody does anything. So we need a couple of ugly but
workable alternatives to flush out the one we will pick to resolve
things.]

As Martijn points out *any* solution must take account of visibility
rules. So abstracting from both these ideas gives shape to the solution,
which must be:
- a data structure smaller than the table itself
- including visibility data, explicitly/implicitly, exactly/lossily
- must serve multiple purposes to ensure the overhead of maintaining the
structure is amortised across many potential benefits, since various
needs share similar solution requirements

Would having visibility on an index be of use to a VACUUM?
Yes, I guess it could be. If we knew when the table was last vacuumed,
we could build a bitmap of changed blocks by scanning the index. We
would only need visibility on *one* of the indexes on a table, so
perhaps index visibility could be an option rather than a
one-size-fits-all. 

Adding visibility to an index would add substantial bulk to any index.
If we could do this at the same time as adding leading key, full field
compression (*not* prefix compression), then it might be worth doing. 

I would also note that DELETE would need to touch all visible index
rows, which currently is not required for btree indexes. (But as we
already noted, any solution must include visibility data and so any
solution must update some data structure on delete).

Index-only plans could help with various GROUP BY and join queries also,
so it certainly is attractive, though costly.

Best Regards, Simon Riggs



Re: Improving count(*)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Bearing in mind other RDBMS' approach is to count the number of rows in
> an index, their cost is probably about the same as scanning table
> blocks/10 very roughly - so the cost is far from zero for them.

Really?  The impression I get is that people who ask for this expect the
answer to be instantaneous, ie they think the system will maintain a
running net total for each table.  (In a non-MVCC system that isn't
necessarily an unreasonable thing to do.)

I really can't get excited about adding this level of complexity and
overhead to the system just to support COUNT(*)-with-no-WHERE slightly
better than we do now.

The triggers-and-deltas approach previously proposed seems considerably
more attractive to me, because (1) it's not invasive and (2) you only
have to pay the overhead on tables where you want it.
        regards, tom lane
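For context, the triggers-and-deltas approach is usually sketched along these lines (illustrative only; the table, function, and trigger names are invented). Because each transaction's +1/-1 delta rows are themselves ordinary MVCC rows, summing the visible deltas gives a visibility-correct count:

```sql
-- Hypothetical delta table: one +1/-1 row per insert/delete on "tab".
CREATE TABLE tab_count_deltas (delta integer NOT NULL);

CREATE FUNCTION tab_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO tab_count_deltas VALUES (+1);
    ELSE  -- DELETE
        INSERT INTO tab_count_deltas VALUES (-1);
    END IF;
    RETURN NULL;  -- AFTER trigger; return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tab_count AFTER INSERT OR DELETE ON tab
    FOR EACH ROW EXECUTE PROCEDURE tab_count_trig();

-- The count is then just the sum of the visible delta rows:
--   SELECT coalesce(sum(delta), 0) FROM tab_count_deltas;
-- A periodic job would collapse old deltas into a single base row.
```

This is also why it answers Simon's later question about MVCC: the deltas are seen (or not seen) under exactly the same snapshot rules as the rows they count.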


Re: Improving count(*)

From
Simon Riggs
Date:
On Thu, 2005-11-17 at 16:34 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Bearing in mind other RDBMS' approach is to count the number of rows in
> > an index, their cost is probably about the same as scanning table
> > blocks/10 very roughly - so the cost is far from zero for them.
> 
> Really?  The impression I get is that people who ask for this expect the
> answer to be instantaneous, ie they think the system will maintain a
> running net total for each table.  (In a non-MVCC system that isn't
> necessarily an unreasonable thing to do.)

Yeh. I think Informix keeps a running total, IIRC, but certainly Oracle
and various others do not.

People probably have given the impression that count(*) is
instantaneous, but that doesn't mean it actually is - they're just
talking up the problems of pg.

> I really can't get excited about adding this level of complexity and
> overhead to the system just to support COUNT(*)-with-no-WHERE slightly
> better than we do now.

Well, I was pointing out the cross-over with the requirements for a
faster VACUUM also. Taken together, it might be a winner.

> The triggers-and-deltas approach previously proposed seems considerably
> more attractive to me, because (1) it's not invasive and (2) you only
> have to pay the overhead on tables where you want it.

This would need to be optional whichever way we did it, just as with
the creation of an index. I also think that taking the Oracle path of
adding new features in functions/packages would be a good thing, rather
than over-burdening the parser constantly with new syntax to cope with.

Did the triggers-and-deltas approach cope with MVCC correctly?

Best Regards, Simon Riggs




Re: Improving count(*)

From
Martijn van Oosterhout
Date:
On Thu, Nov 17, 2005 at 09:34:08PM +0000, Simon Riggs wrote:
> Adding visibility to an index would add substantial bulk to any index.
> If we could do this at the same time as adding leading key, full field
> compression (*not* prefix compression), then it might be worth doing.

I think the single biggest problem with visibility-in-index is that
there is no link from the tuple to the index. So if you update a tuple,
the only way to update the index is to start from the top and go down
until you find it. If your table/index is of any size, you can imagine
the overhead will kill you.

Now, let's say you add a field to the tuple which gives you the position
of the index entry. You can only reasonably do this for one index, say the
primary key. Now you have a two-way link and the updating becomes much
quicker, at the cost of even more overhead.

Doing it only for one index per table may be sensible anyway since you
don't really want to store visibility any more times than necessary.

> I would also note that DELETE would need to touch all visible index
> rows, which currently is not required for btree indexes. (But as we
> already noted, any solution must include visibility data and so any
> solution must update some data structure on delete).

Remember, UPDATE = DELETE + INSERT, so you have to handle all updates
too. Inserts are the only easy case (well, except for the fact that they
have to point to each other; locking nastiness).

> Index-only plans could help with various GROUP BY and join queries also,
> so it certainly is attractive, though costly.

Only in cases where you don't need the data (ie EXISTS), otherwise you
still need the tuple.

Have a nice day,

Re: Improving count(*)

From
"Kevin Grittner"
Date:
In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
Server) the leaf level of the narrowest index on the table is scanned,
following a linked list of leaf pages.  Leaf pages can be pretty dense
under Sybase, because they do use prefix compression.  A count(*)
on a table with 100 million rows is going to take a few minutes, but it
is going to be at least an order of magnitude faster than a data page
scan -- maybe two orders of magnitude faster.

What I don't understand is why people need to do such things so
frequently that it's a major issue, rather than an occasional
annoyance.  A solution which not only helped the count(*) issue
but also allowed index scans to skip the trip to the data page to
see if it's an active version seems like it would boost performance
overall.  As pointed out elsewhere, it could also allow new
techniques for vacuum which could be beneficial.

My view is that when tables get so big that a count(*) takes that
much time, you don't typically need an EXACT count anyway --
you could normally check the statistics from your nightly analyze.

-Kevin





Re: Improving count(*)

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Now, let's say you add a field to the tuple which gives you the position of
> the index entry. You can only reasonably do this for one index, say the
> primary key. Now you have a two-way link and the updating becomes much
> quicker, at the cost of even more overhead.

I think this is fairly infeasible --- consider what it does to the cost
and (lack of) atomicity of an index page split, for instance.
        regards, tom lane


Re: Improving count(*)

From
Simon Riggs
Date:
On Thu, 2005-11-17 at 16:30 -0600, Kevin Grittner wrote:
> In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> Server) the leaf level of the narrowest index on the table is scanned,
> following a linked list of leaf pages.  Leaf pages can be pretty dense
> under Sybase, because they do use prefix compression.  A count(*)
> on a table with 100 million rows is going to take a few minutes, but it
> is going to be at least an order of magnitude faster than a data page
> scan -- maybe two orders of magnitude faster.
> 
> What I don't understand is why people need to do such things so
> frequently that it's a major issue, rather than an occasional
> annoyance.  

Agreed, completely. (And it galls me to agree with multiple, potentially
opposed opinions on my own thread).

The trouble is, people moan about it constantly. Perhaps we should stick
to our guns and say: why do you care? From here, I think we should say,
"show me an application package that needs this so badly we'll change
PostgreSQL just for them". Prove it and we'll do it. Worded kinda politely
in the TODO, but I think we should put something in there that says
"things we haven't yet had any good reason to improve".

> A solution which not only helped the count(*) issue
> but also allowed index scans to skip the trip to the data page to
> see if it's an active version seems like it would boost performance
> overall.  As pointed out elsewhere, it could also allow new
> techniques for vacuum which could be beneficial.
> 
> My view is that when tables get so big that a count(*) takes that
> much time, you don't typically need an EXACT count anyway --
> you could normally check the statistics from your nightly analyze.

Amen.

From here, another proposal. We have a GUC called count_uses_estimate
that is set to off by default. If set to true, then a count(*) will use
the planner logic to estimate the number of rows in the table and return
that as the answer, rather than actually counting the rows. Unless analyze
statistics are not available, in which case it does the real count.
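For comparison, the planner's estimate is already reachable by hand via the statistics ANALYZE maintains; note that reltuples in pg_class is an estimate, not a transaction-consistent count ('mytable' below is a placeholder):

```sql
-- Approximate row count from ANALYZE statistics (may be stale):
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'mytable';
```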

Best Regards, Simon Riggs



Re: Improving count(*)

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> From here, another proposal. We have a GUC called count_uses_estimate
> that is set to off by default. If set to true, then a count(*) will use
> the planner logic to estimate number of rows in the table and return
> that as the answer, rather than actually count the row.

Ugh.  Why not just provide a function to retrieve the planner estimate,
but *not* call it count(*)?  It would fit nicely with the contrib/dbsize
stuff (or I should say, the stuff formerly in dbsize...)
        regards, tom lane


Re: Improving count(*)

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Thursday, November 17, 2005 4:17 PM
> To: Simon Riggs
> Cc: Kevin Grittner; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Improving count(*)
>
> Simon Riggs <simon@2ndquadrant.com> writes:
> > From here, another proposal. We have a GUC called count_uses_estimate
> > that is set to off by default. If set to true, then a count(*) will use
> > the planner logic to estimate number of rows in the table and return
> > that as the answer, rather than actually count the row.
>
> Ugh.  Why not just provide a function to retrieve the planner estimate,
> but *not* call it count(*)?  It would fit nicely with the contrib/dbsize
> stuff (or I should say, the stuff formerly in dbsize...)


An estimate of the number of rows would be nice to have.
A function called cardinality_estimate(), or something of that nature,
seems more natural than count(*).


Re: Improving count(*)

From
Gavin Sherry
Date:
On Fri, 18 Nov 2005, Simon Riggs wrote:

> From here, another proposal. We have a GUC called count_uses_estimate
> that is set to off by default. If set to true, then a count(*) will use
> the planner logic to estimate number of rows in the table and return
> that as the answer, rather than actually count the row. Unless analyze
> statistics are not available, in which case it does the real count.

I'm finishing off a tablesample patch a grad student on #postgresql was
working on.

template1=# select count(*)*100 from a tablesample system(1) repeatable (2);
 ?column?
----------
  8371100
(1 row)

Time: 6366.757 ms
template1=# select count(*)*50 from a tablesample system(2) repeatable (11);
 ?column?
----------
  8453550
(1 row)

Time: 10521.871 ms
template1=# select count(*)*10 from a tablesample system(10) repeatable (3);
 ?column?
----------
  8314350
(1 row)

Time: 28744.498 ms
template1=# select count(*) from a;
  count
---------
 8388608
(1 row)

Time: 33897.857 ms


Seems like a better solution. I can finish the patch pretty soon. I need
to contact the original author, who has disappeared, but I'll send it over
to you.

Gavin


Re: Improving count(*)

From
mark@mark.mielke.cc
Date:
Probably obvious, and already mentioned: count(*) isn't the only query
that would benefit from visibility information in the index. It's
rather unfortunate that MVCC requires table lookups when all values
queried or matched are found in the index key itself. The idea of an
all-index table is appealing to me for some applications (Oracle
supports this, I believe?). In effect, a sorted and searchable table
that doesn't double in size just because it is indexed.

So what's the real cost here? Larger index size to include the
visibility information (optional?) and UPDATE/DELETE need to
set expirations on the index rows as well as the table rows,
for only the indexes that have visibility information? A flag
in the table structure in memory to know whether the table has
any indexes with visibility information that require updating?

It doesn't sound that bad to me. Perhaps I just don't know better? :-)

The per-block counts idea seems useful to me. A database that
frequently modifies every page of an index would seem
inefficient. What if the per-block counts were kept, but associated
with index blocks instead of table blocks, for indexes that maintain
visibility information? The per-block counts only need to be able to
provide enough information for the reader to know whether the count is
valid, or invalid, perhaps updated at vacuum time?

The idea of a partial index, that keeps this information (visibility
information + per-block live row count cache) seems fascinating to
me. Many new optimization opportunities to hang myself with... :-)

Maybe PostgreSQL could be FASTER than other databases?

Or are we just dreaming?

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...
                          http://mark.mielke.cc/



Re: Improving count(*)

From
Jeff Davis
Date:
mark@mark.mielke.cc wrote:
> Probably obvious, and already mentioned, count(*) isn't the only query
> that would benefit from visibility information in the index. It's
> rather unfortunate that MVCC requires table lookups, when all values
> queried or matched are found in the index key itself. The idea of an
> all index table is appealing to me for some applications (Oracle
> supports this, I believe?). In effect, a sorted, and searchable table,
> that doesn't double in size, just because it is indexed.
> 

I've been thinking about that lately also. It seems like it would be
useful to have the entire table in a Btree in some situations, but there
are some drawbacks:
(1) probably hard to implement
(2) only works with one key
(3) since tuples would not be at a fixed location on disk, you can't
just use a normal secondary index. The secondary index would have to
point to the key of the tuple in the Btree table, and then do another
lookup in the actual table.
(4) of course, insert performance goes down due to btree maintenance

Range queries (or queries on equality when there are many duplicates)
might benefit a lot. But I would think that in many situations, the fact
that you could only have one key indexed on the table would counteract
those benefits.

I haven't noticed any recent comments by the hackers on this subject.
Maybe they have some more details? I think MS SQL has something similar
to that also.

Regards,
Jeff Davis


Re: Improving count(*)

From
Greg Stark
Date:
I think the important thing to keep track of is a single bit:
 Which of the following applies?
 a) all of the tuples in this block are visible
 b) at least one tuple in this block is in-doubt and may need to be vacuumed

That isn't enough to calculate count(*) on its own but it means you could scan
an index like any other database and avoid checking every single tuple. If the
tuple lies in a block that is known not to contain any in-doubt records then
the tuple can be counted immediately.

This has the advantage that it helps with a lot more cases than a simple
"select count(*) from tab". As Tom pointed out that case can be tackled more
directly with a O(1) solution anyways. More complex cases are where fast index
scans are really important.

So you could do "SELECT count(*) FROM tab WHERE a > ?" and have it scan an
index on <a> without having to check the visibility of every single tuple. It
only has to check the visibility of tuples that lie on blocks that contain at
least one in-doubt tuple.

You could even imagine using the new bitmap index scan machinery to combine
these bits of information too.

And this is exactly the same information that vacuum needs. Once vacuum has
run and cleaned out a block it knows whether there are any records that are
still in-doubt or whether every record it left is universally visible. It can
note that and allow future vacuums to skip that block if no deletes or inserts
have changed that bit since.

-- 
greg



Re: Improving count(*)

From
Simon Riggs
Date:
On Fri, 2005-11-18 at 11:51 +1100, Gavin Sherry wrote:
> Seems like a better solution. I can finish the patch pretty soon. I need
> to contact the original author, who has disappeared, but I'll send it over
> to you.

Sounds good. I wondered where he'd gone to.

Sampling would be useful for 8.2.

Best Regards, Simon Riggs



Re: Improving count(*)

From
Varun Kacholia
Date:
> > Seems like a better solution. I can finish the patch pretty soon. I need
> > to contact the original author, who has disappeared, but I'll send it over
> > to you.

> Sounds good. I wondered where he'd gone to.

Still here :-)
Just got so swamped with work that the tablesample
patch got paged out. Gavin has a working version of the patch.
I can give a hand if need be.


Thanks


Re: Improving count(*)

From
"Zeugswetter Andreas DCP SD"
Date:
> > The instant someone touches a block it would no longer be marked as
> > frozen (vacuum or analyze or other is not required) and count(*) would
> > visit the tuples in the block making the correct decision at that time.
>
> Hmm, so the idea would be that if a block no longer contained any
> tuples hidden from any active transaction, you could store the count
> and skip reading that page.

I like the approach of Informix and MaxDB, which can tell the count(*)
instantly without looking at index leaf or data pages.

Imho we could do that with a central storage of count(*) even with mvcc.
The idea is a base value for count(*) and corrective values per open xid.
To tell the count you add all corrective values whose xid is visible in
the snapshot. Each backend is responsible for compacting xid counters
below the min open xid. Periodically (e.g. at checkpoint time) you
compact (aggregate committed xid counters into the base value) and
persist the count.

Since that costs, I guess I would make it optional and combine it with
materialized views that are automatically used at runtime, and can at the
same time answer other aggregates or aggregates for groups:

  create materialized view xx_agg enable query rewrite as
    select count(*), sum(col1) from xx [group by col2];

Your page flag storage could possibly also be used for btree access, to
short-circuit the heap visibility lookup (e.g. for pages where all rows
are visible (vacuumed)). I think that your proposal is too complex if it
is not also used to improve other performance areas.

Andreas


Re: Improving count(*)

From
Tino Wildenhain
Date:
Zeugswetter Andreas DCP SD schrieb:
>>> The instant someone touches a block it would no longer be marked as
>>> frozen (vacuum or analyze or other is not required) and count(*) would
>>> visit the tuples in the block making the correct decision at that time.
>>
>> Hmm, so the idea would be that if a block no longer contained any
>> tuples hidden from any active transaction, you could store the count
>> and skip reading that page.
>
> I like the approach of informix and maxdb, that can tell the count(*)
> instantly without looking at index leaf or data pages.
>
> Imho we could do that with a central storage of count(*) even with mvcc.
> The idea is a base value for count(*) and corrective values per open
> xid. To tell the count you add all corrective values whose xid is
> visible in snapshot. Each backend is responsible for compacting xid
> counters below min open xid. Periodically (e.g. at checkpoint time) you
> compact (aggregate committed xid counters into the base value) and
> persist the count.
>
> Since that costs, I guess I would make it optional and combine it with
> materialized views that are automatically used at runtime, and can at
> the same time answer other aggregates or aggregates for groups.
> create materialized view xx_agg enable query rewrite as select count(*),
> sum (col1) from xx [group by col2];
>

I wonder how many times you really need a count(*) w/o where clause.
If I understand you correctly you are trying to optimize just this
one case?

Regards
Tino


Re: Improving count(*)

From
"Zeugswetter Andreas DCP SD"
Date:
> > Since that costs, I guess I would make it optional and combine it with
> > materialized views that are automatically used at runtime, and can at
> > the same time answer other aggregates or aggregates for groups.
> > create materialized view xx_agg enable query rewrite as select
> > count(*), sum (col1) from xx [group by col2];
>
> I wonder how many times you really need a count(*) w/o where clause.
> If I understand you correctly you are trying to optimize just this one
> case?

I guess you have not read to the end. A materialized view with a group
by, as indicated in the example, is able to answer all sorts of queries
with or without where clauses (e.g. ... where col2 = 'x').

Andreas


Re: Improving count(*)

From
Tino Wildenhain
Date:
Zeugswetter Andreas DCP SD schrieb:
>>> Since that costs, I guess I would make it optional and combine it
>>> with materialized views that are automatically used at runtime, and
>>> can at the same time answer other aggregates or aggregates for groups.
>>> create materialized view xx_agg enable query rewrite as select
>>> count(*), sum (col1) from xx [group by col2];
>>
>> I wonder how many times you really need a count(*) w/o where clause.
>> If I understand you correctly you are trying to optimize just this
>> one case?
>
> I guess you have not read to the end. A materialized view with a group
> by 
> as indicated in the example is able to answer all sorts of queries
> with or without where clauses ( e.g. ... where col2 = 'x').

But wouldn't that mean I need a materialized view (do we have
that now or do I need to play the usual games with triggers?)
for every possible where condition?


Re: Improving count(*)

From
Steve Wampler
Date:
Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> 
>>From here, another proposal. We have a GUC called count_uses_estimate
>>that is set to off by default. If set to true, then a count(*) will use
>>the planner logic to estimate number of rows in the table and return
>>that as the answer, rather than actually count the row.
> 
> 
> Ugh.  Why not just provide a function to retrieve the planner estimate,
> but *not* call it count(*)?  It would fit nicely with the contrib/dbsize
> stuff (or I should say, the stuff formerly in dbsize...)

That would completely remove my need for a fast count() - all I want
is a way to quickly estimate table sizes for an operator's display.  Tom's
suggestion would provide exactly that.

-- 
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.


Re: Improving count(*)

From
Richard Huxton
Date:
Simon Riggs wrote:
> One of the major complaints is always "Select count(*) is slow".

Although there seem to have been plenty of ideas on this, they all seem 
to just provide a solution for the "whole table" case. It might be that 
the solution provides other benefits, but for this one case it does seem 
like a lot of work.

Might it be possible to apply rule-style rewriting to a clause of an
ordinary select query? That is, is it prohibitively expensive to get PG
to recognise
   SELECT count(*) FROM big_table
and replace it with
   SELECT sum(summary_count) FROM my_materialised_view

This should allow you to have where-clauses and apply to a range of
cases. What I fear is that checking to see if the rule applies will
cost too much on all those queries where it doesn't apply.

-- 
  Richard Huxton
  Archonet Ltd


Re: Improving count(*)

From
Tom Lane
Date:
Richard Huxton <dev@archonet.com> writes:
> Might it be possible to apply rule-style rewriting to a clause of an 
> ordinary select query? That is, is it prohibitively expensive to get PG 
> to recognise
>    SELECT count(*) FROM big_table
> and replace it with
>    SELECT sum(summary_count) FROM my_materialised_view

> This should allow you to have where-clauses and apply to a range of 
> cases. What I fear is that checking to see if the rule applies will cost 
>   too much on all those queries where it doesn't apply.

There is already code in the optimizer that does similar rewriting
for min/max queries.   However, that's a hard-wired transformation.
I don't see any very simple way to provide a user-configurable
equivalent.
        regards, tom lane


Re: Improving count(*)

From
"Merlin Moncure"
Date:
> In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> Server) the leaf level of the narrowest index on the table is scanned,
> following a linked list of leaf pages.  Leaf pages can be pretty dense
> under Sybase, because they do use prefix compression.  A count(*)
> on a table with 100 million rows is going to take a few minutes, but
> it is going to be at least an order of magnitude faster than a data
> page scan -- maybe two orders of magnitude faster.

MS SQL server (pre 2005) is not an MVCC database, so it's not apples to
apples with pg.  Many of the people who wander on this list and complain
about count(*) either come from one of those or some other non-MVCC
database or worse, a flat-file xbase type system.  A performance
comparison between MS 2005 and pg would be much more interesting.
Personally, I don't know what all the fuss is about [although I wouldn't
complain about an optimization ;)].

Merlin



Re: Improving count(*)

From
Alvaro Herrera
Date:
Merlin Moncure wrote:
> > In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> > Server) the leaf level of the narrowest index on the table is scanned,
> > following a linked list of leaf pages.  Leaf pages can be pretty dense
> > under Sybase, because they do use prefix compression.  A count(*)
> > on a table with 100 million rows is going to take a few minutes, but
> > it is going to be at least an order of magnitude faster than a data
> > page scan -- maybe two orders of magnitude faster.
> 
> MS SQL server (pre 2005) is not an MVCC database, so it's not apples to
> apples with pg.

Oh, also it was mentioned on pgsql-advocacy that InnoDB is MVCC.  If
that's the case, I wonder how they do the count(*) thing.  Is it fast?


-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Improving count(*)

From
Gregory Maxwell
Date:
On 11/18/05, Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
> > In Sybase ASE (and I'm pretty sure the same is true in Microsoft SQL
> > Server) the leaf level of the narrowest index on the table is scanned,
> > following a linked list of leaf pages.  Leaf pages can be pretty dense
> > under Sybase, because they do use prefix compression.  A count(*)
> > on a table with 100 million rows is going to take a few minutes, but
> > it is going to be at least an order of magnitude faster than a data
> > page scan -- maybe two orders of magnitude faster.
>
> MS SQL server (pre 2005) is not an MVCC database, so it's not apples to
> apples with pg.  Many of the people who wander on this list and complain
> about count(*) either come from one of those or some other non-MVCC
> database or worse, a flat-file xbase type system.  A performance
> comparison between MS 2005 and pg would be much more interesting.
> Personally, I don't know what all the fuss is about [although I wouldn't
> complain about an optimization ;)].

count(*) WHERE 1 is indeed a corner case that few if any real
applications should care about... If we had to choose between
improving that case and preserving the performance and maintainability
of PG then I think the discussion would already be over.

However, some great ideas have been proposed here which would not only
help in that case but would otherwise be quite useful.

* Inclusion of an 'MVCC inflight' bit in indexes which would allow
skipping MVCC checks in clumps of an index scan which have no pending
changes. This would further close the performance gap between PG and
non-MVCC databases for some workloads.
* Introduction of high-performance table sampling, which would be
useful in many applications (including counting where there is a where
clause) as well as for testing and ad hoc queries.
and
* An estimate_count() that provides the planner estimate, which would
return right away and provide what is really needed most of the time
people try to count(*) on a large table.

So, while this is a silly case to optimize for, it's one where it
appears that the proposed solutions will make PG better all around.


Re: Improving count(*)

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Richard Huxton <dev@archonet.com> writes:
> > Might it be possible to apply rule-style rewriting to a clause of an 
> > ordinary select query? That is, is it prohibitively expensive to get PG 
> > to recognise
> >    SELECT count(*) FROM big_table
> > and replace it with
> >    SELECT sum(summary_count) FROM my_materialised_view
> 
> > This should allow you to have where-clauses and apply to a range of 
> > cases. What I fear is that checking to see if the rule applies will cost 
> >   too much on all those queries where it doesn't apply.
> 
> There is already code in the optimizer that does similar rewriting
> for min/max queries.   However, that's a hard-wired transformation.
> I don't see any very simple way to provide a user-configurable
> equivalent.

I guess there must be a query-rewriting mechanism for implementing
materialized views.  With that in place we may be able to implement this
other thing ...  Is anybody working on materialized views?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Improving count(*)

From
mark@mark.mielke.cc
Date:
On Fri, Nov 18, 2005 at 03:46:42PM +0000, Richard Huxton wrote:
> Simon Riggs wrote:
> >One of the major complaints is always "Select count(*) is slow".
> Although there seem to have been plenty of ideas on this they all seem 
> to just provide a solution for the "whole table" case. It might be that 
> the solution provides other benefits, but for this one case it does seem 
> like a lot of work.

Or, it wasn't explained properly as to how the WHERE clause would
function.

The solution involving an index that has visibility information should
work fine with a WHERE clause. Only index rows that match the clause
are counted.

A solution enhancing the above mentioned indexes, to maintain a count
for whole index blocks, would allow whole index blocks that satisfy
the WHERE clause to be counted, assuming the whole index block is
visible in the current transaction.

Not to say these are the best ideas, or the only ideas - but it isn't
true that most of the solutions presented apply only to the 'whole
table' case. The *simplest* solutions apply only to the 'whole table'
case. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring
 them all and in the darkness bind them...

                          http://mark.mielke.cc/
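
The per-block counting idea in mark's message can be sketched roughly as
follows. Block layout and names are invented for illustration: blocks
known to be entirely visible contribute their stored count wholesale,
and only the remaining blocks need per-tuple visibility checks:

```python
# Sketch: each block carries a tuple count and an "all visible" flag;
# count(*) sums whole-block counts where possible and falls back to
# per-tuple visibility checks elsewhere.

def count_tuples(blocks, tuple_visible):
    """blocks: dicts with 'count', 'all_visible', 'tuples';
    tuple_visible: per-tuple MVCC visibility predicate."""
    total = 0
    for blk in blocks:
        if blk["all_visible"]:
            total += blk["count"]        # no per-tuple check needed
        else:
            total += sum(1 for t in blk["tuples"] if tuple_visible(t))
    return total
```

With a WHERE clause, the same structure works over index blocks whose
entries all satisfy the clause, as the message suggests.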



Re: Improving count(*)

From
Josh Berkus
Date:
Alvaro,

> I guess there must be a query-rewriting mechanism for implementing
> materialized views.  With that in place we may be able to implement this
> other thing ...  Is anybody working on materialized views?

I have a bundle of academic code designed to do exactly this, if any hacker
wants to take on the task of getting it into production shape.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Materialized views (Was Re: Improving count(*))

From
Heikki Linnakangas
Date:
On Fri, 18 Nov 2005, Josh Berkus wrote:

> Alvaro,
>
>> I guess there must be a query-rewriting mechanism for implementing
>> materialized views.  With that in place we may be able to implement this
>> other thing ...  Is anybody working on materialized views?
>
> I have a bundle of academic code designed to do exactly this, if any hacker
> wants to take on the task of getting it into production shape.

Could you post it to the list? I'd be interested to take a look, though 
I'm afraid I don't have the time to work on it.

I've been reading some papers on materialized views lately. Here's some 
interesting ones:

Blakeley, Larson, Tompa: Efficiently Updating Materialized Views
http://tinyurl.com/8hqeo
Describes a fairly simple algorithm for keeping select-project-join views 
up to date.

Vista: View Maintenance in Relational and Deductive Databases by 
Incremental Query Evaluation
http://tinyurl.com/exb8o
A survey of various algorithms.

Gupta, Mumick, Subrahmanian: Maintaining Views Incrementally
http://portal.acm.org/citation.cfm?id=170066
Extended abstract of a paper that presents two algorithms: one similar to 
the Blakeley paper, and another one that can also handle recursion.

Ross, Srivastava, Sudarshan: Materialized View Maintenance and Integrity 
Constraint Checking: Trading Space for Time
http://citeseer.ist.psu.edu/ross96materialized.html
Describes how materialized views can be used for implementing database 
assertions.

- Heikki


Re: Materialized views (Was Re: Improving count(*))

From
Nicolas Barbier
Date:
(CCed to the matview-devel mailing list)

On 11/19/05, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> I've been reading some papers on materialized views lately. Here's some
> interesting ones:

(snip)

You might want to take a look at the pages that I set up to track the
progress on my master's thesis:

<url:http://www.nicolas.barbier.easynet.be/itsme/thesis/>

especially the literature page:

<url:http://www.nicolas.barbier.easynet.be/itsme/thesis/literature/>

IMO, GL95, Qua97 and GM99 are the ones that are most applicable to
view maintenance with bag-semantics (thus, SQL). You should be able to
find all these papers with Google (Scholar) in case my computer is
shut down, otherwise you can download them directly from me.

Greetings,
Nicolas

--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html


Re: Materialized views (Was Re: Improving count(*))

From
Josh Berkus
Date:
Heikki,

> Could you post it to the list? I'd be interested to take a look, though
> I'm afraid don't have the time to work on it.

Yeah, I should put it up on pgFoundry.  I'm not sure exactly where, though -- 
I don't want to launch a new project if it's not going to take off.   Maybe 
Bizgres.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Materialized views (Was Re: Improving count(*))

From
Heikki Linnakangas
Date:
On Sat, 19 Nov 2005, Josh Berkus wrote:

>> Could you post it to the list? I'd be interested to take a look, though
>> I'm afraid don't have the time to work on it.
>
> Yeah, I should put it up on pgFoundry.  I'm not sure exactly where, though --
> I don't want to launch a new project if it's not going to take off.   Maybe
> Bizgres.

There's also a "matview" project on gborg.

- Heikki


Re: Materialized views (Was Re: Improving count(*))

From
Heikki Linnakangas
Date:
On Sat, 19 Nov 2005, Nicolas Barbier wrote:

> You might want to take a look at the pages that I set up to track the
> progress on my master's thesis:
>
> <url:http://www.nicolas.barbier.easynet.be/itsme/thesis/>
>
> especially the literature page:
>
> <url:http://www.nicolas.barbier.easynet.be/itsme/thesis/literature/>
>
> IMO, GL95, Qua97 and GM99 are the ones that are most applicable to
> view maintenance with bag-semantics (thus, SQL). You should be able to
> find all these papers with Google (Scholar) in case my computer is
> shut down, otherwise you can download them directly from me.

Thanks, interesting stuff.

BTW: Does the GL95 algorithm handle outer joins?

- Heikki


Re: Materialized views (Was Re: Improving count(*))

From
Nicolas Barbier
Date:
On 11/20/05, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> On Sat, 19 Nov 2005, Nicolas Barbier wrote:
>
> > You might want to take a look at the pages that I set up to track the
> > progress on my master's thesis:
> >
> > <url:http://www.nicolas.barbier.easynet.be/itsme/thesis/>
> >
> > especially the literature page:
> >
> > <url:http://www.nicolas.barbier.easynet.be/itsme/thesis/literature/>
> >
> > IMO, GL95, Qua97 and GM99 are the ones that are most applicable to
> > view maintenance with bag-semantics (thus, SQL). You should be able to
> > find all these papers with Google (Scholar) in case my computer is
> > shut down, otherwise you can download them directly from me.
>
> Thanks, interesting stuff.
>
> BTW: Does the GL95 algorithm handle outer joins?

No, but GM99 does (although only in the cases where it can be
applied). I guess that a slightly adapted version of the technique
from Qua97 can also be used. Investigating :-).

greetings,
Nicolas

--
Nicolas Barbier
http://www.gnu.org/philosophy/no-word-attachments.html


Re: Improving count(*)

From
"Jim C. Nasby"
Date:
On Fri, Nov 18, 2005 at 12:08:03AM +0000, Simon Riggs wrote:
> The trouble is, people moan and constantly. Perhaps we should stick to
> our guns and say, why do you care? From here, I think we should say,
> "show me an application package that needs this so badly we'll change
> PostgreSQL just for them". Prove it and we'll do it. Kinda polite in the
> TODO, but I think we should put something in there that says "things we
> haven't yet had any good reason to improve".

FWIW, this is one of Tom Kyte's (of http://asktom.oracle.com fame) big
complaints: if you have a query where count(*) isn't nearly instant then
you probably don't need an exact count in the first place and should be
happy enough with an estimate. He constantly cites Google ('Result 1-10
of about 38,923') as an example of this.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving count(*)

From
"Jim C. Nasby"
Date:
On Fri, Nov 18, 2005 at 02:56:52PM -0500, Gregory Maxwell wrote:
> However, some great ideas have been proposed here which would not only
> help in that case but would otherwise be quite useful.
> 
> *Inclusion of a 'MVCC inflight' bit in indexes which would allow
> skipping MVCC checks in clumps of an index scan which have no pending
> changes. This would further close the performance gap between PG and
> non-MVCC databases for some workloads.
> *Introduction of high performance table sampling, which would be
> useful in many applications (including counting where there is a where
> clause) as well as for testing and adhoc queries.
> and
> *a estimate_count() that provides the planner estimate, which would
> return right away and provide what is really needed most of the time
> people try to count(*) on a large table.

What about Greg Stark's idea of combining Simon's idea of storing
per-heap-block xmin/xmax with using that information in an index scan?
ISTM that's the best of everything that's been presented: it allows for
faster index scans without adding a lot of visibility overhead to the
index heap, and it also allows VACUUM to hit only pages that need
vacuuming. Presumably this could also be used as the on-disk backing for
the FSM, or it could potentially replace the FSM.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Improving count(*)

From
Gregory Maxwell
Date:
On 11/21/05, Jim C. Nasby <jnasby@pervasive.com> wrote:
> What about Greg Stark's idea of combining Simon's idea of storing
> per-heap-block xmin/xmax with using that information in an index scan?
> ISTM that's the best of everything that's been presented: it allows for
> faster index scans without adding a lot of visibility overhead to the
> index heap, and it also allows VACUUM to hit only pages that need
> vacuuming. Presumably this could also be used as the on-disk backing for
> the FSM, or it could potentially replace the FSM.

This should be a big win all around, especially now since in-memory
bitmaps make it more likely that some classes of queries will be pure
index scans.  I still think it would be useful to have an
estimated_count() which switches to whatever method is needed to get a
reasonably accurate count quickly (stats when there are no where
clauses we can't predict, sampling otherwise if the involved tables
are large, and a normal count in other cases).
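
A hypothetical sketch of such an estimated_count() dispatcher (names and
thresholds are invented, and real sampling would work block-wise rather
than row-wise): planner statistics for the unqualified case, sampling
for large qualified counts, exact counting otherwise.

```python
# Sketch: pick the cheapest counting method that is accurate enough.
import random

def estimated_count(rows, predicate=None, stats_estimate=None,
                    sample_threshold=10_000, sample_size=1_000):
    if predicate is None and stats_estimate is not None:
        return stats_estimate                      # planner statistics
    if predicate is None:
        return len(rows)                           # no stats available
    if len(rows) > sample_threshold:
        sample = random.sample(rows, sample_size)  # sample, then scale up
        hit_rate = sum(1 for r in sample if predicate(r)) / sample_size
        return int(round(hit_rate * len(rows)))
    return sum(1 for r in rows if predicate(r))    # small table: exact
```

The sampled answer is only approximate, which is exactly the contract an
estimated_count() would advertise.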


Re: Improving count(*)

From
Bruce Momjian
Date:
Gregory Maxwell wrote:
> On 11/21/05, Jim C. Nasby <jnasby@pervasive.com> wrote:
> > What about Greg Stark's idea of combining Simon's idea of storing
> > per-heap-block xmin/xmax with using that information in an index scan?
> > ISTM that's the best of everything that's been presented: it allows for
> > faster index scans without adding a lot of visibility overhead to the
> > index heap, and it also allows VACUUM to hit only pages that need
> > vacuuming. Presumably this could also be used as the on-disk backing for
> > the FSM, or it could potentially replace the FSM.
> 
> This should be a big win all around, especially now since in memory
> bitmaps make it more likely that some classes of queries will be pure
> index.  I still think it would be useful to have a estimated_count()
> which switches to whatever method is needed to get a reasonably
> accurate count quickly (stats when there are no wheres we can't
> predict, sampling otherwise if the involved tables are large, and a
> normal count in other cases.)

Added to TODO:

* Add estimated_count(*) to return an estimate of COUNT(*)

  This would use the planner ANALYZE statistics to return an estimated
  count.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Improving count(*)

From
Bruce Momjian
Date:
mark@mark.mielke.cc wrote:
> On Fri, Nov 18, 2005 at 03:46:42PM +0000, Richard Huxton wrote:
> > Simon Riggs wrote:
> > >One of the major complaints is always "Select count(*) is slow".
> > Although there seem to have been plenty of ideas on this they all seem 
> > to just provide a solution for the "whole table" case. It might be that 
> > the solution provides other benefits, but for this one case it does seem 
> > like a lot of work.
> 
> Or, it wasn't explained properly as to how the WHERE clause would
> function.
> 
> The solution involving an index that has visibility information should
> work fine with a WHERE clause. Only index rows that match the clause
> are counted.
> 
> A solution enhancing the above mentioned indexes, to maintain a count
> for whole index blocks, would allow whole index blocks that satisfy
> the WHERE clause to be counted, assuming the whole index block is
> visible in the current transaction.

I think it would be very difficult to generate a per-index-page
visibility bit because I can't think of a normal database operation that
would allow us to update it.  It requires that an index scan visit every
heap page to check the visibility of the tuples.  However, we almost
never do a full-index scan because it is so much slower than a heap
scan.  It would be easy to keep a heap-visible bit up-to-date (because
we do full-heap scans occasionally), but that would require the index
to load the heap page to find the bit.  (The bit isn't on the index, it
is on the heap).

Jan has been talking about having a bitmap to track pages that need
vacuuming, and I am wondering if the same system could be used to track
the heap-dirty bits.  Putting one bit on every 8k disk page means we have
to load the 8k page to read the bit, while a centralized bitmap would
load 64k page bits in a single 8k page.  That one page would cover 500MB
of a table.  Seems vacuum could use the same bitmap values.

Assuming we track 100 tables regularly, that would require 800k of shared
memory.  We would have to pagein/out the bitmaps as needed, but we could
throw them away on a crash and rebuild as part of normal operation.

FSM has not been a concurrency bottleneck, so I am thinking this would
not be either.

I suppose it would require a new filesystem file for each table.
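
The arithmetic behind the figures above can be checked with a short
sketch (an illustration, not PostgreSQL code): one bit per 8k heap page,
packed into 8k bitmap pages, gives 65,536 tracked heap pages per bitmap
page, i.e. 512MB of table per bitmap page (the "500MB" above), and 100
tables at one bitmap page each is the 800k of shared memory mentioned.

```python
# Back-of-envelope check of the centralized heap-visibility bitmap.
BLOCK_SIZE = 8 * 1024                    # bytes per heap page (8k)
BITS_PER_BITMAP_PAGE = BLOCK_SIZE * 8    # 65,536 heap pages per 8k bitmap page

def bitmap_pages_needed(table_bytes):
    """How many 8k bitmap pages cover a table of the given size."""
    heap_pages = -(-table_bytes // BLOCK_SIZE)          # ceiling division
    return -(-heap_pages // BITS_PER_BITMAP_PAGE)

# one bitmap page covers 65,536 * 8k = 512MB of heap
COVERAGE_PER_BITMAP_PAGE = BITS_PER_BITMAP_PAGE * BLOCK_SIZE
```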

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Improving count(*)

From
Bruce Momjian
Date:
Jim C. Nasby wrote:
> On Fri, Nov 18, 2005 at 02:56:52PM -0500, Gregory Maxwell wrote:
> > However, some great ideas have been proposed here which would not only
> > help in that case but would otherwise be quite useful.
> > 
> > *Inclusion of a 'MVCC inflight' bit in indexes which would allow
> > skipping MVCC checks in clumps of an index scan which have no pending
> > changes. This would further close the performance gap between PG and
> > non-MVCC databases for some workloads.
> > *Introduction of high performance table sampling, which would be
> > useful in many applications (including counting where there is a where
> > clause) as well as for testing and adhoc queries.
> > and
> > *a estimate_count() that provides the planner estimate, which would
> > return right away and provide what is really needed most of the time
> > people try to count(*) on a large table.
> 
> What about Greg Stark's idea of combining Simon's idea of storing
> per-heap-block xmin/xmax with using that information in an index scan?
> ISTM that's the best of everything that's been presented: it allows for
> faster index scans without adding a lot of visibility overhead to the
> index heap, and it also allows VACUUM to hit only pages that need
> vacuuming. Presumably this could also be used as the on-disk backing for
> the FSM, or it could potentially replace the FSM.

Right, but xmin/xmax is too detailed.  We just need a single bit to say
all the rows in the heap page are visible to everyone.  Seem my earlier
posting.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Improving count(*)

From
"Jim C. Nasby"
Date:
On Tue, Nov 22, 2005 at 06:11:01PM -0500, Bruce Momjian wrote:
> mark@mark.mielke.cc wrote:
> Jan has been talking about have a bitmap to track pages that need
> vacuuming, and I am wondering if the same system could be used to track
> the heap-dirty bits.  Putting one bit on every 8k disk page means we have
> to load the 8k page to read the bit, while a centralized bitmap would
> load 64k page bits in a single 8k page.  That one page would cover 500MB
> of a table.  Seems vacuum could use the same bitmap values.
> 
> Assuming we track 100 tables regularly, that would require 800k of shared
> memory.  We would have to pagein/out the bitmaps as needed, but we could
> throw them away on a crash and rebuild as part of normal operation.
> 
> FSM has not been a concurrency bottleneck, so I am thinking this would
> not be either.
> 
> I suppose it would require a new filesystem file for each table.

ISTM that the requirements here are very similar to the requirements for
the FSM, at least from a high-level: Track all pages where condition X
is true. Is there value to using the same framework for both cases?
Maybe it makes more sense to store free space info for a relation using
a bitmap, rather than storing individual page numbers. Or conversely,
storing 'dirty page info' should maybe be done in a similar fashion to
FSM instead of a bitmap.

If we wanted to provide the ultimate in tunability we'd offer both
solutions; some tables will have a large number of pages with free space
on them (which would favor storing that info in a bitmap); likewise some
tables will have a small number of pages that are 'dirty', which would
favor storing a list of page numbers instead of a bitmap.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
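
The tradeoff Jim describes can be made concrete: a bitmap costs one bit
per heap page regardless of how many pages are flagged, while a list of
32-bit page numbers costs four bytes per flagged page, so the list wins
only when fewer than 1/32 of the pages are flagged. A rough sketch
(entry sizes assumed, not measured):

```python
# Sketch of the bitmap-vs-page-list storage tradeoff.

def bitmap_bytes(total_pages):
    """Bitmap: one bit per heap page, however many are flagged."""
    return -(-total_pages // 8)          # ceiling division to whole bytes

def page_list_bytes(flagged_pages, entry_bytes=4):
    """Plain list of 32-bit page numbers: four bytes per flagged page."""
    return flagged_pages * entry_bytes

def prefer_list(total_pages, flagged_pages):
    """True when the explicit page list is smaller than the bitmap."""
    return page_list_bytes(flagged_pages) < bitmap_bytes(total_pages)
```

For a 320,000-page table the bitmap is a fixed 40,000 bytes, so the list
representation is only smaller while fewer than 10,000 pages are flagged.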


Re: Improving count(*)

From
mark@mark.mielke.cc
Date:
On Tue, Nov 22, 2005 at 06:11:01PM -0500, Bruce Momjian wrote:
> mark@mark.mielke.cc wrote:
> > A solution enhancing the above mentioned indexes, to maintain a count
> > for whole index blocks, would allow whole index blocks that satisfy
> > the WHERE clause to be counted, assuming the whole index block is
> > visible in the current transaction.
> I think it would be very difficult to generate a per-index-page
> visibility bit because I can't think of a normal database operation that
> would allow us to update it.  It requires that an index scan visit every
> heap page to check the visibility of the tuples.  However, we almost
> never do a full-index scan because it is so much slower than a heap
> scan.  It would be easy to keep a heap-visible bit up-to-date (because
> we do full-heap scans occasionally), but that would require the index
> to load the heap page to find the bit.  (The bit isn't on the index, it
> is on the heap).

Vacuum time?

> Jan has been talking about have a bitmap to track pages that need
> vacuuming, and I am wondering if the same system could be used to track
> the heap-dirty bits.  Putting one bit on every 8k disk page means we have
> to load the 8k page to read the bit, while a centralized bitmap would
> load 64k page bits in a single 8k page.  That one page would cover 500MB
> of a table.  Seems vacuum could use the same bitmap values.

Sounds correct.

> Assuming we track 100 tables regularly, that would require 800k of shared
> memory.  We would have to pagein/out the bitmaps as needed, but we could
> throw them away on a crash and rebuild as part of normal operation.
> FSM has not been a concurrency bottleneck, so I am thinking this would
> not be either.
> I suppose it would require a new filesystem file for each table.

*nod*

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring
 them all and in the darkness bind them...

                          http://mark.mielke.cc/



Re: Materialized views (Was Re: Improving count(*))

From
Josh Berkus
Date:
Guys,

I'm not going to get the university research up before American Thanksgiving.   
Sorry.  Look for it next week.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Improving count(*)

From
Bruce Momjian
Date:
Add idea to TODO:

* Allow data to be pulled directly from indexes

  Currently indexes do not have enough tuple visibility information
  to allow data to be pulled from the index without also accessing
  the heap.  One way to allow this is to set a bit on index tuples
  to indicate if a tuple is currently visible to all transactions
  when the first valid heap lookup happens.  This bit would have to
  be cleared when a heap tuple is expired.  Another idea is to
  maintain a bitmap of heap pages where all rows are visible to all
  backends, and allow index lookups to reference that bitmap to avoid
  heap lookups, perhaps the same bitmap we might add someday to
  determine which heap pages need vacuuming.

I am thinking we are at the point where between this usage and vacuum
that a bitmap for each table is something we should work on for 8.2.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> mark@mark.mielke.cc wrote:
> > On Fri, Nov 18, 2005 at 03:46:42PM +0000, Richard Huxton wrote:
> > > Simon Riggs wrote:
> > > >One of the major complaints is always "Select count(*) is slow".
> > > Although there seem to have been plenty of ideas on this they all seem 
> > > to just provide a solution for the "whole table" case. It might be that 
> > > the solution provides other benefits, but for this one case it does seem 
> > > like a lot of work.
> > 
> > Or, it wasn't explained properly as to how the WHERE clause would
> > function.
> > 
> > The solution involving an index that has visibility information should
> > work fine with a WHERE clause. Only index rows that match the clause
> > are counted.
> > 
> > A solution enhancing the above mentioned indexes, to maintain a count
> > for whole index blocks, would allow whole index blocks that satisfy
> > the WHERE clause to be counted, assuming the whole index block is
> > visible in the current transaction.
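Mark's per-index-block count idea can be sketched as follows. Everything here is hypothetical (the IndexBlockSummary type and range_count function are invented for illustration): a count with a range WHERE clause adds whole blocks that are all-visible and fall entirely inside the bounds, and walks individual entries only for boundary or not-all-visible blocks. Per-tuple MVCC checks for the partial case are omitted for brevity.

```c
#include <stdbool.h>

/* Hypothetical per-index-block summary: key range, live-entry count,
 * and a flag saying every entry is visible to all transactions. */
typedef struct {
    int  min_key, max_key;   /* key range covered by this index block */
    int  ntuples;            /* live entries in the block */
    bool all_visible;        /* every entry visible to all transactions */
    const int *keys;         /* entry keys, consulted only for partial blocks */
} IndexBlockSummary;

/* Count entries with lo <= key <= hi.  Whole blocks that qualify are
 * counted without visiting their entries. */
long range_count(const IndexBlockSummary *blocks, int nblocks, int lo, int hi)
{
    long count = 0;
    for (int i = 0; i < nblocks; i++) {
        const IndexBlockSummary *b = &blocks[i];
        if (b->max_key < lo || b->min_key > hi)
            continue;                        /* block outside WHERE range */
        if (b->all_visible && lo <= b->min_key && b->max_key <= hi)
            count += b->ntuples;             /* whole block, no entry visits */
        else
            for (int j = 0; j < b->ntuples; j++)
                if (b->keys[j] >= lo && b->keys[j] <= hi)
                    count++;                 /* boundary block: per entry */
    }
    return count;
}
```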
> 
> I think it would be very difficult to generate a per-index-page
> visibility bit because I can't think of a normal database operation that
> would allow us to update it.  It requires that an index scan visit every
> heap page to check the visibility of the tuples.  However, we almost
> never do a full-index scan because it is so much slower than a heap
> scan.  It would be easy to keep a heap-visible bit up-to-date (because
> we do full-heap scans occasionally), but that would require the index
> to load the heap page to find the bit.  (The bit isn't on the index, it
> is on the heap).
> 
> Jan has been talking about having a bitmap to track pages that need
> vacuuming, and I am wondering if the same system could be used to track
> the heap-dirty bits.  Putting one bit on every 8k disk page means we have
> to load the 8k page to read the bit, while a centralized bitmap would
> load 64k page bits in a single 8k page.  That one page would cover 500MB
> of a table.  Seems vacuum could use the same bitmap values.
> 
> Assuming we track 100 tables regularly, that would require 800k of shared
> memory.  We would have to pagein/out the bitmaps as needed, but we could
> throw them away on a crash and rebuild as part of normal operation.
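The sizing arithmetic above can be spelled out explicitly (the function names below are invented for illustration): one visibility bit per 8 kB heap page means an 8 kB bitmap page holds 65,536 bits, covering 65,536 × 8 kB = 512 MB of heap (the "500MB" figure, roughly), and one bitmap page per table for 100 tables is 100 × 8 kB = 800 kB of shared memory.

```c
/* The sizing arithmetic from the quoted paragraph, spelled out.
 * BLCKSZ is PostgreSQL's default 8 kB disk page size. */
#define BLCKSZ 8192L

/* One visibility bit per heap page: an 8 kB bitmap page holds
 * BLCKSZ * 8 = 65536 bits. */
long bitmap_bits_per_page(void) { return BLCKSZ * 8; }

/* Heap bytes covered by one bitmap page: 65536 pages * 8 kB = 512 MB. */
long heap_bytes_covered(void) { return bitmap_bits_per_page() * BLCKSZ; }

/* Shared memory needed to hold one bitmap page per tracked table. */
long shared_mem_bytes(int ntables) { return ntables * BLCKSZ; }
```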
> 
> FSM has not been a concurrency bottleneck, so I am thinking this would
> not be either.
> 
> I suppose it would require a new filesystem file for each table.
> 
> -- 
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 359-1001
>   +  If your life is a hard drive,     |  13 Roberts Road
>   +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073