Thread: Proposal: COUNT(*) (and related) speedup

Proposal: COUNT(*) (and related) speedup

From
Joshua Yanovski
Date:
Hey all,

First off, please feel free to let me know if this idea is a waste of time :)

I was thinking about COUNT(*) and its slow performance today, and I
thought about whether we could get better performance by taking a page
out of index-only-scans.

Essentially, the idea is that you would store a counter (let's say, as
a special index type) that would initially (on index creation) be set
to the total count of
all rows on fully visible pages (visibility map bit set to 1).
Whenever a query was made on the table for COUNT(*), we would not even
bother to visit any pages
that were in the visibility map, but instead visit the heap for every
page that was not marked visible and compute the delta between the
count of tuples visible
at relfrozenxid and tuples currently visible.  The deltas on all the
pages would then be summed with the original counter and returned as
the count of the table.
The counter would be updated only by VACUUM, which would perform the
same computation performed by the COUNT operation but add it
permanently to counter just before
it set the visible_map bit to 1 (I am not certain whether this would
require unusual changes to WAL replay or not).  There may be some
issues with this idea, since
I am not too familiar with the Postgres codebase, but Andrew Gierth
was kind enough to put up with my bugging him in the IRC channel all
night and helped me work through
some of this stuff so I think it should at least be coherent.

The upshot of the above proposal is that in situations where an index
only scan would have helped before, now you might be able to skip
almost all the I/O for COUNT(*),
while still getting a completely accurate result.

The disadvantages are pretty clear, I think, but let me list them
anyway (that I can think of):
* Adding an extra index type that can do only one thing
* Adding an index type that can only do that one thing efficiently if
you happen to want to scan over the entire table
* Adding an index type for a use case that isn't very compelling
(SELECT COUNT(*) FROM table)
* There may be additional concerns about this being costed correctly,
although I think there is probably enough information to make an
informed decision here.
* It can't be added as a contrib module (probably) and also be crash-safe.

However, if the patch for negative transition aggregate functions
lands, I think this idea becomes a bit more compelling.  Then instead
of just being a technique for
COUNT(*), it effectively becomes a very low-memory way of speeding up
a large number of aggregate queries (any that have an inverse) while
maintaining accuracy, in
situations where most of the pages never get modified.  The only
change to the above algorithm is that delta is computed as application
of the inverse function to "old"
tuples and the direct function to "new" ones on the page (I think this
can likely be done during vacuum index cleanup).

Please critique this idea and let me know whether it is worth pursuing further.

Thanks!

-- Joshua Yanovski



Re: Proposal: COUNT(*) (and related) speedup

From
Joshua Yanovski
Date:
From feedback on IRC, two immediately obvious technical problems:

* Heap pruning can happen at any time, not just during VACUUM and HOT
updates.  This is obviously a pretty significant issue and I doubt the
easy solution (don't do heap pruning for tables with an index like
this) would be acceptable.
* While relfrozenxid works for this in 9.4, it's not really related to
this purpose and its implementation could change at any time, so it is
not reasonable to rely on it.  A solution that would still work, but
not be as fast, would be to just not have a minimum for calculating
deltas.

On Fri, Apr 4, 2014 at 5:38 AM, Joshua Yanovski <pythonesque@gmail.com> wrote:
> Hey all,
>
> First off, please feel free to let me know if this idea is a waste of time :)
>
> I was thinking about COUNT(*) and its slow performance today, and I
> thought about whether we could get better performance by taking a page
> out of index-only-scans.
>
> Essentially, the idea is that you would store a counter (let's say, as
> a special index type) that would initially (on index creation) be set
> to the total count of
> all rows on fully visible pages (visibility map bit set to 1).
> Whenever a query was made on the table for COUNT(*), we would not even
> bother to visit any pages
> that were in the visibility map, but instead visit the heap for every
> page that was not marked visible and compute the delta between the
> count of tuples visible
> at relfrozenxid and tuples currently visible.  The deltas on all the
> pages would then be summed with the original counter and returned as
> the count of the table.
> The counter would be updated only by VACUUM, which would perform the
> same computation performed by the COUNT operation but add it
> permanently to counter just before
> it set the visible_map bit to 1 (I am not certain whether this would
> require unusual changes to WAL replay or not).  There may be some
> issues with this idea, since
> I am not too familiar with the Postgres codebase, but Andrew Gierth
> was kind enough to put up with my bugging him in the IRC channel all
> night and helped me work through
> some of this stuff so I think it should at least be coherent.
>
> The upshot of the above proposal is that in situations where an index
> only scan would have helped before, now you might be able to skip
> almost all the I/O for COUNT(*),
> while still getting a completely accurate result.
>
> The disadvantages are pretty clear, I think, but let me list them
> anyway (that I can think of):
> * Adding an extra index type that can do only one thing
> * Adding an index type that can only do that one thing efficiently if
> you happen to want to scan over the entire table
> * Adding an index type for a use case that isn't very compelling
> (SELECT COUNT(*) FROM table)
> * There may be additional concerns about this being costed correctly,
> although I think there is probably enough information to make an
> informed decision here.
> * It can't be added as a contrib module (probably) and also be crash-safe.
>
> However, if the patch for negative transition aggregate functions
> lands, I think this idea becomes a bit more compelling.  Then instead
> of just being a technique for
> COUNT(*), it effectively becomes a very low-memory way of speeding up
> a large number of aggregate queries (any that have an inverse) while
> maintaining accuracy, in
> situations where most of the pages never get modified.  The only
> change to the above algorithm is that delta is computed as application
> of the inverse function to "old"
> tuples and the direct function to "new" ones on the page (I think this
> can likely be done during vacuum index cleanup).
>
> Please critique this idea and let me know whether it is worth pursuing further.
>
> Thanks!
>
> -- Joshua Yanovski



-- 
Josh



Re: Proposal: COUNT(*) (and related) speedup

From
Tom Lane
Date:
Joshua Yanovski <pythonesque@gmail.com> writes:
> Essentially, the idea is that you would store a counter (let's say, as
> a special index type) that would initially (on index creation) be set
> to the total count of
> all rows on fully visible pages (visibility map bit set to 1).

It seems to me this can't possibly work because of race conditions.
In particular, what happens when some query dirties a page and thereby
clears its fully-visible bit?  Presumably, any such query would have
to (1) recompute the number of all-visible rows on that page (already
an expensive thing) and then (2) go and subtract that from the counter
(meaning the counter becomes a serialization bottleneck for all updates
on the table, which is exactly the reason we don't just have a globally
maintained row counter already).  But worse, what happens if a count(*)
is in progress?  It might or might not have scanned this page already,
and there's no way to get the right answer in both cases.  Counter
updates done by VACUUM would have a similar race-condition problem.

> Please critique this idea and let me know whether it is worth pursuing further.

I doubt it.
        regards, tom lane



Re: Proposal: COUNT(*) (and related) speedup

From
Joshua Yanovski
Date:
 t> It seems to me this can't possibly work because of race conditions.
> In particular, what happens when some query dirties a page and thereby
> clears its fully-visible bit?  Presumably, any such query would have
> to (1) recompute the number of all-visible rows on that page (already
> an expensive thing) and then (2) go and subtract that from the counter
> (meaning the counter becomes a serialization bottleneck for all updates
> on the table, which is exactly the reason we don't just have a globally
> maintained row counter already)..
Unless what you're saying is that the index creation itself invites
race conditions, I don't think this is true.  The general idea here is
to not actually compute the number of rows on a page (or whatever
other aggregate) unless (1) the index is being created, (2) it's
requested by the user, (3) VACUUM is occurring, and not to update the
counter except on (1) and (3).  I am not sure off the top of my head
about (1), but it seems from comments in vac_update_relstats that we
are already making the implicit assumption that VACUUM is not running
concurrently with another VACUUM on the same table.  A query that
dirties a page doesn't have to perform any additional work than it
does already, under any circumstances, under this model.

> But worse, what happens if a count(*)
> is in progress?  It might or might not have scanned this page already,
> and there's no way to get the right answer in both cases.  Counter
> updates done by VACUUM would have a similar race-condition problem.
I don't think the first part really matters.  If the page was visible
when COUNT(*) started and then got dirtied by some other transaction,
any changes by the second transaction shouldn't alter the actual count
in our transaction anyway, so whether we scan the page needlessly or
not seems beside the point.  If that did matter, index only scans
using COUNT(*) wouldn't work properly, since they make the same basic
assumption.

VACUUM counter updates, on the other hand, initially seem more
problematic, since if we grab the value of the counter, then VACUUM
updates the counter and the visbility bits, and then we check which
visibility bits are set, we might skip pages we really need to check
(since we're using an old value for the counter).  But in practice I
am pretty sure this is a nonissue, because as long as our COUNT(*)
transaction is going, no pages can have their visibility bits marked
unless they haven't been updated since the transaction started, which
consequently means we won't see the counter updated (the counter is
intended to be updated immediately before the visibility bit is set
while VACUUMing).  Took me a while to figure this out, for thanks for
pointing it out :)

>
>> Please critique this idea and let me know whether it is worth pursuing further.
>
> I doubt it.
So do I, but not for the reasons you listed, but because we can't rely
on dead tuples sticking around for us to count them and generate
deltas consistently.  I think there is probably a way to get around
that problem in a sane way with a more traditional index, though.
>
>                         regards, tom lane



-- 
Josh



Re: Proposal: COUNT(*) (and related) speedup

From
Joshua Yanovski
Date:
> VACUUM counter updates, on the other hand, initially seem more
> problematic, since if we grab the value of the counter, then VACUUM
> updates the counter and the visbility bits, and then we check which
> visibility bits are set, we might skip pages we really need to check
> (since we're using an old value for the counter).  But in practice I
> am pretty sure this is a nonissue, because as long as our COUNT(*)
> transaction is going, no pages can have their visibility bits marked
> unless they haven't been updated since the transaction started, which
> consequently means we won't see the counter updated (the counter is
> intended to be updated immediately before the visibility bit is set
> while VACUUMing).  Took me a while to figure this out, for thanks for
> pointing it out :)

Clarification: the counter may be updated, but never in a way that
affects the final result, because it can only be updated to reflect
statistics on pages that are fully visible to all current
transactions.  As long as we scan every page that was "really" dirty
at the start of our transactions, we can't really go wrong, since we
can recompute that information ourselves--we can just be better or
worse at avoiding redundant heap fetches.

-- 
Josh



Re: Proposal: COUNT(*) (and related) speedup

From
Tom Lane
Date:
Joshua Yanovski <pythonesque@gmail.com> writes:
>> But worse, what happens if a count(*)
>> is in progress?  It might or might not have scanned this page already,
>> and there's no way to get the right answer in both cases.  Counter
>> updates done by VACUUM would have a similar race-condition problem.

> I don't think the first part really matters.  If the page was visible
> when COUNT(*) started and then got dirtied by some other transaction,
> any changes by the second transaction shouldn't alter the actual count
> in our transaction anyway, so whether we scan the page needlessly or
> not seems beside the point.

The question is not whether you scan a page "needlessly" or not, it's
whether you double-count the tuples on it.  I don't think it's possible to
be sure that when you add the central counter value into your local sum,
you are neither double-counting nor omitting pages whose all-visible state
changed while you were scanning the table.
        regards, tom lane



Re: Proposal: COUNT(*) (and related) speedup

From
Joshua Yanovski
Date:
Yeah, you're right,  I believe that every code path in VACUUM that
leads to the visibility map bit being set also leads to all remaining
tuples on the page being frozen.  So in a world without heap pruning,
frozen should be a reliable proxy for "value of the tuple the last
time it was added to the counter".  If you count -1 for a frozen
tuple, and 1 for a visible tuple, you should end up at precisely the
delta from the last time the page was turned visible.

It seems to me that with this definition of delta, the concurrent
COUNT(*) case is fine.  The counter isn't affected by the dirtying of
the page, nor is the delta for the page.  The result will be
completely consistent with what it would have been had the page dirty
never happened. But concurrent VACUUM under this definition fails.  So
yeah, I'll drop this until I have a concrete idea of how (or if) to
proceed.  Thanks for the feedback.



Re: Proposal: COUNT(*) (and related) speedup

From
Rajeev rastogi
Date:
On 04 April 2014 18:09, Joshua Yanovski Wrote:

> The counter would be updated only by VACUUM, which would perform the 
> same computation performed by the COUNT operation but add it 
> permanently to counter just before it set the visible_map bit to 1 (I 
> am not certain whether this would require unusual changes to WAL 
> replay or not).

I think there is one more disadvantages with this (or maybe I have missed something): User may not use count(*) or may
usejust once after creating new kind of index proposed but VACUUM will have to keep performing operation equivalent to
iterativecount(*), which might degrade performance for VACUUM.
 

Thanks and Regards,
Kumar Rajeev Rastogi


Re: Proposal: COUNT(*) (and related) speedup

From
Robert Haas
Date:
On Fri, Apr 4, 2014 at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joshua Yanovski <pythonesque@gmail.com> writes:
>>> But worse, what happens if a count(*)
>>> is in progress?  It might or might not have scanned this page already,
>>> and there's no way to get the right answer in both cases.  Counter
>>> updates done by VACUUM would have a similar race-condition problem.
>
>> I don't think the first part really matters.  If the page was visible
>> when COUNT(*) started and then got dirtied by some other transaction,
>> any changes by the second transaction shouldn't alter the actual count
>> in our transaction anyway, so whether we scan the page needlessly or
>> not seems beside the point.
>
> The question is not whether you scan a page "needlessly" or not, it's
> whether you double-count the tuples on it.  I don't think it's possible to
> be sure that when you add the central counter value into your local sum,
> you are neither double-counting nor omitting pages whose all-visible state
> changed while you were scanning the table.

I think it would work if you also had information about which pages
you needed to scan.  For example, suppose you had a data structure
sorta like the visibility map, except that instead of one bit per page
you have one 16-bit integer per page.  The integer is -1 if the page
isn't known to be all-visible, and is otherwise the count of tuples on
the page.  Now, we can do a count(*) as follows:

1. Lock the first as-yet-unexamined page of the data structure.
2. Add all the values that aren't -1 to your running count of tuples,
and remember the page numbers corresponding to the remaining entries.
3. Unlock the page you locked in step 1.
4. Visit each heap page you remembered in step 2 and add the number of
tuples visible to your scan to your running count of tuples.
5. Provided there's a page of the data structure you haven't visited
yet, go to step 1.

On the write side, any operation that clears the visibility map bit
must also set the entry for that page to -1.  When the page is
all-visible, the value can be set to the total number of tuples on the
page.  This is safe because, even if the page ceases to be all-visible
after we add its tuples to the count, the new tuples that were added
aren't visible to our scan, and any tuples deleted are still visible
to our scan - because the transaction making the changes, in either
case, is obviously still running, and therefore didn't commit before
we took our snapshot.

Now, this wouldn't be O(1) but it would presumably be quite fast if
your visibility map bits are mostly set.  One 8kB page of 16-bit
counters would cover 32MB of heap.

The bad news is that I am pretty sure there's no easy way for an index
AM to get callbacks at the right time, so it's really unclear how
something like this would fit in with our existing abstractions.  And
even if we answered that question, it would be a lot of work for a
pretty narrow use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company