Thread: Proposal: COUNT(*) (and related) speedup
Hey all, First off, please feel free to let me know if this idea is a waste of time :) I was thinking about COUNT(*) and its slow performance today, and I thought about whether we could get better performance by taking a page out of index-only-scans. Essentially, the idea is that you would store a counter (let's say, as a special index type) that would initially (on index creation) be set to the total count of all rows on fully visible pages (visibility map bit set to 1). Whenever a query was made on the table for COUNT(*), we would not even bother to visit any pages that were in the visibility map, but instead visit the heap for every page that was not marked visible and compute the delta between the count of tuples visible at relfrozenxid and tuples currently visible. The deltas on all the pages would then be summed with the original counter and returned as the count of the table. The counter would be updated only by VACUUM, which would perform the same computation performed by the COUNT operation but add it permanently to counter just before it set the visible_map bit to 1 (I am not certain whether this would require unusual changes to WAL replay or not). There may be some issues with this idea, since I am not too familiar with the Postgres codebase, but Andrew Gierth was kind enough to put up with my bugging him in the IRC channel all night and helped me work through some of this stuff so I think it should at least be coherent. The upshot of the above proposal is that in situations where an index only scan would have helped before, now you might be able to skip almost all the I/O for COUNT(*), while still getting a completely accurate result. The disadvantages are pretty clear, I think, but let me list them anyway (that I can think of): * Adding an extra index type that can do only one thing * Adding an index type that can only do that one thing efficiently if you happen to want to scan over the entire table * Adding an index type for a use case that isn't very compelling (SELECT COUNT(*) FROM table) * There may be additional concerns about this being costed correctly, although I think there is probably enough information to make an informed decision here. * It can't be added as a contrib module (probably) and also be crash-safe. However, if the patch for negative transition aggregate functions lands, I think this idea becomes a bit more compelling. Then instead of just being a technique for COUNT(*), it effectively becomes a very low-memory way of speeding up a large number of aggregate queries (any that have an inverse) while maintaining accuracy, in situations where most of the pages never get modified. The only change to the above algorithm is that delta is computed as application of the inverse function to "old" tuples and the direct function to "new" ones on the page (I think this can likely be done during vacuum index cleanup). Please critique this idea and let me know whether it is worth pursuing further. Thanks! -- Joshua Yanovski
From feedback on IRC, two immediately obvious technical problems: * Heap pruning can happen at any time, not just during VACUUM and HOT updates. This is obviously a pretty significant issue and I doubt the easy solution (don't do heap pruning for tables with an index like this) would be acceptable. * While relfrozenxid works for this in 9.4, it's not really related to this purpose and its implementation could change at any time, so it is not reasonable to rely on it. A solution that would still work, but not be as fast, would be to just not have a minimum for calculating deltas. On Fri, Apr 4, 2014 at 5:38 AM, Joshua Yanovski <pythonesque@gmail.com> wrote: > Hey all, > > First off, please feel free to let me know if this idea is a waste of time :) > > I was thinking about COUNT(*) and its slow performance today, and I > thought about whether we could get better performance by taking a page > out of index-only-scans. > > Essentially, the idea is that you would store a counter (let's say, as > a special index type) that would initially (on index creation) be set > to the total count of > all rows on fully visible pages (visibility map bit set to 1). > Whenever a query was made on the table for COUNT(*), we would not even > bother to visit any pages > that were in the visibility map, but instead visit the heap for every > page that was not marked visible and compute the delta between the > count of tuples visible > at relfrozenxid and tuples currently visible. The deltas on all the > pages would then be summed with the original counter and returned as > the count of the table. > The counter would be updated only by VACUUM, which would perform the > same computation performed by the COUNT operation but add it > permanently to counter just before > it set the visible_map bit to 1 (I am not certain whether this would > require unusual changes to WAL replay or not). There may be some > issues with this idea, since > I am not too familiar with the Postgres codebase, but Andrew Gierth > was kind enough to put up with my bugging him in the IRC channel all > night and helped me work through > some of this stuff so I think it should at least be coherent. > > The upshot of the above proposal is that in situations where an index > only scan would have helped before, now you might be able to skip > almost all the I/O for COUNT(*), > while still getting a completely accurate result. > > The disadvantages are pretty clear, I think, but let me list them > anyway (that I can think of): > * Adding an extra index type that can do only one thing > * Adding an index type that can only do that one thing efficiently if > you happen to want to scan over the entire table > * Adding an index type for a use case that isn't very compelling > (SELECT COUNT(*) FROM table) > * There may be additional concerns about this being costed correctly, > although I think there is probably enough information to make an > informed decision here. > * It can't be added as a contrib module (probably) and also be crash-safe. > > However, if the patch for negative transition aggregate functions > lands, I think this idea becomes a bit more compelling. Then instead > of just being a technique for > COUNT(*), it effectively becomes a very low-memory way of speeding up > a large number of aggregate queries (any that have an inverse) while > maintaining accuracy, in > situations where most of the pages never get modified. The only > change to the above algorithm is that delta is computed as application > of the inverse function to "old" > tuples and the direct function to "new" ones on the page (I think this > can likely be done during vacuum index cleanup). > > Please critique this idea and let me know whether it is worth pursuing further. > > Thanks! > > -- Joshua Yanovski -- Josh
Joshua Yanovski <pythonesque@gmail.com> writes: > Essentially, the idea is that you would store a counter (let's say, as > a special index type) that would initially (on index creation) be set > to the total count of > all rows on fully visible pages (visibility map bit set to 1). It seems to me this can't possibly work because of race conditions. In particular, what happens when some query dirties a page and thereby clears its fully-visible bit? Presumably, any such query would have to (1) recompute the number of all-visible rows on that page (already an expensive thing) and then (2) go and subtract that from the counter (meaning the counter becomes a serialization bottleneck for all updates on the table, which is exactly the reason we don't just have a globally maintained row counter already). But worse, what happens if a count(*) is in progress? It might or might not have scanned this page already, and there's no way to get the right answer in both cases. Counter updates done by VACUUM would have a similar race-condition problem. > Please critique this idea and let me know whether it is worth pursuing further. I doubt it. regards, tom lane
t> It seems to me this can't possibly work because of race conditions. > In particular, what happens when some query dirties a page and thereby > clears its fully-visible bit? Presumably, any such query would have > to (1) recompute the number of all-visible rows on that page (already > an expensive thing) and then (2) go and subtract that from the counter > (meaning the counter becomes a serialization bottleneck for all updates > on the table, which is exactly the reason we don't just have a globally > maintained row counter already).. Unless what you're saying is that the index creation itself invites race conditions, I don't think this is true. The general idea here is to not actually compute the number of rows on a page (or whatever other aggregate) unless (1) the index is being created, (2) it's requested by the user, (3) VACUUM is occurring, and not to update the counter except on (1) and (3). I am not sure off the top of my head about (1), but it seems from comments in vac_update_relstats that we are already making the implicit assumption that VACUUM is not running concurrently with another VACUUM on the same table. A query that dirties a page doesn't have to perform any additional work than it does already, under any circumstances, under this model. > But worse, what happens if a count(*) > is in progress? It might or might not have scanned this page already, > and there's no way to get the right answer in both cases. Counter > updates done by VACUUM would have a similar race-condition problem. I don't think the first part really matters. If the page was visible when COUNT(*) started and then got dirtied by some other transaction, any changes by the second transaction shouldn't alter the actual count in our transaction anyway, so whether we scan the page needlessly or not seems beside the point. If that did matter, index only scans using COUNT(*) wouldn't work properly, since they make the same basic assumption. VACUUM counter updates, on the other hand, initially seem more problematic, since if we grab the value of the counter, then VACUUM updates the counter and the visbility bits, and then we check which visibility bits are set, we might skip pages we really need to check (since we're using an old value for the counter). But in practice I am pretty sure this is a nonissue, because as long as our COUNT(*) transaction is going, no pages can have their visibility bits marked unless they haven't been updated since the transaction started, which consequently means we won't see the counter updated (the counter is intended to be updated immediately before the visibility bit is set while VACUUMing). Took me a while to figure this out, for thanks for pointing it out :) > >> Please critique this idea and let me know whether it is worth pursuing further. > > I doubt it. So do I, but not for the reasons you listed, but because we can't rely on dead tuples sticking around for us to count them and generate deltas consistently. I think there is probably a way to get around that problem in a sane way with a more traditional index, though. > > regards, tom lane -- Josh
> VACUUM counter updates, on the other hand, initially seem more > problematic, since if we grab the value of the counter, then VACUUM > updates the counter and the visbility bits, and then we check which > visibility bits are set, we might skip pages we really need to check > (since we're using an old value for the counter). But in practice I > am pretty sure this is a nonissue, because as long as our COUNT(*) > transaction is going, no pages can have their visibility bits marked > unless they haven't been updated since the transaction started, which > consequently means we won't see the counter updated (the counter is > intended to be updated immediately before the visibility bit is set > while VACUUMing). Took me a while to figure this out, for thanks for > pointing it out :) Clarification: the counter may be updated, but never in a way that affects the final result, because it can only be updated to reflect statistics on pages that are fully visible to all current transactions. As long as we scan every page that was "really" dirty at the start of our transactions, we can't really go wrong, since we can recompute that information ourselves--we can just be better or worse at avoiding redundant heap fetches. -- Josh
Joshua Yanovski <pythonesque@gmail.com> writes: >> But worse, what happens if a count(*) >> is in progress? It might or might not have scanned this page already, >> and there's no way to get the right answer in both cases. Counter >> updates done by VACUUM would have a similar race-condition problem. > I don't think the first part really matters. If the page was visible > when COUNT(*) started and then got dirtied by some other transaction, > any changes by the second transaction shouldn't alter the actual count > in our transaction anyway, so whether we scan the page needlessly or > not seems beside the point. The question is not whether you scan a page "needlessly" or not, it's whether you double-count the tuples on it. I don't think it's possible to be sure that when you add the central counter value into your local sum, you are neither double-counting nor omitting pages whose all-visible state changed while you were scanning the table. regards, tom lane
Yeah, you're right, I believe that every code path in VACUUM that leads to the visibility map bit being set also leads to all remaining tuples on the page being frozen. So in a world without heap pruning, frozen should be a reliable proxy for "value of the tuple the last time it was added to the counter". If you count -1 for a frozen tuple, and 1 for a visible tuple, you should end up at precisely the delta from the last time the page was turned visible. It seems to me that with this definition of delta, the concurrent COUNT(*) case is fine. The counter isn't affected by the dirtying of the page, nor is the delta for the page. The result will be completely consistent with what it would have been had the page dirty never happened. But concurrent VACUUM under this definition fails. So yeah, I'll drop this until I have a concrete idea of how (or if) to proceed. Thanks for the feedback.
On 04 April 2014 18:09, Joshua Yanovski Wrote: > The counter would be updated only by VACUUM, which would perform the > same computation performed by the COUNT operation but add it > permanently to counter just before it set the visible_map bit to 1 (I > am not certain whether this would require unusual changes to WAL > replay or not). I think there is one more disadvantages with this (or maybe I have missed something): User may not use count(*) or may usejust once after creating new kind of index proposed but VACUUM will have to keep performing operation equivalent to iterativecount(*), which might degrade performance for VACUUM. Thanks and Regards, Kumar Rajeev Rastogi
On Fri, Apr 4, 2014 at 1:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Joshua Yanovski <pythonesque@gmail.com> writes: >>> But worse, what happens if a count(*) >>> is in progress? It might or might not have scanned this page already, >>> and there's no way to get the right answer in both cases. Counter >>> updates done by VACUUM would have a similar race-condition problem. > >> I don't think the first part really matters. If the page was visible >> when COUNT(*) started and then got dirtied by some other transaction, >> any changes by the second transaction shouldn't alter the actual count >> in our transaction anyway, so whether we scan the page needlessly or >> not seems beside the point. > > The question is not whether you scan a page "needlessly" or not, it's > whether you double-count the tuples on it. I don't think it's possible to > be sure that when you add the central counter value into your local sum, > you are neither double-counting nor omitting pages whose all-visible state > changed while you were scanning the table. I think it would work if you also had information about which pages you needed to scan. For example, suppose you had a data structure sorta like the visibility map, except that instead of one bit per page you have one 16-bit integer per page. The integer is -1 if the page isn't known to be all-visible, and is otherwise the count of tuples on the page. Now, we can do a count(*) as follows: 1. Lock the first as-yet-unexamined page of the data structure. 2. Add all the values that aren't -1 to your running count of tuples, and remember the page numbers corresponding to the remaining entries. 3. Unlock the page you locked in step 1. 4. Visit each heap page you remembered in step 2 and add the number of tuples visible to your scan to your running count of tuples. 5. Provided there's a page of the data structure you haven't visited yet, go to step 1. On the write side, any operation that clears the visibility map bit must also set the entry for that page to -1. When the page is all-visible, the value can be set to the total number of tuples on the page. This is safe because, even if the page ceases to be all-visible after we add its tuples to the count, the new tuples that were added aren't visible to our scan, and any tuples deleted are still visible to our scan - because the transaction making the changes, in either case, is obviously still running, and therefore didn't commit before we took our snapshot. Now, this wouldn't be O(1) but it would presumably be quite fast if your visibility map bits are mostly set. One 8kB page of 16-bit counters would cover 32MB of heap. The bad news is that I am pretty sure there's no easy way for an index AM to get callbacks at the right time, so it's really unclear how something like this would fit in with our existing abstractions. And even if we answered that question, it would be a lot of work for a pretty narrow use case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company