Improving count(*) - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Improving count(*) |
Date | |
Msg-id | 1132255690.4959.207.camel@localhost.localdomain Whole thread Raw |
Responses |
Re: Improving count(*)
Re: Improving count(*) Re: Improving count(*) Re: Improving count(*) |
List | pgsql-hackers |
One of the major complaints is always "Select count(*) is slow". I have a somewhat broadbrush idea as to how we might do this (for larger tables). Previously, we've said that maintaining a running count of the tuples in a table is hard, or at least costly. However, maintaining a partial count may be somewhat easier and offer the performance required. Bearing in mind other RDBMS' approach is to count the number of rows in an index, their cost is probably about the same as scanning table blocks/10 very roughly - so the cost is far from zero for them. In order to get similar performance we would need a technique that involved scanning a small percentage of the data blocks in a table, typically less than 10%. Prelims: As tuples are inserted they go into blocks in a varied sequence. However, since only VACUUM ever reduces the size of the data in a block, we notice that blocks eventually fill up. If they are on the FSM, they are removed. At that point, we could begin keeping track of the number of live tuples in the block. So we could imagine an on-disk data structure that is an array of this struct: blockid - the block in questioncount - the number of live tuples in that blocktxnid - the txnid of the last tuplewritten When we run a SELECT COUNT(*) we would read this data structure and use it to build a bitmap of blocks that still need to be scanned to get the count(*) result. So this would give us a partially cached result for count(*), which then would require scanning only the last few blocks of a table to get at the correct answer. This is beginning to sound very much like the same sort of solution as the one recently proposed for VACUUM: we keep track of the status of each block, then use it as a way of speeding things up. For VACUUM we want to keep a data structure that notes which blocks require vacuums. For count(*) we want to keep track of which blocks do not require vacuums, nearly. There are some blocks in the middle that wouldn't go on either list, true. So, why not combine the two purposes into a single solution? We would be able to manage at least 800 data blocks for each block of this structure, which would mean 1.25 MB per GB of data. For tables with a reasonably non-random block write pattern the key parts of that could reside in memory with reasonable ease even for busy tables. In those cases, it also seems like this technique might lead to the need to scan only a small percentage of heap blocks - so this would give roughly equivalent performance to other RDBMS for SELECT count(*). If its a data structure that we were thinking of maintaining for VACUUM anyway, this improves the value of the cost we would have to pay to maintain the a cache data structure. This is thinking only, I'm not proposing this as a fully worked proposal nor am I pursuing this myself. I can see immediately that there are some major obstacles in the way, like MVCC, autovacuum, overheads etc but it seems worth pointing out a possible angle of attack, since this looks like it might hit two birds with one stone. I'm not aiming to start "open season" on crazy ideas for this; this idea may take months to ruminate into a solution. Best Regards, Simon Riggs
pgsql-hackers by date: