Thread: Slow count(*) again...

Slow count(*) again...

From
Neil Whelchel
Date:
I know that there have been many discussions on the slowness of count(*), even
when an index is involved, because the visibility of the rows has to be
checked. In the past I have seen many suggestions about using triggers and
tables to keep track of counts, and while this works fine in a situation where
you know what the report is going to be ahead of time, it is simply not an
option when an unknown WHERE clause is to be used (dynamically generated).
I ran into a fine example of this when I was searching this mailing list:
"Searching in 856,646 pages took 13.48202 seconds. Site search powered by
PostgreSQL 8.3." Obviously at some point count(*) came into play here, because
the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a
list of pages from search results, and the biggest time killer here is the
count(*) portion. Worse yet, I sometimes have to hit the database with
two SELECT statements: one with OFFSET and LIMIT to get the page of results I
need, and another to get the total number of rows so I can estimate how many
pages of results are available. The point I am driving at here is that since
building a list of pages of results is such a common thing to do, there need
to be some specific high-speed ways to do this in one query. Maybe an
estimate(*) that works like count but gives an answer from the index without
checking visibility? I am sure that this would be good enough to make a page
list; it is really no big deal if it errs on the positive side, maybe the
list of pages has an extra page off the end. I can live with that. What I
can't live with is taking 13 seconds to get a page of results from 850,000
rows in a table.
-Neil-

Re: Slow count(*) again...

From
Scott Marlowe
Date:
On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
> I know that there haven been many discussions on the slowness of count(*) even
> when an index is involved because the visibility of the rows has to be
> checked. In the past I have seen many suggestions about using triggers and
> tables to keep track of counts and while this works fine in a situation where
> you know what the report is going to be ahead of time, this is simply not an
> option when an unknown WHERE clause is to be used (dynamically generated).
> I ran into a fine example of this when I was searching this mailing list,
> "Searching in 856,646 pages took 13.48202 seconds. Site search powered by
> PostgreSQL 8.3." Obviously at some point count(*) came into play here because
> the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a
> list of pages from search results, and the biggest time killer here is the
> count(*) portion, even worse yet, I sometimes have to hit the database with
> two SELECT statements, one with OFFSET and LIMIT to get the page of results I
> need and another to get the amount of total rows so I can estimate how many
> pages of results are available. The point I am driving at here is that since
> building a list of pages of results is such a common thing to do, there need
> to be some specific high speed ways to do this in one query. Maybe an
> estimate(*) that works like count but gives an answer from the index without
> checking visibility? I am sure that this would be good enough to make a page
> list, it is really no big deal if it errors on the positive side, maybe the
> list of pages has an extra page off the end. I can live with that. What I
> can't live with is taking 13 seconds to get a page of results from 850,000
> rows in a table.

99% of the time in situations like this you don't need an exact count, and
assuming analyze has run recently, select reltuples from pg_class for
a given table is more than close enough.  I'm sure wrapping that in a
simple estimated_rows() function would be easy enough to do.
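
A minimal sketch of such a function (the function name and example table are
illustrative only; reltuples is only as fresh as the last VACUUM/ANALYZE):

CREATE OR REPLACE FUNCTION estimated_rows(regclass) RETURNS bigint AS $$
    -- planner's row estimate for the whole table, maintained by VACUUM/ANALYZE
    SELECT reltuples::bigint FROM pg_class WHERE oid = $1;
$$ LANGUAGE sql STABLE;

-- usage: SELECT estimated_rows('my_table');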

Re: Slow count(*) again...

From
Mladen Gogala
Date:
Neil Whelchel wrote:
> I know that there haven been many discussions on the slowness of count(*) even
> when an index is involved because the visibility of the rows has to be
> checked. In the past I have seen many suggestions about using triggers and
> tables to keep track of counts and while this works fine in a situation where
> you know what the report is going to be ahead of time, this is simply not an
> option when an unknown WHERE clause is to be used (dynamically generated).
> I ran into a fine example of this when I was searching this mailing list,
> "Searching in 856,646 pages took 13.48202 seconds. Site search powered by
> PostgreSQL 8.3." Obviously at some point count(*) came into play here because
> the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a
> list of pages from search results, and the biggest time killer here is the
> count(*) portion, even worse yet, I sometimes have to hit the database with
> two SELECT statements, one with OFFSET and LIMIT to get the page of results I
> need and another to get the amount of total rows so I can estimate how many
> pages of results are available. The point I am driving at here is that since
> building a list of pages of results is such a common thing to do, there need
> to be some specific high speed ways to do this in one query. Maybe an
> estimate(*) that works like count but gives an answer from the index without
> checking visibility? I am sure that this would be good enough to make a page
> list, it is really no big deal if it errors on the positive side, maybe the
> list of pages has an extra page off the end. I can live with that. What I
> can't live with is taking 13 seconds to get a page of results from 850,000
> rows in a table.
> -Neil-
>
>
Unfortunately, the problem is in the rather primitive way PostgreSQL
does I/O. It didn't change in 9.0, so there is nothing you could gain by
upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the
sequential scan process> and inspect the file /tmp/pg.out when the query
finishes, you will notice a gazillion read requests, all of them 8192
bytes in size. That means that PostgreSQL is reading the table block by
block, without any merging of the requests. You can alleviate the pain
with OS-level tricks, like specifying the deadline I/O scheduler in
grub.conf and setting read-ahead on the filesystem block devices with
blockdev, but there is nothing more that can be done, short of
rewriting the way PostgreSQL does I/O. There were rumors about
version 9.0 and asynchronous I/O, but that didn't materialize. That is
really strange to me, because PostgreSQL tables are files, or groups of
files if the table size exceeds 1GB. It wouldn't be very hard to try
reading 1MB at a time, and that would speed up the full table scan
significantly.
The problem with single-block I/O is that there is a context switch for each
request and the I/O scheduler has to work hard to merge requests
appropriately, and there is really no need for that: tables are files, and
navigating through files is not a problem, even with much larger blocks.
In another database, whose name I will not mention, there is a parameter,
db_file_multiblock_read_count, which specifies how many blocks will be
read by a single read when doing a full table scan. PostgreSQL is in
dire need of something similar, and it wouldn't even be that hard to
implement.


--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Joe Conway
Date:
On 10/09/2010 06:54 PM, Mladen Gogala wrote:
> In another database, whose name I will not mention, there is a parameter
> db_file_multiblock_read_count which specifies how many blocks will be
> read by a single read when doing a full table scan. PostgreSQL is in
> dire need of something similar and it wouldn't even be that hard to
> implement.

You're correct in that it isn't particularly difficult to implement for
sequential scans. But I have done some testing with aggressive read
ahead, and although it is clearly a big win with a single client, the
benefit was less clear as concurrency was increased.

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support



Re: Slow count(*) again...

From
Mladen Gogala
Date:
Joe Conway wrote:
> On 10/09/2010 06:54 PM, Mladen Gogala wrote:
>
>> In another database, whose name I will not mention, there is a parameter
>> db_file_multiblock_read_count which specifies how many blocks will be
>> read by a single read when doing a full table scan. PostgreSQL is in
>> dire need of something similar and it wouldn't even be that hard to
>> implement.
>>
>
> You're correct in that it isn't particularly difficult to implement for
> sequential scans. But I have done some testing with aggressive read
> ahead, and although it is clearly a big win with a single client, the
> benefit was less clear as concurrency was increased.
>
> Joe
>
>
Well, in my opinion that should be left to the DBA, the same as in the
"other database".  The mythical DBA, the creature that mighty Larry
Ellison himself is on a crusade against, can usually figure out the
right value for the database he or she is in charge of. I humbly
confess to being an Oracle DBA for more than two decades, now branching
into Postgres because my employer is less than enthusiastic about
Oracle, particularly about their pricing.

Modern databases, Postgres included, are quite complex, and companies
need DBA personnel to help fine-tune their applications. I know that good
DBA personnel are quite expensive, but without a competent DBA who knows
the database software well enough, companies can and will suffer from
blunders with performance, downtime, lost data and the like. In a world
where almost every application is written for the web, performance,
uptime and user experience are of critical importance. The
architects of the Postgres database would be well advised to operate under
the assumption that every production database has a competent DBA
keeping an eye on the database.

Every application has its own mix of sequential and index scans; you
cannot possibly test all possible applications.  Aggressive read-ahead
or "multi-block reads" can be a performance problem, and it will
complicate the optimizer, because the optimizer now has a new variable
to account for: the block size, potentially making seq_page_cost even
cheaper and random_page_cost even more expensive, depending on the
blocking. However, a slow sequential scan is, in my humble opinion, the
single biggest performance problem of PostgreSQL databases and
should be improved, the sooner the better. You should, however, count
on the DBA personnel to help with the tuning.
We're the Tinkerbells of the database world. I am 6'4", 240 lbs, no wings.


--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com>
wrote:
> > I know that there haven been many discussions on the slowness of count(*)
> > even when an index is involved because the visibility of the rows has to
> > be checked. In the past I have seen many suggestions about using
> > triggers and tables to keep track of counts and while this works fine in
> > a situation where you know what the report is going to be ahead of time,
> > this is simply not an option when an unknown WHERE clause is to be used
> > (dynamically generated). I ran into a fine example of this when I was
> > searching this mailing list, "Searching in 856,646 pages took 13.48202
> > seconds. Site search powered by PostgreSQL 8.3." Obviously at some point
> > count(*) came into play here because the site made a list of pages (1 2
> > 3 4 5 6 > next). I very commonly make a list of pages from search
> > results, and the biggest time killer here is the count(*) portion, even
> > worse yet, I sometimes have to hit the database with two SELECT
> > statements, one with OFFSET and LIMIT to get the page of results I need
> > and another to get the amount of total rows so I can estimate how many
> > pages of results are available. The point I am driving at here is that
> > since building a list of pages of results is such a common thing to do,
> > there need to be some specific high speed ways to do this in one query.
> > Maybe an estimate(*) that works like count but gives an answer from the
> > index without checking visibility? I am sure that this would be good
> > enough to make a page list, it is really no big deal if it errors on the
> > positive side, maybe the list of pages has an extra page off the end. I
> > can live with that. What I can't live with is taking 13 seconds to get a
> > page of results from 850,000 rows in a table.
>
> 99% of the time in the situations you don't need an exact measure, and
> assuming analyze has run recently, select rel_tuples from pg_class for
> a given table is more than close enough.  I'm sure wrapping that in a
> simple estimated_rows() function would be easy enough to do.

This is a very good approach and it works very well when you are counting the
entire table, but when you have no control over the WHERE clause, it doesn't
help. I.e.: someone puts in a word to look for in a web form.

From my perspective, this issue is the biggest problem there is when using
Postgres to create web pages, and it is so common that I think there
should be a specific way to deal with it so that you don't have to run the
same WHERE clause twice.
I.e.: SELECT count(*) FROM <table> WHERE <clause>; to get the total number of
items to make page navigation links, then:
SELECT <columns> FROM <table> WHERE <clause> LIMIT <items_per_page> OFFSET
<(page_no-1)*items_per_page>; to get the actual page contents.

It's bad enough that count(*) is slow; then you have to do it all over again
to get the results you need! I have not dug into this much yet, but would it
be possible to return the number of rows that a WHERE clause would actually
return if the LIMIT and OFFSET were not applied? I.e.: when a normal query is
executed, the server returns the number of rows alongside the actual row
data. Would it be a big deal to modify this so that it can also return the
number of rows before the LIMIT and OFFSET are applied? This would sure cut
down on the time it takes to run the same WHERE clause twice... I have
considered using a cursor to do this, however this requires a transfer of all
of the rows to the client to get a total count, then setting the cursor to get
the rows that you are interested in. Or is there a way around this that I am
not aware of?
-Neil-
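
One possible way around the client-side transfer, sketched with hypothetical
table and column names: the server still has to execute the whole query, but
MOVE only repositions the cursor, so no rows are shipped to the client until
the final FETCH.

BEGIN;
DECLARE results SCROLL CURSOR FOR
    SELECT id, description FROM items WHERE description LIKE '%widget%';
MOVE FORWARD ALL IN results;   -- the command tag "MOVE n" is the total row count
MOVE ABSOLUTE 0 IN results;    -- rewind (SCROLL makes backward moves legal)
MOVE FORWARD 40 IN results;    -- skip to page 3 with 20 rows per page
FETCH 20 FROM results;         -- the page contents
COMMIT;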


Re: Slow count(*) again...

From
Samuel Gendler
Date:


On Sat, Oct 9, 2010 at 7:44 PM, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
> The architects of Postgres database would be well advised to operate under the assumption that every production database has a competent DBA keeping an eye on the database.

I'd actually go so far as to say that they have already made this
assumption.  The out-of-the-box config needs modification for all but the
most low-volume applications, and postgres really benefits from having some
attention paid to performance.  Not only does tuning the db provide enormous
gains, but it is often possible to dramatically improve query responsiveness
by simply restructuring a query (assuming an aggregating query over a fairly
large table with a few joins thrown in).  My team does not have a competent
DBA (though I've got 15+ years of experience developing on top of various dbs
and certainly don't make overly naive assumptions about how things work), and
the gains that we made, when I finally just sat down and read everything I
could get my hands on about postgres and started reading this list, were
really quite impressive.  I intend to take some of the courses offered by
some of the companies that are active on this list when my schedule allows,
in order to expand my knowledge even further, as a DBA is a luxury we cannot
really afford at the moment.

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/10/2010 11:02 AM, Neil Whelchel wrote:
> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
> wrote:
>>> I know that there haven been many discussions on the slowness of count(*)
>>> even when an index is involved because the visibility of the rows has to
>>> be checked. In the past I have seen many suggestions about using
>>> triggers and tables to keep track of counts and while this works fine in
>>> a situation where you know what the report is going to be ahead of time,
>>> this is simply not an option when an unknown WHERE clause is to be used
>>> (dynamically generated). I ran into a fine example of this when I was
>>> searching this mailing list, "Searching in 856,646 pages took 13.48202
>>> seconds. Site search powered by PostgreSQL 8.3." Obviously at some point
>>> count(*) came into play here because the site made a list of pages (1 2
>>> 3 4 5 6>  next). I very commonly make a list of pages from search
>>> results, and the biggest time killer here is the count(*) portion, even
>>> worse yet, I sometimes have to hit the database with two SELECT
>>> statements, one with OFFSET and LIMIT to get the page of results I need
>>> and another to get the amount of total rows so I can estimate how many
>>> pages of results are available. The point I am driving at here is that
>>> since building a list of pages of results is such a common thing to do,
>>> there need to be some specific high speed ways to do this in one query.
>>> Maybe an estimate(*) that works like count but gives an answer from the
>>> index without checking visibility? I am sure that this would be good
>>> enough to make a page list, it is really no big deal if it errors on the
>>> positive side, maybe the list of pages has an extra page off the end. I
>>> can live with that. What I can't live with is taking 13 seconds to get a
>>> page of results from 850,000 rows in a table.
>>
>> 99% of the time in the situations you don't need an exact measure, and
>> assuming analyze has run recently, select rel_tuples from pg_class for
>> a given table is more than close enough.  I'm sure wrapping that in a
>> simple estimated_rows() function would be easy enough to do.
>
> This is a very good approach and it works very well when you are counting the
> entire table, but when you have no control over the WHERE clause, it doesn't
> help. IE: someone puts in a word to look for in a web form.

For that sort of thing, there isn't much that'll help you except
visibility-aware indexes, covering indexes, etc if/when they're
implemented. Even then, they'd only help when it was a simple
index-driven query with no need to hit the table to recheck any test
conditions, etc.

I guess there could be *some* way to expose the query planner's cost
estimates in a manner useful for result count estimation ... but given
how coarse its stats are and how wildly out the estimates can be, I kind
of doubt it. It's really intended for query planning decisions and more
interested in orders of magnitude, "0, 1, or more than that" measures,
etc, and seems to consider 30% here or there to be pretty insignificant
most of the time.
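
For what it's worth, the planner's row estimate for an arbitrary query can be
scraped out of EXPLAIN from PL/pgSQL. A rough sketch (the function name is
illustrative, and the result is only as good as the statistics behind it):

CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    rec record;
    cnt bigint;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- the first plan line carries the planner's total row estimate
        cnt := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN cnt IS NOT NULL;
    END LOOP;
    RETURN cnt;
END;
$$ LANGUAGE plpgsql STRICT;

-- usage: SELECT count_estimate('SELECT 1 FROM my_table WHERE col > 50');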

> It's bad enough that count(*) is slow, then you have to do it all over again
> to get the results you need! I have not dug into this much yet, but would it
> be possible to return the amount of rows that a WHERE clause would actually
> return if the LIMIT and OFFSET were not applied. IE: When a normal query is
> executed, the server returns the number of rows aside from the actual row
> data. Would it be a big deal to modify this to allow it to return the amount
> of rows before the LIMIT and OFFSET is applied as well?

It'd force the server to fully execute the query. Then again, it sounds
like you're doing that anyway.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/10/2010 9:54 AM, Mladen Gogala wrote:

> Unfortunately, the problem is in the rather primitive way PostgreSQL
> does I/O. It didn't change in 9.0 so there is nothing you could gain by
> upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the
> sequential scan process> and inspect the file /tmp/pg.out when the query
> finishes, you will notice a gazillion of read requests, all of them 8192
> bytes in size. That means that PostgreSQL is reading the table block by
> block, without any merging of the requests.

I'd be really interested in any measurements you've done to determine
the cost of this over doing reads in larger chunks. If they're properly
detailed and thought out, the -hackers list is likely to be interested
as well.

The Linux kernel, at least, does request merging (and splitting, and
merging, and more splitting) along the request path, and I'd personally
expect that most of the cost of 8k requests would be in the increased
number of system calls, buffer copies, etc required. Measurements
demonstrating or contradicting this would be good to see.

It's worth being aware that there are memory costs to doing larger
reads, especially when you have many backends each of which want to
allocate a larger buffer for reading. If you can use a chunk of
shared_buffers as the direct destination for the read that's OK, but
otherwise you're up for (1mb-8kb)*num_backends extra memory use on I/O
buffers that could otherwise be used as shared_buffers or OS cache.
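
(To put a number on that, taking the 1MB read size above as the assumption:
with 100 backends that would be roughly (1MB - 8kB) * 100, or about 99MB, of
extra per-backend read buffers.)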

Async I/O, too, has costs.

 > PostgreSQL is in
> dire need of something similar and it wouldn't even be that hard to
> implement.

I'd really like to see both those assertions backed with data or patches ;-)

Personally, I know just enough about how PG's I/O path works to suspect
that "not that hard to implement" is probably a little ...
over-optimistic. Sure, it's not that hard to implement in a new program
with no wired-in architectural and design choices; that doesn't mean
it's easy to retrofit onto existing code, especially a bunch of
co-operating processes with their own buffer management.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Saturday 09 October 2010 23:56:15 Craig Ringer wrote:
> On 10/10/2010 11:02 AM, Neil Whelchel wrote:
> > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
> >> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
> >
> > wrote:
> >>> I know that there haven been many discussions on the slowness of
> >>> count(*) even when an index is involved because the visibility of the
> >>> rows has to be checked. In the past I have seen many suggestions about
> >>> using triggers and tables to keep track of counts and while this works
> >>> fine in a situation where you know what the report is going to be
> >>> ahead of time, this is simply not an option when an unknown WHERE
> >>> clause is to be used (dynamically generated). I ran into a fine
> >>> example of this when I was searching this mailing list, "Searching in
> >>> 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL
> >>> 8.3." Obviously at some point count(*) came into play here because the
> >>> site made a list of pages (1 2 3 4 5 6>  next). I very commonly make a
> >>> list of pages from search results, and the biggest time killer here is
> >>> the count(*) portion, even worse yet, I sometimes have to hit the
> >>> database with two SELECT statements, one with OFFSET and LIMIT to get
> >>> the page of results I need and another to get the amount of total rows
> >>> so I can estimate how many pages of results are available. The point I
> >>> am driving at here is that since building a list of pages of results
> >>> is such a common thing to do, there need to be some specific high
> >>> speed ways to do this in one query. Maybe an estimate(*) that works
> >>> like count but gives an answer from the index without checking
> >>> visibility? I am sure that this would be good enough to make a page
> >>> list, it is really no big deal if it errors on the positive side,
> >>> maybe the list of pages has an extra page off the end. I can live with
> >>> that. What I can't live with is taking 13 seconds to get a page of
> >>> results from 850,000 rows in a table.
> >>
> >> 99% of the time in the situations you don't need an exact measure, and
> >> assuming analyze has run recently, select rel_tuples from pg_class for
> >> a given table is more than close enough.  I'm sure wrapping that in a
> >> simple estimated_rows() function would be easy enough to do.
> >
> > This is a very good approach and it works very well when you are counting
> > the entire table, but when you have no control over the WHERE clause, it
> > doesn't help. IE: someone puts in a word to look for in a web form.
>
> For that sort of thing, there isn't much that'll help you except
> visibility-aware indexes, covering indexes, etc if/when they're
> implemented. Even then, they'd only help when it was a simple
> index-driven query with no need to hit the table to recheck any test
> conditions, etc.

Good point, maybe this is turning more into a discussion of how to generate a
list of pages of results and one page of results with one query so we don't
have to do the same painfully slow query twice to do a very common task.

On the other hand, I copied a table out of one of my production servers that
has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp,
text). The first numeric column has numbers evenly spread between 0 and 100
and it is indexed. I put the table in a pair of database servers both running
on the same physical hardware. One server is Postgres, the other is a popular
server (I am not mentioning names here). On Postgres, SELECT count(*) FROM
table WHERE column > 50; takes about 8 seconds to run. The other database
server took less than one second (about 25 ms), as it is using the index (I
assume) to come up with the results. It is true that this is not a fair test
because both servers were tested with their default settings, and the defaults
for Postgres are much more conservative; however, I don't think that any
amount of settings tweaking will bring them even into the same ball park.
There has been discussion about the other server returning an incorrect count
because all of the indexed rows may not be live at the time. This is not a
problem for the intended use; that is why I suggested another function like
estimate(*). Its name suggests that the result will be close, not 100%
correct, which is plenty good enough for generating a list of results pages in
most cases. I am faced with a very serious problem here. If the query to make
a list of pages takes, say, 6 seconds and it takes another 6 seconds to
generate a page of results, the customer is waiting 12 seconds. This is not
going to work. If count made a quick estimate, say in less than a second, and
it took 6 seconds to come up with the actual results, I could live with that.
Or if coming up with the window of results via OFFSET and LIMIT also returned
the total number of rows that would have matched the query, I would still have
everything I need to render the page in a reasonable time. I really think that
this needs to be addressed somewhere. It's not like I am the only one that
does this; you see it nearly everywhere a long list of results is (expected to
be) returned on a web site. Among the people I work with, this seems to be the
most mentioned reason they claim they don't use Postgres for their projects.

It would be nice to see how the server comes up with the search results and
list of links to pages of results for this mailing list.
(http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r) I am
guessing that it probably uses the count and query method I am talking about.

>
> I guess there could be *some* way to expose the query planner's cost
> estimates in a manner useful for result count estimation ... but given
> how coarse its stats are and how wildly out the estimates can be, I kind
> of doubt it. It's really intended for query planning decisions and more
> interested in orders of magnitude, "0, 1, or more than that" measures,
> etc, and seems to consider 30% here or there to be pretty insignificant
> most of the time.
>
> > It's bad enough that count(*) is slow, then you have to do it all over
> > again to get the results you need! I have not dug into this much yet,
> > but would it be possible to return the amount of rows that a WHERE
> > clause would actually return if the LIMIT and OFFSET were not applied.
> > IE: When a normal query is executed, the server returns the number of
> > rows aside from the actual row data. Would it be a big deal to modify
> > this to allow it to return the amount of rows before the LIMIT and
> > OFFSET is applied as well?
>
> It'd force the server to fully execute the query. Then again, it sounds
> like you're doing that anyway.

Re: Slow count(*) again...

From
Віталій Тимчишин
Date:


2010/10/10 Neil Whelchel <neil.whelchel@gmail.com>
On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com>
wrote:
> > I know that there haven been many discussions on the slowness of count(*)
> > even when an index is involved because the visibility of the rows has to
> > be checked. In the past I have seen many suggestions about using
> > triggers and tables to keep track of counts and while this works fine in
> > a situation where you know what the report is going to be ahead of time,
> > this is simply not an option when an unknown WHERE clause is to be used
> > (dynamically generated). I ran into a fine example of this when I was
> > searching this mailing list, "Searching in 856,646 pages took 13.48202
> > seconds. Site search powered by PostgreSQL 8.3." Obviously at some point
> > count(*) came into play here because the site made a list of pages (1 2
> > 3 4 5 6 > next). I very commonly make a list of pages from search
> > results, and the biggest time killer here is the count(*) portion, even
> > worse yet, I sometimes have to hit the database with two SELECT
> > statements, one with OFFSET and LIMIT to get the page of results I need
> > and another to get the amount of total rows so I can estimate how many
> > pages of results are available. The point I am driving at here is that
> > since building a list of pages of results is such a common thing to do,
> > there need to be some specific high speed ways to do this in one query.
> > Maybe an estimate(*) that works like count but gives an answer from the
> > index without checking visibility? I am sure that this would be good
> > enough to make a page list, it is really no big deal if it errors on the
> > positive side, maybe the list of pages has an extra page off the end. I
> > can live with that. What I can't live with is taking 13 seconds to get a
> > page of results from 850,000 rows in a table.
>
> 99% of the time in the situations you don't need an exact measure, and
> assuming analyze has run recently, select rel_tuples from pg_class for
> a given table is more than close enough.  I'm sure wrapping that in a
> simple estimated_rows() function would be easy enough to do.

This is a very good approach and it works very well when you are counting the
entire table, but when you have no control over the WHERE clause, it doesn't
help. IE: someone puts in a word to look for in a web form.

From my perspective, this issue is the biggest problem there is when using
Postgres to create web pages, and it is so commonly used, I think that there
should be a specific way to deal with it so that you don't have to run the
same WHERE clause twice.
IE: SELECT count(*) FROM <table> WHERE <clause>; to get the total amount of
items to make page navigation links, then:
SELECT <columns> FROM table WHERE <clause> LIMIT <items_per_page> OFFSET
<(page_no-1)*items_per_page>; to get the actual page contents.

How about
select * from (select *, count(*) over () as total_count from <table> where <clause>) a LIMIT <items_per_page> OFFSET <(page_no-1)*items_per_page>;
It will return a total_count column with the same value in each row. You may have problems if no rows are returned (e.g. the page number is too high).
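
For example, with a hypothetical items table, 20 rows per page, and page 3
(window functions need PostgreSQL 8.4 or later):

SELECT *
  FROM (SELECT i.*, count(*) OVER () AS total_count
          FROM items i
         WHERE description LIKE '%widget%') a
 ORDER BY id
 LIMIT 20 OFFSET 40;
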
--
Best regards,
 Vitalii Tymchyshyn

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/10/2010 6:29 PM, Neil Whelchel wrote:
> On the other hand, I copied a table out of one of my production servers that
> has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp,
> text). The first numeric column has numbers evenly spread between 0 and 100
> and it is indexed. I put the table in a pair of database servers both running
> on the same physical hardware. One server is Postgres, the other is a popular
> server (I am not mentioning names here).

Please do. Your comment is pretty meaningless otherwise.

If you're talking about MySQL: Were you using InnoDB or MyISAM table
storage? Of course it's fast with MyISAM, it relies on locks to do
updates and has bugger all capability for write concurrency, or to
permit readers while writing is going on.

If you're using InnoDB, then I'd like to know how they've managed that.

If you're talking about some *other* database, please name it and
provide any useful details, because the hand waving is not helpful.

 > I don't think that any amount of settings
> tweaking will bring them even in the same ball park.

If you are, in fact, comparing MySQL+MyISAM and PostgreSQL, then you're
quite right. Pg will never have such a fast count() as MyISAM does or
the same insanely fast read performance, and MyISAM will never be as
reliable, robust or concurrency-friendly as Pg is. Take your pick, you
can't have both.

> There has been discussion
> about the other server returning an incorrect count because all of the indexed
> rows may not be live at the time. This is not a problem for the intended use,
> that is why I suggested another function like estimate(*). It's name suggests
> that the result will be close, not 100% correct, which is plenty good enough
> for generating a list of results pages in most cases.

Do you have any practical suggestions for generating such an estimate,
though? I find it hard to think of any way the server can do that
doesn't involve executing the query. The table stats are WAY too general
and a bit hit-and-miss, and there isn't really any other way to do it.

If all you want is a way to retrieve both a subset of results AND a
count of how many results would've been generated, it sounds like all
you really need is a way to get the total number of results returned by
a cursor query, which isn't a big engineering challenge. I expect that
in current Pg versions a trivial PL/PgSQL function could be used to
slurp and discard unwanted results, but a better in-server option to
count the results from a cursor query would certainly be nice.
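
A minimal sketch of that kind of slurp-and-discard helper, assuming the caller
passes an already-open cursor (the function name is illustrative):

CREATE OR REPLACE FUNCTION cursor_row_count(cur refcursor) RETURNS bigint AS $$
DECLARE
    rec record;
    n   bigint := 0;
BEGIN
    LOOP
        FETCH cur INTO rec;   -- rows are consumed inside the server, never sent to the client
        EXIT WHEN NOT FOUND;
        n := n + 1;
    END LOOP;
    RETURN n;                 -- caller can rewind with MOVE ABSOLUTE 0 if the cursor is SCROLL
END;
$$ LANGUAGE plpgsql;

-- usage, inside a transaction:
--   DECLARE c SCROLL CURSOR FOR SELECT ... WHERE ...;
--   SELECT cursor_row_count('c');
--   MOVE ABSOLUTE 0 IN c;  FETCH 20 FROM c;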

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Slow count(*) again...

From
Reid Thompson
Date:
  On 10/10/2010 6:29 AM, Neil Whelchel wrote:
> On Saturday 09 October 2010 23:56:15 Craig Ringer wrote:
>> On 10/10/2010 11:02 AM, Neil Whelchel wrote:
>>> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
>>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
>>> wrote:
>>>>> I know that there haven been many discussions on the slowness of
>>>>> count(*) even when an index is involved because the visibility of the
>>>>> rows has to be checked. In the past I have seen many suggestions about
>>>>> using triggers and tables to keep track of counts and while this works
>>>>> fine in a situation where you know what the report is going to be
>>>>> ahead of time, this is simply not an option when an unknown WHERE
>>>>> clause is to be used (dynamically generated). I ran into a fine
>>>>> example of this when I was searching this mailing list, "Searching in
>>>>> 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL
>>>>> 8.3." Obviously at some point count(*) came into play here because the
>>>>> site made a list of pages (1 2 3 4 5 6>   next). I very commonly make a
>>>>> list of pages from search results, and the biggest time killer here is
>>>>> the count(*) portion, even worse yet, I sometimes have to hit the
>>>>> database with two SELECT statements, one with OFFSET and LIMIT to get
>>>>> the page of results I need and another to get the amount of total rows
>>>>> so I can estimate how many pages of results are available. The point I
>>>>> am driving at here is that since building a list of pages of results
>>>>> is such a common thing to do, there need to be some specific high
>>>>> speed ways to do this in one query. Maybe an estimate(*) that works
>>>>> like count but gives an answer from the index without checking
>>>>> visibility? I am sure that this would be good enough to make a page
>>>>> list, it is really no big deal if it errors on the positive side,
>>>>> maybe the list of pages has an extra page off the end. I can live with
>>>>> that. What I can't live with is taking 13 seconds to get a page of
>>>>> results from 850,000 rows in a table.
> Good point, maybe this is turning more into a discussion of how to generate a
> list of pages of results and one page of results with one query so we don't
> have to do the same painfully slow query twice to do a very common task.
>
> On the other hand, I copied a table out of one of my production servers that
> has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp,
> text). The first numeric column has numbers evenly spread between 0 and 100
> and it is indexed. I put the table in a pair of database servers both running
> on the same physical hardware. One server is Postgres, the other is a popular
> server (I am not mentioning names here). on Postgres: SELECT count(*) FROM
> table where column>50; takes about 8 seconds to run. The other database server
> took less than one second (about 25 ms) as it is using the index (I assume) to
> come up with the results. It is true that this is not a fair test because both
> servers were tested with their default settings, and the defaults for Postgres
> are much more conservative, however, I don't think that any amount of settings
> tweaking will bring them even in the same ball park. There has been discussion
> about the other server returning an incorrect count because all of the indexed
> rows may not be live at the time. This is not a problem for the intended use,
> that is why I suggested another function like estimate(*). It's name suggests
> that the result will be close, not 100% correct, which is plenty good enough
> for generating a list of results pages in most cases. I am faced with a very
> serious problem here. If the query to make a list of pages takes say 6 seconds
> and it takes another 6 seconds to generate a page of results, the customer is
> waiting 12 seconds. This is not going to work. If count made a quick estimate,
> say less than a second, and it took 6 seconds to come up with the actual
> results, I could live with that. Or if coming up with the window of results
> via (OFFSET and LIMIT) and returned the total number of rows that would have
> matched the query, then I would still have everything I need to render the
> page in a reasonable time. I really think that this needs to be addressed
> somewhere. It's not like I am the only one that does this. You see it nearly
> everywhere a long list of results is (expected to be) returned in a web site.
> Among the people I work with, this seems to be the most mentioned reason that
> they claim that they don't use Postgres for their projects.
>
> It would be nice to see how the server comes up with the search results and
> list of links to pages of results for this mailing list.
> (http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r) I am
> guessing that it probably uses the count and query method I am talking about.
>
>> I guess there could be *some* way to expose the query planner's cost
>> estimates in a manner useful for result count estimation ... but given
>> how coarse its stats are and how wildly out the estimates can be, I kind
>> of doubt it. It's really intended for query planning decisions and more
>> interested in orders of magnitude, "0, 1, or more than that" measures,
>> etc, and seems to consider 30% here or there to be pretty insignificant
>> most of the time.
>>
>>> It's bad enough that count(*) is slow, then you have to do it all over
>>> again to get the results you need! I have not dug into this much yet,
>>> but would it be possible to return the amount of rows that a WHERE
>>> clause would actually return if the LIMIT and OFFSET were not applied.
>>> IE: When a normal query is executed, the server returns the number of
>>> rows aside from the actual row data. Would it be a big deal to modify
>>> this to allow it to return the amount of rows before the LIMIT and
>>> OFFSET is applied as well?
>> It'd force the server to fully execute the query. Then again, it sounds
>> like you're doing that anyway.
How big is your DB?
How fast is your disk access?
Any chance disks/RAM can be addressed?

My disk access is pitiful...
first run, 2.3 million rows.. 0m35.38s, subsequent runs.. real    0m2.55s

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2340704
(1 row)


real    0m35.38s
user    0m0.25s
sys     0m0.03s

subsequent runs.... (count changes due to inserts.)

rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.70s
user    0m0.27s
sys     0m0.02s
rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.55s
user    0m0.26s
sys     0m0.02s
rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
   count
---------
  2363707
(1 row)


real    0m2.50s
user    0m0.26s
sys     0m0.02s

reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table'));
  pg_size_pretty
----------------
  1890 MB
(1 row)


Re: Slow count(*) again...

From
Mladen Gogala
Date:
Craig Ringer wrote:
> On 10/10/2010 9:54 AM, Mladen Gogala wrote:
>
>
>> Unfortunately, the problem is in the rather primitive way PostgreSQL
>> does I/O. It didn't change in 9.0 so there is nothing you could gain by
>> upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the
>> sequential scan process> and inspect the file /tmp/pg.out when the query
>> finishes, you will notice a gazillion of read requests, all of them 8192
>> bytes in size. That means that PostgreSQL is reading the table block by
>> block, without any merging of the requests.
>>
>
> I'd be really interested in any measurements you've done to determine
> the cost of this over doing reads in larger chunks. If they're properly
> detailed and thought out, the -hackers list is likely to be interested
> as well.
>
I can provide measurements, but from  Oracle RDBMS. Postgres doesn't
allow tuning of that aspect, so no measurement can be done. Would the
numbers from Oracle RDBMS be acceptable?


> The Linux kernel, at least, does request merging (and splitting, and
> merging, and more splitting) along the request path, and I'd personally
> expect that most of the cost of 8k requests would be in the increased
> number of system calls, buffer copies, etc required. Measurements
> demonstrating or contradicting this would be good to see.
>

Even the cost of hundreds of thousands of context switches is far from
negligible. What kind of measurements do you expect me to do with a
database which doesn't support tweaking of that aspect of its operation?

> It's worth being aware that there are memory costs to doing larger
> reads, especially when you have many backends each of which want to
> allocate a larger buffer for reading.

Oh, it's not only the larger memory: the buffer management would have to be
changed too, to prevent a process doing a sequential scan from inundating
the shared buffers. Alternatively, the blocks would have to be read into
private memory and immediately thrown away after that. However,
the experience with Oracle tells me that this is well worth it. Here are
the numbers:

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options

SQL> show parameter db_file_multi

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_file_multiblock_read_count        integer     16
SQL> alter session set db_file_multiblock_read_count=1;

Session altered.
SQL> select count(*) from ni_occurrence;

  COUNT(*)
----------
 402062638

Elapsed: 00:08:20.88
SQL> alter session set db_file_multiblock_read_count=128;

Session altered.

Elapsed: 00:00:00.50
SQL>  select count(*) from ni_occurrence;

  COUNT(*)
----------
 402062638

Elapsed: 00:02:17.58


In other words, when I batched the sequential scan to do 128-block I/O,
it was 4 times faster than when I did single-block I/O.
Does that provide enough evidence and, if not, why not?


> If you can use a chunk of
> shared_buffers as the direct destination for the read that's OK, but
> otherwise you're up for (1mb-8kb)*num_backends extra memory use on I/O
> buffers that could otherwise be used as shared_buffers or OS cache.
>
> Async I/O, too, has costs.
>

There is a common platitude that says that there is no such thing as a
free lunch. However, both Oracle RDBMS and IBM DB2 use asynchronous I/O,
probably because they're unaware of the danger. Let me now give you a
full table scan of a much smaller table located in a Postgres database:

news=> select count(*) from internet_web_sites;
  count
---------
 1290133
(1 row)

Time: 12838.958 ms


Oracle counts 400 million records in 2 minutes and Postgres 9.0.1 takes
12.8 seconds to count 1.2 million records?  Do you see the disparity?

Both databases, Oracle and Postgres, are utilizing the same 3Par SAN
device; the machines housing both databases are comparable 64-bit HP
Linux machines, both running the 64-bit version of Red Hat 5.5. The
respective table sizes are here:

SQL> select bytes/1048576 as MB from user_segments
  2  where segment_name='NI_OCCURRENCE';

        MB
----------
     35329

news=> select pg_size_pretty(pg_table_size('moreover.internet_web_sites'));
 pg_size_pretty
----------------
 216 MB
(1 row)

So, I really pushed Oracle much harder than I pushed Postgres.

>  > PostgreSQL is in
>
>> dire need of something similar and it wouldn't even be that hard to
>> implement.
>>
>
> I'd really like to see both those assertions backed with data or patches ;-)
>

With a database that doesn't allow tuning of that aspect, it's a
self-defeating proposition. However, I did my best to give you the numbers.

> Personally, I know just enough about how PG's I/O path works to suspect
> that "not that hard to implement" is probably a little ...
> over-optimistic. Sure, it's not that hard to implement in a new program
> with no wired-in architectural and design choices; that doesn't mean
> it's easy to retrofit onto existing code, especially a bunch of
> co-operating processes with their own buffer management.
>
>
That may be so, but the slow sequential scan is still the largest single
performance problem of PostgreSQL. The frequency with which that topic
appears on the mailing lists should serve as good evidence for that. I
did my best to prove my case.  Again, requiring "hard numbers" when
using a database which doesn't allow tweaking of the I/O size is a
self-defeating proposition. Other databases, like DB2 and Oracle, both
allow tweaking of that aspect of their operation, Oracle even on a
per-session basis. If you still claim that it wouldn't make a difference,
the onus to prove it is on you.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Reid Thompson
Date:
On 10/10/2010 11:02 AM, Reid Thompson wrote:
>>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
>>>>
>> On the other hand, I copied a table out of one of my production servers that
>> has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp,
>> text). The first numeric column has numbers evenly spread between 0 and 100
>> and it is indexed. I put the table in a pair of database servers both running
>> on the same physical hardware. One server is Postgres, the other is a popular
>> server (I am not mentioning names here). on Postgres: SELECT count(*) FROM
>> table where column>50; takes about 8 seconds to run. The other database server
>> took less than one second (about 25 ms) as it is using the index (I assume) to
>> come up with the results. It is true that this is not a fair test because both
>> servers were tested with their default settings, and the defaults for Postgres
>> are much more conservative, however, I don't think that any amount of settings
>> tweaking will bring them even in the same ball park. There has been discussion
>> about the other server returning an incorrect count because all of the indexed
>> rows may not be live at the time. This is not a problem for the intended use,
>> that is why I suggested another function like estimate(*). It's name suggests
>> that the result will be close, not 100% correct, which is plenty good enough
>> for generating a list of results pages in most cases. I am faced with a very
>> serious problem here. If the query to make a list of pages takes say 6 seconds
>> and it takes another 6 seconds to generate a page of results, the customer is
>> waiting 12 seconds. This is not going to work. If count made a quick estimate,
>> say less than a second, and it took 6 seconds to come up with the actual
>> results, I could live with that. Or if coming up with the window of results
>> via (OFFSET and LIMIT) and returned the total number of rows that would have
>> matched the query, then I would still have everything I need to render the
>> page in a reasonable time. I really think that this needs to be addressed
>> somewhere. It's not like I am the only one that does this. You see it nearly
>> everywhere a long list of results is (expected to be) returned in a web site.
>> Among the people I work with, this seems to be the most mentioned reason that
>> they claim that they don't use Postgres for their projects.
>
> How big is your DB?
> How fast is your disk access?
> Any chance disks/RAM can be addressed?
>
> My disk access is pitiful...
> first run, 2.3 million rows.. 0m35.38s, subsequent runs.. real    0m2.55s
>
> rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
>   count
> ---------
>  2340704
> (1 row)
>
>
> real    0m35.38s
> user    0m0.25s
> sys     0m0.03s
>
> subsequent runs.... (count changes due to inserts.)
>
> rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
>   count
> ---------
>  2363707
> (1 row)
>
>
> real    0m2.70s
> user    0m0.27s
> sys     0m0.02s
> rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
>   count
> ---------
>  2363707
> (1 row)
>
>
> real    0m2.55s
> user    0m0.26s
> sys     0m0.02s
> rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting
>   count
> ---------
>  2363707
> (1 row)
>
>
> real    0m2.50s
> user    0m0.26s
> sys     0m0.02s
>
> reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table'));
>  pg_size_pretty
> ----------------
>  1890 MB
> (1 row)
>
>
forgot to note, my table schema is significantly larger.

rthompso@hw-prod-repdb1> time psql -c "\d my_production_table_201010" reporting
                                               Table "public.my_production_table_201010"
            Column            |            Type             |                           Modifiers

-----------------------------+-----------------------------+----------------------------------------------------------------
                              | integer                     | not null default
nextval('my_production_table_parent_id_seq'::regclass)
                              | character varying(20)       |
                              | character(1)                |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(20)       |
                              | character varying(5)        |
                              | character varying(5)        |
                              | date                        |
                              | character(1)                |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(2)        |
                              | character varying(10)       |
                              | character varying(10)       |
                              | character varying(32)       |
                              | character varying(7)        |
                              | character varying(10)       |
                              | character varying(2)        |
                              | character varying(9)        |
                              | character varying(9)        |
                              | character varying(9)        |
                              | character varying(10)       |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(20)       |
                              | character varying(5)        |
                              | character varying(5)        |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(2)        |
                              | character varying(10)       |
                              | character varying(10)       |
                              | character varying(10)       |
                              | character varying(10)       |
                              | integer                     |
                              | character varying(2)        |
                              | character varying(32)       |
                              | character varying(32)       |
                              | integer                     |
                              | integer                     |
                              | text                        |
                              | character varying(3)        |
                              | date                        |
                              | date                        |
                              | date                        |
                              | integer                     |
                              | integer                     |
                              | integer                     |
                              | integer                     |
                              | character varying(6)        |
                              | character varying(10)       |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(32)       |
                              | character varying(10)       |
                              | character varying(6)        |
                              | character varying(8)        |
                              | boolean                     |
                              | character(1)                |
                              | date                        |
                              | integer                     |
                              | date                        |
                              | character varying(11)       |
                              | character varying(4)        |
                              | character(1)                |
                              | date                        |
                              | character varying(5)        |
                              | character varying(20)       |
                              | date                        |
                              | character(1)                |
                              | character(1)                |
                              | character varying(2)        |
                              | text                        |
                              | integer                     |
                              | integer                     |
                              | timestamp without time zone | default now()
                              | timestamp without time zone |
                              | character varying(64)       |
                              | character varying(64)       |
                              | character varying(64)       |
Indexes:
     "my_production_table_201010_pkey" PRIMARY KEY, btree (id)
     "my_production_table_201010_date_idx" btree (xxxxdate), tablespace "indexspace"
     "my_production_table_201010_epatient_idx" btree (storeid, xxxxxxxxxxxxx), tablespace "indexspace"
     "my_production_table_201010_medicationname_idx" btree (xxxxxxxxxxxxxx), tablespace "indexspace"
     "my_production_table_201010_ndc_idx" btree (xxx), tablespace "indexspace"
Check constraints:
     "my_production_table_201010_filldate_check" CHECK (xxxxdate >= '2010-10-01'::date AND xxxxdate <
'2010-11-01'::date)
Foreign-key constraints:
     "my_production_table_201010_pkgfileid_fkey" FOREIGN KEY (pkgfileid) REFERENCES my_production_tablefiles(id)
Inherits: my_production_table_parent



Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Sunday 10 October 2010 05:02:03 Віталій Тимчишин wrote:
> 2010/10/10 Neil Whelchel <neil.whelchel@gmail.com>
>
> > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
> > > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com>
> >
> > wrote:
> > > > I know that there haven been many discussions on the slowness of
> >
> > count(*)
> >
> > > > even when an index is involved because the visibility of the rows has
> >
> > to
> >
> > > > be checked. In the past I have seen many suggestions about using
> > > > triggers and tables to keep track of counts and while this works fine
> >
> > in
> >
> > > > a situation where you know what the report is going to be ahead of
> >
> > time,
> >
> > > > this is simply not an option when an unknown WHERE clause is to be
> > > > used (dynamically generated). I ran into a fine example of this when
> > > > I was searching this mailing list, "Searching in 856,646 pages took
> > > > 13.48202 seconds. Site search powered by PostgreSQL 8.3." Obviously
> > > > at some
> >
> > point
> >
> > > > count(*) came into play here because the site made a list of pages (1
> > > > 2 3 4 5 6 > next). I very commonly make a list of pages from search
> > > > results, and the biggest time killer here is the count(*) portion,
> > > > even worse yet, I sometimes have to hit the database with two SELECT
> > > > statements, one with OFFSET and LIMIT to get the page of results I
> > > > need and another to get the amount of total rows so I can estimate
> > > > how many pages of results are available. The point I am driving at
> > > > here is that since building a list of pages of results is such a
> > > > common thing to do, there need to be some specific high speed ways
> > > > to do this in one query. Maybe an estimate(*) that works like count
> > > > but gives an answer from the index without checking visibility? I am
> > > > sure that this would be good enough to make a page list, it is
> > > > really no big deal if it errors on
> >
> > the
> >
> > > > positive side, maybe the list of pages has an extra page off the end.
> > > > I can live with that. What I can't live with is taking 13 seconds to
> > > > get
> >
> > a
> >
> > > > page of results from 850,000 rows in a table.
> > >
> > > 99% of the time, in situations like this, you don't need an exact
> > > measure, and assuming analyze has run recently, select reltuples from
> > > pg_class for a given table is more than close enough.  I'm sure wrapping
> > > that in a simple estimated_rows() function would be easy enough to do.
> >
> > This is a very good approach and it works very well when you are counting
> > the
> > entire table, but when you have no control over the WHERE clause, it
> > doesn't
> > help. IE: someone puts in a word to look for in a web form.
> >
> > From my perspective, this issue is the biggest problem there is when
> > using Postgres to create web pages, and it is so commonly used, I think
> > that there
> > should be a specific way to deal with it so that you don't have to run
> > the same WHERE clause twice.
> > IE: SELECT count(*) FROM <table> WHERE <clause>; to get the total amount
> > of items to make page navigation links, then:
> > SELECT <columns> FROM table WHERE <clause> LIMIT <items_per_page> OFFSET
> > <(page_no-1)*items_per_page>; to get the actual page contents.
> >
> > How about
>
> select * from (select *, count(*) over () as total_count from <table> where
> <clause>) a LIMIT <items_per_page> OFFSET
> <(page_no-1)*items_per_page>
> It will return a total_count column with an equal value in each row. You may
> have problems if no rows are returned (e.g. page num is too high).

I have done this before, but the speedup over the two hits to the database
that I mentioned above is tiny, just a few ms. It seems to end up doing about
the same thing on the database end. The reason that I don't commonly do this
is what you said about not getting a count result if you run off the end.
-Neil-
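
For readers comparing the two approaches above, a minimal sketch, assuming a
hypothetical items table and an illustrative filter (the names are not from
the thread):

-- Two-query pattern: one scan for the total, a second for the page.
SELECT count(*) FROM items WHERE body LIKE '%search%';
SELECT * FROM items WHERE body LIKE '%search%'
ORDER BY id LIMIT 50 OFFSET 100;

-- Single-query pattern with a window function: every returned row carries
-- the total match count, but if OFFSET runs past the last matching row,
-- no rows (and therefore no total) come back at all.
SELECT *, count(*) OVER () AS total_count
FROM items
WHERE body LIKE '%search%'
ORDER BY id
LIMIT 50 OFFSET 100;

Either way the WHERE clause has to visit every matching row to produce the
count, which is why the timings end up so close.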

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/11/2010 01:14 AM, Mladen Gogala wrote:

> I can provide measurements, but from Oracle RDBMS. Postgres doesn't
> allow tuning of that aspect, so no measurement can be done. Would the
> numbers from Oracle RDBMS be acceptable?

Well, they'd tell me a lot about Oracle's performance as I/O chunk size
scales, but almost nothing about the cost of small I/O operations vs
larger ones in general.

Typically dedicated test programs that simulate the database read
patterns would be used for this sort of thing. I'd be surprised if
nobody on -hackers has already done suitable testing; I was mostly
asking because I was interested in how you were backing your assertions.

PostgreSQL isn't Oracle; their design is in many ways very different.
Most importantly, Oracle uses a redo log, where PostgreSQL stores old
rows with visibility information directly in the tables. It is possible
that a larger proportion of Oracle's I/O costs are fixed per-block
overheads rather than per-byte costs, so it seeks to batch requests into
larger chunks. Of course, it's also possible that 8k chunk I/O is just
universally expensive and is something Pg should avoid, too, but we
can't know that without
dedicated testing, which I at least haven't done. I don't follow
-hackers closely, and wouldn't have seen discussion about testing done
there. The archives are likely to contain useful discussions.

Then again, IIRC Pg's page size is also its I/O size, so you could
actually get larger I/O chunking by rebuilding Pg with larger pages.
Having never had the need, I haven't examined the effect of page
size changes on I/O performance.

>> The Linux kernel, at least, does request merging (and splitting, and
>> merging, and more splitting) along the request path, and I'd
>> personally expect that most of the cost of 8k requests would be in the
>> increased number of system calls, buffer copies, etc required.
>> Measurements demonstrating or contradicting this would be good to see.
>
> Even the cost of hundreds of thousands of context switches is far from
> negligible. What kind of measurements do you expect me to do with the
> database which doesn't support tweaking of that aspect of its operation?

Test programs, or references to testing done by others that demonstrates
these costs in isolation. Of course, they still wouldn't show what gain
Pg might obtain (nothing except hacking on Pg's sources really will) but
they'd help measure the costs of doing I/O that way.

I suspect you're right that large I/O chunks would be desirable and a
potential performance improvement. What I'd like to know is *how*
*much*, or at least how much the current approach costs in pure
overheads like context switches and scheduler delays.

> Does that provide enough of an evidence and, if not, why not?

It shows that it helps Oracle a lot ;-)

Without isolating how much of that is raw costs of the block I/O and how
much is costs internal to Oracle, it's still hard to have much idea how
much it'd benefit Pg to take a similar approach.

I'm sure the folks on -hackers have been over this and know a whole lot
more about it than I do, though.

> Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes
> 12.8 seconds to count 1.2 million records? Do you see the disparity?

Sure. What I don't know is how much of that is due to block sizes. There
are all sorts of areas where Oracle could be gaining.

> It may be so, but the slow sequential scan is still the largest single
> performance problem of PostgreSQL. The frequency with which that topic
> appears on the mailing lists should serve as good evidence for that.

I'm certainly not arguing that it could use improvement; it's clearly
hurting some users. I just don't know if I/O chunking is the answer - I
suspect that if it were, then it would've become a priority for one or
more people working on Pg much sooner.

It's quite likely that it's one of those things where it makes a huge
difference for Oracle because Oracle has managed to optimize out most of
the other bigger costs. If Pg still has other areas that make I/O more
expensive per-byte (say, visibility checks) and low fixed per-block
costs, then there'd be little point in chunking I/O. My understanding is
that that's pretty much how things stand at the moment, but I'd love
verification from someone who's done the testing.

>If you still claim that it wouldn't make the difference,
> the onus to prove it is on you.

I didn't mean to claim that it would make no difference. If I sounded
like it, sorry.

I just want to know how _much_ , or more accurately how great the
overheads of the current approach in Pg are vs doing much larger reads.

--
Craig Ringer


Re: Slow count(*) again...

From
Jon Nelson
Date:
On Sun, Oct 10, 2010 at 12:14 PM, Mladen Gogala
<mladen.gogala@vmsinfo.com> wrote:
>
>
>
> In other words, when I batched the sequential scan to do 128 blocks I/O, it
> was 4 times faster than when I did the single block I/O.
> Does that provide enough of an evidence and, if not, why not?


These numbers tell us nothing because, unless you dropped the caches
between runs, at least part of some runs was very probably cached.

--
Jon

Re: Slow count(*) again...

From
Joshua Tolley
Date:
On Mon, Oct 11, 2010 at 06:41:16AM +0800, Craig Ringer wrote:
> On 10/11/2010 01:14 AM, Mladen Gogala wrote:
>
>> I can provide measurements, but from Oracle RDBMS. Postgres doesn't
>> allow tuning of that aspect, so no measurement can be done. Would the
>> numbers from Oracle RDBMS be acceptable?
>
> Well, they'd tell me a lot about Oracle's performance as I/O chunk size
> scales, but almost nothing about the cost of small I/O operations vs
> larger ones in general.
>
> Typically dedicated test programs that simulate the database read
> patterns would be used for this sort of thing. I'd be surprised if
> nobody on -hackers has already done suitable testing; I was mostly
> asking because I was interested in how you were backing your assertions.

One thing a test program would have to take into account is multiple
concurrent users. What speeds up the single user case may well hurt the
multi user case, and the behaviors that hurt single user cases may have been
put in place on purpose to allow decent multi-user performance. Of course, all
of that is "might" and "maybe", and I can't prove any assertions about block
size either. But the fact of multiple users needs to be kept in mind.

It was asserted that reading bigger chunks would help performance; a response
suggested that, at least in Linux, setting readahead on a device would
essentially do the same thing. Or that's what I got from the thread, anyway.
I'm interested to know how similar performance might be between the large
block size case and the large readahead case. Comments, anyone?

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com


Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/11/2010 08:27 AM, Joshua Tolley wrote:

> One thing a test program would have to take into account is multiple
> concurrent users. What speeds up the single user case may well hurt the
> multi user case, and the behaviors that hurt single user cases may have been
> put in place on purpose to allow decent multi-user performance. Of course, all
> of that is "might" and "maybe", and I can't prove any assertions about block
> size either. But the fact of multiple users needs to be kept in mind.

Agreed. I've put together a simple test program to test I/O chunk sizes.
It only tests single-user performance, but it'd be pretty trivial to
adapt it to spawn a couple of worker children or run several threads,
each with a suitable delay as it's rather uncommon to have a bunch of
seqscans all fire off at once.

From this test it's pretty clear that with buffered I/O of an uncached
700mb file under Linux, the I/O chunk size makes very little difference,
with all chunk sizes taking 9.8s to read the test file, with
near-identical CPU utilization. Caches were dropped between each test run.

For direct I/O (by ORing the O_DIRECT flag to the open() flags), chunk
size is *hugely* significant, with 4k chunk reads of the test file
taking 38s, 8k 22s, 16k 14s, 32k 10.8s, 64k - 1024k 9.8s, then rising a
little again over 1024k.

Apparently Oracle is almost always configured to use direct I/O, so it
would benefit massively from large chunk sizes. PostgreSQL is almost
never used with direct I/O, and at least in terms of the low-level costs
of syscalls and file system activity, shouldn't care at all about read
chunk sizes.

Bumping readahead from 256 to 8192 made no significant difference for
either case. Of course, I'm on a crappy laptop disk...

I'm guessing this is the origin of the OP's focus on I/O chunk sizes.

Anyway, for the single-seqscan case, I see little evidence here that
using a bigger read chunk size would help PostgreSQL reduce overheads or
improve performance.

OP: Is your Oracle instance using direct I/O?

--
Craig Ringer


Re: Slow count(*) again...

From
Mladen Gogala
Date:
  On 10/10/2010 8:27 PM, Joshua Tolley wrote:
> It was asserted that reading bigger chunks would help performance; a response
> suggested that, at least in Linux, setting readahead on a device would
> essentially do the same thing. Or that's what I got from the thread, anyway.
> I'm interested to know how similar performance might be between the large
> block size case and the large readahead case. Comments, anyone?
>

Craig may be right; the fact that Oracle is doing direct I/O probably
does account for the difference. The fact is, however, that the question
about slow sequential scans appears with some regularity on PostgreSQL
forums. My guess that a larger chunk would be helpful may not be
correct, but I do believe that there is a problem with sequential scans
being too slow.  Bigger chunks are a very traditional solution which
may not work, but the problem is still there.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Joshua Tolley
Date:
On Sun, Oct 10, 2010 at 11:14:43PM -0400, Mladen Gogala wrote:
> The fact is, however, that the question
> about slow sequential scan appears with some regularity on PostgreSQL
> forums.

Definitely. Whether that's because there's something pathologically wrong with
sequential scans, or just because they're the slowest of the common
operations, remains to be seen. After all, if sequential scans were suddenly
fast, something else would be the slowest thing postgres commonly did.

All that said, if there's gain to be had by increasing block size, or
something else, esp. if it's low hanging fruit, w00t.

--
Joshua Tolley / eggyknap
End Point Corporation
http://www.endpoint.com


Re: Slow count(*) again...

From
Craig Ringer
Date:
On 11/10/10 11:14, Mladen Gogala wrote:
>  On 10/10/2010 8:27 PM, Joshua Tolley wrote:
>> It was asserted that reading bigger chunks would help performance; a
>> response
>> suggested that, at least in Linux, setting readahead on a device would
>> essentially do the same thing. Or that's what I got from the thread,
>> anyway.
>> I'm interested to know how similar performance might be between the large
>> block size case and the large readahead case. Comments, anyone?
>>
>
> Craig may be right; the fact that Oracle is doing direct I/O probably
> does account for the difference. The fact is, however, that the question
> about slow sequential scans appears with some regularity on PostgreSQL
> forums. My guess that a larger chunk would be helpful may not be
> correct, but I do believe that there is a problem with sequential scans
> being too slow.  Bigger chunks are a very traditional solution which
> may not work, but the problem is still there.

Now that, I agree with.

BTW, I casually looked into async I/O a little, and it seems the general
situation for async I/O on Linux is "pretty bloody awful". POSIX async
I/O uses signal-driven completion handlers - but signal queue depth
limits mean they aren't necessarily reliable, so you can get lost
completions and have to scan the event buffer periodically to catch
them. The alternative is running completion handlers in threads, but
apparently there are queue depth limit issues there too, as well as the
joy that is getting POSIX threading right. I think there was some talk
about this on -HACKERS a while ago. Here's the main discussion on async
I/O I've found:

http://archives.postgresql.org/pgsql-hackers/2006-10/msg00820.php

... from which it seems that async buffered I/O is poorly supported, if
at all, on current Linux kernels. Don't know about the BSDs. As Pg is
*really* poorly suited to direct I/O, relying on the OS buffer cache as
it does, unbuffered direct I/O isn't really an option.

Linux async I/O seems to be designed for network I/O and for monitoring
lots of files for changes, rather than for highly concurrent I/O on one
or a few files. It shows.


Re slow seqscans, there is still plenty of room to move:

- Sequential scans cannot (AFAIK) use the visibility map introduced in
8.4 to skip sections of tables that are known to contain only dead
tuples (not visible to any transaction) or free space. This potential
optimization could make a big difference in tables with a low FILLFACTOR or
with holes created by certain update patterns.

- Covering indexes ("index oriented" table columns) and/or indexes with
embedded visibility information could dramatically improve the
performance of certain queries by eliminating the need to hit the heap
at all, albeit at the cost of trade-offs elsewhere. This would be
particularly useful for those classic count() queries. There have been
discussions about these on -hackers, but I'm not up with the current
status or lack thereof.

- There's been recent talk of using pread() rather than lseek() and
read() to save on syscall overhead. The difference is probably minimal,
but it'd be nice.


It is worth being aware of a few other factors:

- Sometimes seqscans are actually the fastest option, and people don't
realize this, so they try to force index use where it doesn't make
sense. This is the cause of a significant number of list complaints.

- Slow sequential scans are often a consequence of table bloat. It's
worth checking for this. Pg's autovacuum and manual vacuum have improved
in performance and usability dramatically over time, but still have room
to move. Sometimes people disable autovacuum in the name of a
short-lived performance boost, not realizing it'll have horrible effects
on performance in the mid- to long- term.

- Seqscans can be chosen when index scans are more appropriate if the
random_page_cost and seq_page_cost aren't set sensibly, which they
usually aren't. This doesn't make seqscans any faster, but it's even
worse when you have a good index you're not using. I can't help but
wonder if a bundled "quick and dirty benchmark" tool for Pg would be
beneficial in helping to determine appropriate values for these settings
and for effective_io_concurrency (some illustrative values are sketched
just below).
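
A minimal sketch of the per-session settings referred to in the last point;
the values are purely illustrative and would need to be checked against real
hardware and workloads:

-- Illustrative values only; tune against your own storage.
SET seq_page_cost = 1.0;           -- planner cost of a sequential page read
SET random_page_cost = 2.0;        -- default is 4.0; lower favours index scans
SET effective_io_concurrency = 4;  -- prefetch depth for bitmap heap scans (8.4+)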


--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Sunday 10 October 2010 15:41:16 you wrote:
> On 10/11/2010 01:14 AM, Mladen Gogala wrote:
> > I can provide measurements, but from Oracle RDBMS. Postgres doesn't
> > allow tuning of that aspect, so no measurement can be done. Would the
> > numbers from Oracle RDBMS be acceptable?
>
> Well, they'd tell me a lot about Oracle's performance as I/O chunk size
> scales, but almost nothing about the cost of small I/O operations vs
> larger ones in general.
>
> Typically dedicated test programs that simulate the database read
> patterns would be used for this sort of thing. I'd be surprised if
> nobody on -hackers has already done suitable testing; I was mostly
> asking because I was interested in how you were backing your assertions.
>
> PostgreSQL isn't Oracle; their design is in many ways very different.
> Most importantly, Oracle uses a redo log, where PostgreSQL stores old
> rows with visibility information directly in the tables. It is possible
> that a larger proportion of Oracle's I/O costs are fixed per-block
> overheads rather than per-byte costs, so it seeks to batch requests into
> larger chunks. Of course, it's also possible that 8k chunk I/O is just
> universally expensive and is something Pg should avoid, too, but we
> can't know that without
> dedicated testing, which I at least haven't done. I don't follow
> -hackers closely, and wouldn't have seen discussion about testing done
> there. The archives are likely to contain useful discussions.
>
> Then again, IIRC Pg's page size is also its I/O size, so you could
> actually get larger I/O chunking by rebuilding Pg with larger pages.
> Having never had the need, I haven't examined the effect of page
> size changes on I/O performance.

This is a really good idea! I will look into doing this and I will post the
results as soon as I can get it done.

Right now, I am building a test machine with two dual core Intel processors
and two 15KRPM mirrored hard drives, 1 GB ram. I am using a small amount of
ram because I will be using small test tables. I may do testing in the future
with more ram and bigger tables, but I think I can accomplish what we are all
after with what I have. The machine will be limited to running the database
server in test, init, bash, and ssh, no other processes will be running except
for what is directly involved with testing. I will post exact specs when I
post test results. I will create some test tables, and the same tables will be
used in all tests. Suggestions for optimal Postgres and system  configuration
are welcome. I will try any suggested settings that I have time to test.
-Neil-


>
> >> The Linux kernel, at least, does request merging (and splitting, and
> >> merging, and more splitting) along the request path, and I'd
> >> personally expect that most of the cost of 8k requests would be in the
> >> increased number of system calls, buffer copies, etc required.
> >> Measurements demonstrating or contradicting this would be good to see.
> >
> > Even the cost of hundreds of thousands of context switches is far from
> > negligible. What kind of measurements do you expect me to do with the
> > database which doesn't support tweaking of that aspect of its operation?
>
> Test programs, or references to testing done by others that demonstrates
> these costs in isolation. Of course, they still wouldn't show what gain
> Pg might obtain (nothing except hacking on Pg's sources really will) but
> they'd help measure the costs of doing I/O that way.
>
> I suspect you're right that large I/O chunks would be desirable and a
> potential performance improvement. What I'd like to know is *how*
> *much*, or at least how much the current approach costs in pure
> overheads like context switches and scheduler delays.
>
> > Does that provide enough of an evidence and, if not, why not?
>
> It shows that it helps Oracle a lot ;-)
>
> Without isolating how much of that is raw costs of the block I/O and how
> much is costs internal to Oracle, it's still hard to have much idea how
> much it'd benefit Pg to take a similar approach.
>
> I'm sure the folks on -hackers have been over this and know a whole lot
> more about it than I do, though.
>
> > Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes
> > 12.8 seconds to count 1.2 million records? Do you see the disparity?
>
> Sure. What I don't know is how much of that is due to block sizes. There
> are all sorts of areas where Oracle could be gaining.
>
> > It may be so, but the slow sequential scan is still the largest single
> > performance problem of PostgreSQL. The frequency with which that topic
> > appears on the mailing lists should serve as good evidence for that.
>
> I'm certainly not arguing that it could use improvement; it's clearly
> hurting some users. I just don't know if I/O chunking is the answer - I
> suspect that if it were, then it would've become a priority for one or
> more people working on Pg much sooner.
>
> It's quite likely that it's one of those things where it makes a huge
> difference for Oracle because Oracle has managed to optimize out most of
> the other bigger costs. If Pg still has other areas that make I/O more
> expensive per-byte (say, visibility checks) and low fixed per-block
> costs, then there'd be little point in chunking I/O. My understanding is
> that that's pretty much how things stand at the moment, but I'd love
> verification from someone who's done the testing.
>
> >If you still claim that it wouldn't make the difference,
> >
> > the onus to prove it is on you.
>
> I didn't mean to claim that it would make no difference. If I sounded
> like it, sorry.
>
> I just want to know how _much_ , or more accurately how great the
> overheads of the current approach in Pg are vs doing much larger reads.
>
> --
> Craig Ringer

Re: Slow count(*) again...

From
"Pierre C"
Date:
> I ran into a fine example of this when I was searching this mailing list,
> "Searching in 856,646 pages took 13.48202 seconds. Site search powered by
> PostgreSQL 8.3." Obviously at some point count(*) came into play here

Well, tsearch full text search is excellent, but it has to work inside the
limits of the postgres database itself, which means row visibility checks,
and therefore, yes, extremely slow count(*) on large result sets when the
tables are not cached in RAM.

Also, if you want to use custom sorting (like by date, thread, etc)
possibly all the matching rows will have to be read and sorted.

Consider, for example, the Xapian full text search engine. It is not MVCC
(it is single writer, multiple reader, so only one process can update the
index at a time, but readers are not locked out during writes). Of course,
you would never want something like that for your main database! However,
in its particular application, which is multi-criteria full text search
(and flexible sorting of results), it just nukes tsearch2 on datasets not
cached in RAM, simply because everything in it including disk layout etc,
has been optimized for the application. Lucene is similar (but I have not
benchmarked it versus tsearch2, so I can't tell).

So, if your full text search is a problem, just use Xapian. You can update
the Xapian index from a postgres trigger (using an independent process, or
simply, a plpython trigger using the python Xapian bindings). You can
query it using an extra process acting as a server, or you can write a
set-returning plpython function which performs Xapian searches, and you
can join the results to your tables.

> Pg will never have such a fast count() as MyISAM does or the same
> insanely fast read performance,

Benchmark it and you'll see: MyISAM is faster than postgres for "small simple
selects" only if:
- pg doesn't use prepared queries (planning time takes longer than a
really simple select)
- myisam can use index-only access
- no one is writing to the myisam table at the moment, obviously

On equal grounds (ie, SELECT * FROM table WHERE pk = value) there is no
difference. The TCP/IP overhead is larger than the query anyway; you have
to use unix sockets on both to get valid timings. Since by default on
localhost MySQL seems to use unix sockets and PG uses tcp/ip, PG seems 2x
slower, which is in fact not true.
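
A minimal sketch of the prepared-query point above, against a hypothetical
items table (prepare once per connection, then execute repeatedly without
paying the planning cost each time):

PREPARE get_item (integer) AS
    SELECT * FROM items WHERE id = $1;

EXECUTE get_item(42);
EXECUTE get_item(43);

DEALLOCATE get_item;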

Re: Slow count(*) again...

From
Craig James
Date:
On 10/9/10 6:47 PM, Scott Marlowe wrote:
> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>  wrote:
>> I know that there haven been many discussions on the slowness of count(*) even
>> when an index is involved because the visibility of the rows has to be
>> checked. In the past I have seen many suggestions about using triggers and
>> tables to keep track of counts and while this works fine in a situation where
>> you know what the report is going to be ahead of time, this is simply not an
>> option when an unknown WHERE clause is to be used (dynamically generated).
>> I ran into a fine example of this when I was searching this mailing list,
>> "Searching in 856,646 pages took 13.48202 seconds. Site search powered by
>> PostgreSQL 8.3." Obviously at some point count(*) came into play here because
>> the site made a list of pages (1 2 3 4 5 6>  next). I very commonly make a
>> list of pages from search results, and the biggest time killer here is the
>> count(*) portion, even worse yet, I sometimes have to hit the database with
>> two SELECT statements, one with OFFSET and LIMIT to get the page of results I
>> need and another to get the amount of total rows so I can estimate how many
>> pages of results are available. The point I am driving at here is that since
>> building a list of pages of results is such a common thing to do, there need
>> to be some specific high speed ways to do this in one query. Maybe an
>> estimate(*) that works like count but gives an answer from the index without
>> checking visibility? I am sure that this would be good enough to make a page
>> list, it is really no big deal if it errors on the positive side, maybe the
>> list of pages has an extra page off the end. I can live with that. What I
>> can't live with is taking 13 seconds to get a page of results from 850,000
>> rows in a table.
>
> 99% of the time, in situations like this, you don't need an exact
> measure, and assuming analyze has run recently, select reltuples from
> pg_class for a given table is more than close enough.  I'm sure wrapping
> that in a simple estimated_rows() function would be easy enough to do.

First of all, it's not true.  There are plenty of applications that need an exact answer.  Second, even if it is only
1%, that means it's 1% of the queries, not 1% of people.  Sooner or later a large fraction of developers will run into
this.  It's probably been the most-asked question I've seen on this forum in the four years I've been here.  It's a real
problem, and it needs a real solution.

I know it's a hard problem to solve, but can we stop hinting that those of us who have this problem are somehow being
dense?

Thanks,
Craig
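
For reference, the whole-table estimate mentioned earlier in the thread comes
straight from the planner statistics; a minimal sketch, assuming ANALYZE has
run recently and using a hypothetical table name:

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'items';

-- Or wrapped in the kind of helper function suggested above:
CREATE OR REPLACE FUNCTION estimated_rows(tbl regclass) RETURNS bigint AS $$
    SELECT reltuples::bigint FROM pg_class WHERE oid = $1;
$$ LANGUAGE sql STABLE;

SELECT estimated_rows('items');

As noted, this only covers whole-table counts; it says nothing about an
arbitrary WHERE clause.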

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Monday 11 October 2010 10:46:17 Craig James wrote:
> On 10/9/10 6:47 PM, Scott Marlowe wrote:
> > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com>
wrote:
> >> I know that there haven been many discussions on the slowness of
> >> count(*) even when an index is involved because the visibility of the
> >> rows has to be checked. In the past I have seen many suggestions about
> >> using triggers and tables to keep track of counts and while this works
> >> fine in a situation where you know what the report is going to be ahead
> >> of time, this is simply not an option when an unknown WHERE clause is
> >> to be used (dynamically generated). I ran into a fine example of this
> >> when I was searching this mailing list, "Searching in 856,646 pages
> >> took 13.48202 seconds. Site search powered by PostgreSQL 8.3."
> >> Obviously at some point count(*) came into play here because the site
> >> made a list of pages (1 2 3 4 5 6>  next). I very commonly make a list
> >> of pages from search results, and the biggest time killer here is the
> >> count(*) portion, even worse yet, I sometimes have to hit the database
> >> with two SELECT statements, one with OFFSET and LIMIT to get the page
> >> of results I need and another to get the amount of total rows so I can
> >> estimate how many pages of results are available. The point I am
> >> driving at here is that since building a list of pages of results is
> >> such a common thing to do, there need to be some specific high speed
> >> ways to do this in one query. Maybe an estimate(*) that works like
> >> count but gives an answer from the index without checking visibility? I
> >> am sure that this would be good enough to make a page list, it is
> >> really no big deal if it errors on the positive side, maybe the list of
> >> pages has an extra page off the end. I can live with that. What I can't
> >> live with is taking 13 seconds to get a page of results from 850,000
> >> rows in a table.
> >
> > 99% of the time, in situations like this, you don't need an exact
> > measure, and assuming analyze has run recently, select reltuples from
> > pg_class for a given table is more than close enough.  I'm sure wrapping
> > that in a simple estimated_rows() function would be easy enough to do.
>
> First of all, it's not true.  There are plenty of applications that need an
> exact answer.  Second, even if it is only 1%, that means it's 1% of the
> queries, not 1% of people.  Sooner or later a large fraction of developers
> will run into this.  It's probably been the most-asked question I've seen
> on this forum in the four years I've been here.  It's a real problem, and
> it needs a real solution.
>
> I know it's a hard problem to solve, but can we stop hinting that those of
> us who have this problem are somehow being dense?
>
> Thanks,
> Craig

That is why I suggested an estimate(*) that works like (a faster) count(*)
except that it may be off a bit. I think that is what he was talking about
when he wrote this.

I don't think that anyone here is trying to cast any blame; we are just
pointing out that there is a real problem here that involves what seems to be
a very common task, and it puts Postgres at a huge disadvantage compared to
other systems that can do it in less time. There doesn't seem to
be any disagreement that count(*) is slower than it could be due to MVCC and
other reasons, which is fine. However, at the chopping end of the line, if a
slow count(*) makes a Postgres-driven website take, say, a minute to render a
web page, it is completely useless when it can be replaced with a database
engine that can do the same thing in (much) less time. On my servers, this is
the major sticking point. There are so many people waiting on count(*) that
the server runs out of memory and it is forced to stop accepting more
connections until some of the threads finish. This makes many unhappy
customers.
When it comes to serving up web pages that contain a slice of a table with
links to other slices, knowing about how many slices there are is very
important. But I think that we can all agree that the exact amount is not make
or break (even better if the estimate is a bit high), so an estimate(*)
function that takes some shortcuts here to get a much faster response (maybe
off a bit) would solve a huge problem.

What it all boils down to is webserver response time, and there are really two
things that are slowing things down more than they should.
So there are really two possible solutions, either of which would be a big
help:
1. A faster count(*), or something like my proposed estimate(*).
2. A way to get the total rows matched when using LIMIT and OFFSET before
LIMIT and OFFSET are applied.

If you are making a web page that contains a few results out of many possible
results, you need two things for sure, which means that there are really two
problems with Postgres for doing this task.
1. You need to know (about) how many total rows there are. This requires a hit
to the database, which requires a scan of the table; there is no way to do this
faster than count(*) as far as I know.
2. You need a slice of the data, which requires another scan of the table
using the same WHERE clause as above. This seems like a total waste,
because we just did that, with the exception of actually fetching the data.

Why do it twice? If there were a way to get a slice using OFFSET and LIMIT
and also get the number of rows that matched before the OFFSET and LIMIT were
applied, you could do the scan once. I think that this would improve things and
give Postgres an edge over other systems.

I hope this makes sense to at least one person in the right place. ;)

-Neil-
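
One way to approximate this today in a single pass, sketched against a
hypothetical items table (on 8.4 and later the CTE is materialized, so the
WHERE clause is evaluated only once, and the LEFT JOIN preserves the total
even when the requested page is past the end; the full match set still has to
be built, so it is not free):

WITH filtered AS (
    SELECT * FROM items WHERE body LIKE '%search%'
), total AS (
    SELECT count(*) AS total_rows FROM filtered
)
SELECT t.total_rows, page.*
FROM total t
LEFT JOIN (
    SELECT * FROM filtered ORDER BY id LIMIT 50 OFFSET 100
) AS page ON true;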

Re: Slow count(*) again...

From
Samuel Gendler
Date:


On Mon, Oct 11, 2010 at 12:54 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:

2. You need a slice of the data, which requires another scan of the table
using the same WHERE clause as above. This seems like a total waste,
because we just did that, with the exception of actually fetching the data.

Why do it twice? If there were a way to get a slice using OFFSET and LIMIT
and also get the number of rows that matched before the OFFSET and LIMIT were
applied, you could do the scan once. I think that this would improve things and
give Postgres an edge over other systems.


I'd go even farther with number 2 and suggest that a form of offset/limit which can return the total count OR have a total count be passed in to be returned the same way as if total count were being computed would make the use of that api even easier, since you could keep re-using the number returned the first time without changing the api that gets used depending upon context.

Of course, you could contrive to set that up via a stored proc relatively easily by simply doing the count(*) once, then appending it to each row of the offset/limit query by including it in the select statement.  Let it optionally receive the total to be used as an input parameter, which if not null will result in the count(*) block being skipped in the proc.  You'd incur the full cost of the table scan plus offset/limit query once, but then not for each and every page.  Since the modified api you suggest for offset/limit would surely have to perform the table scan once, that solution really isn't giving much more value than implementing it as a stored proc, other than the flexibility of executing an arbitrary query.

Modified offset/limit combined with the count_estimate functionality would be very useful in this circumstance, though - especially if the estimate would just do a full count if the estimate is under a certain threshold.  A 25% discrepancy when counting millions of rows is a lot less of an issue than a 25% discrepancy when counting 10 rows.

One issue with an estimation is that you must be certain that the estimate >= actual count or else the app must always attempt to load the page BEYOND the last page of the estimate in order to determine if the estimate must be revised upward. Otherwise, you risk leaving rows out entirely.  Probably ok when returning search results.  Not so much when displaying a list of assets.
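
A rough sketch of the stored-procedure variant described above, against a
hypothetical items table: pass NULL for the total on the first request, then
feed the returned total back in for subsequent pages so the count(*) block is
skipped (parameter defaults and RETURNS TABLE need 8.4 or later):

CREATE OR REPLACE FUNCTION items_page(
    p_search   text,
    p_limit    integer,
    p_offset   integer,
    p_total_in bigint DEFAULT NULL
) RETURNS TABLE (total_rows bigint, id integer, body text) AS $$
DECLARE
    v_total bigint := p_total_in;
BEGIN
    -- Only pay for the full count on the first page request.
    IF v_total IS NULL THEN
        SELECT count(*) INTO v_total FROM items i WHERE i.body LIKE p_search;
    END IF;
    RETURN QUERY
        SELECT v_total, i.id, i.body
        FROM items i
        WHERE i.body LIKE p_search
        ORDER BY i.id
        LIMIT p_limit OFFSET p_offset;
END;
$$ LANGUAGE plpgsql;

-- First page (count runs), then later pages reuse the returned total:
-- SELECT * FROM items_page('%search%', 50, 0);
-- SELECT * FROM items_page('%search%', 50, 50, 12345);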


Re: Slow count(*) again...

From
Mladen Gogala
Date:
  On 10/11/2010 3:54 PM, Neil Whelchel wrote:
> 1. A faster count(*), or something like my proposed estimate(*).
> 2. A way to get the total rows matched when using LIMIT and OFFSET before
> LIMIT and OFFSET are applied.

The biggest single problem with "select count(*)" is that it is
seriously overused. People use that idiom to establish existence, which
usually leads to a performance disaster in the application using it,
unless the table has no more than a few hundred records. The SQL language, of
which PostgreSQL offers an excellent implementation, has offered the [NOT]
EXISTS clause since its inception in the Jurassic era. The problem is
with the sequential scan, not with counting. I'd even go as far as to
suggest that 99% of instances of the "select count(*)" idiom are probably
bad uses of the SQL language.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
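
For completeness, the existence-check idiom being referred to, against a
hypothetical orders table (EXISTS can stop at the first matching row instead
of counting them all):

-- Expensive way to ask "is there at least one?":
SELECT count(*) > 0 FROM orders WHERE customer_id = 42;

-- Cheaper: stops as soon as one matching row is found.
SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 42);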


Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 10, 2010, at 10:14 AM, Mladen Gogala wrote:

>
> SQL> show parameter db_file_multi
>
> NAME                                 TYPE        VALUE
> ------------------------------------ -----------
> ------------------------------
> db_file_multiblock_read_count        integer     16
> SQL> alter session set db_file_multiblock_read_count=1;
>
> Session altered.
> SQL> select count(*) from ni_occurrence;
>
>  COUNT(*)
> ----------
> 402062638
>
> Elapsed: 00:08:20.88
> SQL> alter session set db_file_multiblock_read_count=128;
>
> Session altered.
>
> Elapsed: 00:00:00.50
> SQL>  select count(*) from ni_occurrence;
>
>  COUNT(*)
> ----------
> 402062638
>
> Elapsed: 00:02:17.58
>
>
> In other words, when I batched the sequential scan to do 128 blocks I/O,
> it was 4 times faster than when I did the single block I/O.
> Does that provide enough of an evidence and, if not, why not?
>

Did you tune the linux FS read-ahead first?  You can get large gains by doing that if you are on ext3.
blockdev --setra 2048 <device>

would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
lived postgres DB will get extreme
file fragmentation over time and sequential scans end up mostly random.  On-line file defrag helps tremendously.

> It may be so, but the slow sequential scan is still the largest single
> performance problem of PostgreSQL. The frequency with which that topic
> appears on the mailing lists should serve as good evidence for that. I
> did my best to prove my case.

I'm not sure it's all the I/O, however.  It seems that Postgres uses a lot more CPU than other DBs to crack open a tuple
and inspect it.  Testing on unindexed tables with count(*) I can get between 200MB and 800MB per second off disk max
with full cpu utilization (depending on the average tuple size and contents).   This is on a disk array that can do
1200MB/sec.  It always feels disappointing to not be able to max out the disk throughput on the simplest possible
query.

> Again, requiring "hard numbers" when
> using the database which doesn't allow tweaking of the I/O size is self
> defeating proposition. The other databases, like DB2 and Oracle both
> allow tweaking of that aspect of its operation, Oracle even on the per
> session basis. If you still claim that it wouldn't make the difference,
> the onus to prove it is on you.
>
> --
> Mladen Gogala
> Sr. Oracle DBA
> 1500 Broadway
> New York, NY 10036
> (212) 329-5251
> www.vmsinfo.com
>
>


Re: Slow count(*) again...

From
Greg Smith
Date:
Joshua Tolley wrote:
> It was asserted that reading bigger chunks would help performance; a response
> suggested that, at least in Linux, setting readahead on a device would
> essentially do the same thing. Or that's what I got from the thread, anyway.
> I'm interested to know how similar performance might be between the large
> block size case and the large readahead case.

Large read-ahead addresses the complaint here (bulk reads are slow) just
fine, which is one reason why this whole topic isn't nearly as
interesting as claimed.  Larger chunk sizes in theory will do the same
thing, but then you're guaranteed to be reading larger blocks than
necessary in all cases.  The nice thing about a good adaptive read-ahead
is that it can target small blocks normally, and only kick into heavy
read-ahead mode when the I/O pattern justifies it.

This is a problem for the operating system to solve, and such solutions
out there are already good enough that PostgreSQL has little reason to
try and innovate in this area.  I routinely see seq scan throughput
double on Linux just by tweaking read-ahead from the tiny defaults to a
sane value.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 11, 2010, at 7:02 PM, Scott Carey wrote:

>
> On Oct 10, 2010, at 10:14 AM, Mladen Gogala wrote:
>
>>
>> SQL> show parameter db_file_multi
>>
>> NAME                                 TYPE        VALUE
>> ------------------------------------ -----------
>> ------------------------------
>> db_file_multiblock_read_count        integer     16
>> SQL> alter session set db_file_multiblock_read_count=1;
>>
>> Session altered.
>> SQL> select count(*) from ni_occurrence;
>>
>> COUNT(*)
>> ----------
>> 402062638
>>
>> Elapsed: 00:08:20.88
>> SQL> alter session set db_file_multiblock_read_count=128;
>>
>> Session altered.
>>
>> Elapsed: 00:00:00.50
>> SQL>  select count(*) from ni_occurrence;
>>
>> COUNT(*)
>> ----------
>> 402062638
>>
>> Elapsed: 00:02:17.58
>>
>>
>> In other words, when I batched the sequential scan to do 128 blocks I/O,
>> it was 4 times faster than when I did the single block I/O.
>> Does that provide enough of an evidence and, if not, why not?
>>
>
> Did you tune the linux FS read-ahead first?  You can get large gains by doing that if you are on ext3.
> blockdev --setra 2048 <device>
>
Scratch that, if you are using DirectIO, block read-ahead does nothing.  The default is 128K for buffered I/O
read-ahead.

> would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
> lived postgres DB will get extreme
> file fragmentation over time and sequential scans end up mostly random.  On-line file defrag helps tremendously.
>
>> It may be so, but the slow sequential scan is still the largest single
>> performance problem of PostgreSQL. The frequency with which that topic
>> appears on the mailing lists should serve as good evidence for that. I
>> did my best to prove my case.
>
> I'm not sure it's all the I/O, however.  It seems that Postgres uses a lot more CPU than other DBs to crack open a
> tuple and inspect it.  Testing on unindexed tables with count(*) I can get between 200MB and 800MB per second off disk
> max with full cpu utilization (depending on the average tuple size and contents).   This is on a disk array that can do
> 1200MB/sec.  It always feels disappointing to not be able to max out the disk throughput on the simplest possible
> query.
>
>> Again, requiring "hard numbers" when
>> using the database which doesn't allow tweaking of the I/O size is self
>> defeating proposition. The other databases, like DB2 and Oracle both
>> allow tweaking of that aspect of its operation, Oracle even on the per
>> session basis. If you still claim that it wouldn't make the difference,
>> the onus to prove it is on you.
>>
>> --
>> Mladen Gogala
>> Sr. Oracle DBA
>> 1500 Broadway
>> New York, NY 10036
>> (212) 329-5251
>> www.vmsinfo.com
>>
>>
>
>


Re: Slow count(*) again...

From
Mladen Gogala
Date:
On 10/11/2010 10:02 PM, Scott Carey wrote:
> Did you tune the linux FS read-ahead first?  You can get large gains by doing that if you are on ext3.
> blockdev --setra 2048<device>

Actually, I have blockdev --setra 32768

> would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
livedpostgres DB will get extreme 
> file fragmentation over time and sequential scans end up mostly random.  On-line file defrag helps tremendously.

I agree, but I am afraid that after the demise of SGI, XFS isn't being
developed. The company adopted the policy of using only the plain
vanilla Ext3, which is unfortunate, but I can't do much about it. There
is a lesson to be learned from the story of ReiserFS. I am aware of the
fact that Ext3 is a rather basic, block-oriented file system which doesn't
perform well when compared to HPFS, VxFS or JFS2 and has no notion of
extents, but I believe that I am stuck with it until the advent of
Ext4. BTW, there is no defragmenter for Ext4, not even on Ubuntu, which
is a rather bleeding-edge distribution.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Monday 11 October 2010 16:50:36 you wrote:
>   On 10/11/2010 3:54 PM, Neil Whelchel wrote:
> > 1. A faster count(*), or something like my proposed estimate(*).
> > 2. A way to get the total rows matched when using LIMIT and OFFSET before
> > LIMIT and OFFSET are applied.
>
> The biggest single problem with "select count(*)" is that it is
> seriously overused. People use that idiom to establish existence, which
> usually leads to a performance disaster in the application using it,
> unless the table has no more than a few hundred records. The SQL language, of
> which PostgreSQL offers an excellent implementation, has offered the [NOT]
> EXISTS clause since its inception in the Jurassic era. The problem is
> with the sequential scan, not with counting. I'd even go as far as to
> suggest that 99% of instances of the "select count(*)" idiom are probably
> bad uses of the SQL language.

I agree, I have seen many very bad examples of using count(*). I will go so
far as to question the use of count(*) in my examples here. Is there a better
way to come up with a page list than using count(*)? What is the best method
to make a page of results and a list of links to other pages of results? Am I
barking up the wrong tree here?
-Neil-

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Monday 11 October 2010 19:23:46 Mladen Gogala wrote:
>   On 10/11/2010 10:02 PM, Scott Carey wrote:
> > Did you tune the linux FS read-ahead first?  You can get large gains by
> > doing that if you are on ext3. blockdev --setra 2048<device>
>
> Actually, I have blockdev --setra 32768
>
> > would give you a 1MB read-ahead.  Also, consider XFS and its built-in
> > defragmentation.  I have found that a longer lived postgres DB will get
> > extreme file fragmentation over time and sequential scans end up mostly
> > random.  On-line file defrag helps tremendously.
>
> I agree, but I am afraid that after the demise of SGI, XFS isn't being
> developed. The company adopted the policy of using only the plain
> vanilla Ext3, which is unfortunate, but I can't do much about it. There
> is a lesson to be learned from the story of ReiserFS. I am aware of the
> fact that Ext3 is a rather basic, block-oriented file system which doesn't
> perform well when compared to HPFS, VxFS or JFS2 and has no notion of
> extents, but I believe that I am stuck with it until the advent of
> Ext4. BTW, there is no defragmenter for Ext4, not even on Ubuntu, which
> is a rather bleeding-edge distribution.

When it comes to a database that has many modifications to its tables, it
seems that XFS pulls way ahead of other filesystems, (likely) because of its
on-line defragmentation among other reasons. I am not sure that XFS is not
being (properly) maintained. The people at xfs.org seem to be making steady
progress and high quality updates. I have been using it for some time now (9+
years), and as it is, it does everything I need it to do, and it is very
reliable. I really can't see anything changing in the next few years that
would affect its usability as a filesystem for Postgres, so until something
proves to be better, I can't understand why you wouldn't want to use it,
maintained or not.
-Neil-

Re: Slow count(*) again...

From
Samuel Gendler
Date:


On Mon, Oct 11, 2010 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:

This is a problem for the operating system to solve, and such solutions out there are already good enough that PostgreSQL has little reason to try and innovate in this area.  I routinely see seq scan throughput double on Linux just by tweaking read-ahead from the tiny defaults to a sane value.

I spent some time going through the various tuning docs on the wiki while bringing some new hardware up and I can't remember seeing any discussion of tweaking read-ahead at all in the normal performance-tuning references.  Do you have any documentation of the kinds of tweaking you have done and its effects on different types of workloads?

Re: Slow count(*) again...

From
Scott Carey
Date:
I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O.  For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).   

Linux Read-ahead has no effect on random access performance.   A workload consisting of mixed sequential scans and random reads can be tuned to favor one over the other based on a combination of the I/O scheduler used and the amount of read-ahead.    Larger read-ahead helps sequential scans, and the Deadline scheduler tends to favor throughput (sequential scans) over latency (random access) compared to the cfq scheduler.



On Oct 11, 2010, at 8:58 PM, Samuel Gendler wrote:



On Mon, Oct 11, 2010 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:

This is a problem for the operating system to solve, and such solutions out there are already good enough that PostgreSQL has little reason to try and innovate in this area.  I routinely see seq scan throughput double on Linux just by tweaking read-ahead from the tiny defaults to a sane value.

I spent some time going through the various tuning docs on the wiki while bringing some new hardware up and I can't remember seeing any discussion of tweaking read-ahead at all in the normal performance-tuning references.  Do you have any documentation of the kinds of tweaking you have done and its effects on different types of workloads?


Re: Slow count(*) again...

From
Samuel Gendler
Date:


On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com> wrote:
I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O.  For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).   


Geez.  I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs.  I couldn't find anything that made a strong filesystem recommendation.  How does xfs compare to ext4?  I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs.  If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.

As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows).  We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed the queries that touch less frequently used partitions, that would be good.  I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists.  I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Mon, 11 Oct 2010, Samuel Gendler wrote:

> On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com>wrote:
>
>> I can't speak to documentation, but it is something that helps as your I/O
>> subsystem gets more powerful, and how much it helps depends more on your
>> hardware, which may have adaptive read ahead on its own, and your file
>> system which may be more or less efficient at sequential I/O.  For example
>> ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS
>> on a DELL PERC6 RAID card (but still ends up slower).
>>
>>
> Geez.  I wish someone would have written something quite so bold as 'xfs is
> always faster than ext3' in the standard tuning docs.  I couldn't find
> anything that made a strong filesystem recommendation.  How does xfs compare
> to ext4?  I wound up on ext4 on a dell perc6 raid card when an unexpected
> hardware failure on a production system caused my test system to get thrown
> into production before I could do any serious testing of xfs.  If there is a
> strong consensus that xfs is simply better, I could afford the downtime to
> switch.

Unfortunately, you are not going to get a clear opinion here.

ext3 has a long track record, and since it is the default, it gets a lot
of testing. It does have known issues.

XFS had problems on Linux immediately after it was ported. It continues to
be improved, and many people have been using it for years and trust it. XFS
does have a weakness in creating/deleting large numbers of small files.

ext4 is the new kid on the block. It claims good things, but it's so new
that many people don't trust it yet.

btrfs is the 'future of filesystems' that is supposed to be better than
anything else, but it's definitely not stable yet, and time will tell if
it really lives up to its promises.

And this is just on Linux.

On BSD or Solaris (or with out-of-kernel patches) you also have ZFS, which
some people swear by, and other people swear at.

David Lang


> As it happens, this is a system where all of the heavy workload is in the
> form of sequential scan type load. The OLTP workload is very minimal (tens
> of queries per minute on a small number of small tables), but there are a
> lot of reporting queries that wind up doing sequential scans of large
> partitions (millions to tens of millions of rows).  We've sized the new
> hardware so that the most commonly used partitions fit into memory, but if
> we could speed the queries that touch less frequently used partitions, that
> would be good.  I'm the closest thing our team has to a DBA, which really
> only means that I'm the one person on the dev team or the ops team to have
> read all of the postgres docs and wiki and the mailing lists.  I claim no
> actual DBA experience or expertise and have limited cycles to devote to
> tuning and testing, so if there is an established wisdom for filesystem
> choice and read ahead tuning, I'd be very interested in hearing it.
>

Re: Slow count(*) again...

From
Greg Smith
Date:
Mladen Gogala wrote:
> I agree, but I am afraid that after the demise of SGI, XFS isn't being
> developed.

It's back to being well maintained again; see
http://blog.2ndquadrant.com/en/2010/04/the-return-of-xfs-on-linux.html
for some history here and why it's become relevant to RedHat in
particular recently.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Greg Smith
Date:
Samuel Gendler wrote:
> I spent some time going through the various tuning docs on the wiki
> while bringing some new hardware up and I can't remember seeing any
> discussion of tweaking read-ahead at all in the normal
> performance-tuning references.  Do you have any documentation of the
> kinds of tweaking you have done and its effects on different types of
> workloads?

Much of my recent research has gone into the book you'll see plugged
below rather than the wiki.  The basics of read-ahead tuning are that you
can watch bonnie++ sequential read results increase as you raise the
setting, up to a point.  Get to that point, stop, and you should be in good shape.
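
A rough sketch of that calibration loop (the device, mount point, and file size below are placeholders; the file size should be at least twice RAM so bonnie++ actually hits the disk):

    # try successively larger read-ahead values and watch the
    # sequential block input number in each bonnie++ report
    for ra in 256 1024 4096 16384; do
        blockdev --setra $ra /dev/sdb
        echo "testing read-ahead of $ra sectors"
        bonnie++ -d /mnt/pgdata/bonnie -s 16g -n 0 -u postgres
    done

Stop raising the read-ahead once the sequential read number flattens out.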

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Vitalii Tymchyshyn
Date:
11.10.10 20:46, Craig James wrote:
>
> First of all, it's not true. There are plenty of applications that
> need an exact answer. Second, even if it is only 1%, that means it's
> 1% of the queries, not 1% of people. Sooner or later a large fraction
> of developers will run into this. It's probably been the most-asked
> question I've seen on this forum in the four years I've been here.
> It's a real problem, and it needs a real solution.
>
> I know it's a hard problem to solve, but can we stop hinting that
> those of us who have this problem are somehow being dense?
>
BTW: there is a lot of talk about MVCC, but is the following solution possible:
1) Create a page information map that, for each page in the table, tells
you how many rows it contains and whether any write (either successful or
not) was done to that page. This could even be two maps, so that the second
one stays really small (a bit per page) and can be kept in memory most of the time.
2) When you need to do a count(*) or an index check, first check whether
there were any writes to the page. If not, you can use the count information
from the page info/index data without going to the page itself.
3) Let vacuum clear the bit after freezing all the tuples in the page (am
I using terminology correctly?).

In this case all read-only (archive) data will have this bit off, and
index/count(*) will be really fast.
Am I missing something?

Best regards, Vitalii Tymchyshyn.

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/12/2010 03:56 PM, Vitalii Tymchyshyn wrote:

> BTW: There is a lot of talk about MVCC, but is next solution possible:
> 1) Create a page information map that for each page in the table will
> tell you how may rows are within and if any write (either successful or
> not) were done to this page. This even can be two maps to make second
> one really small (a bit per page) - so that it could be most time
> in-memory.
> 2) When you need to to count(*) or index check - first check if there
> were no writes to the page. If not - you can use count information from
> page info/index data without going to the page itself
> 3) Let vacuum clear the bit after frozing all the tuples in the page (am
> I using terminology correctly?).

Part of this already exists. It's called the visibility map, and is
present in 8.4 and above. It's not currently used for queries, but can
potentially be used to aid some kinds of query.

http://www.postgresql.org/docs/8.4/static/storage-vm.html

> In this case all read-only (archive) data will be this bit off and
> index/count(*) will be really fast.

A count with any joins or filter criteria would still have to scan all
pages with visible tuples in them. So the visibility map helps speed up
scanning of bloated tables, but doesn't provide a magical "fast count"
except in the utterly trivial "select count(*) from tablename;" case,
and can probably only be used for accurate results when there are no
read/write transactions currently open. Even if you kept a count of
tuples in each page along with the mvcc transaction ID information
required to determine for which transactions that count is valid, it'd
only be useful if you didn't have to do any condition checks, and it'd
be yet another thing to update with every insert/delete/update.

Perhaps for some users that'd be worth having, but it seems to me like
it'd have pretty narrow utility. I'm not sure that's the answer.

--
Craig Ringer

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Tue, 12 Oct 2010, Craig Ringer wrote:

>
>> BTW: There is a lot of talk about MVCC, but is next solution possible:
>> 1) Create a page information map that for each page in the table will
>> tell you how may rows are within and if any write (either successful or
>> not) were done to this page. This even can be two maps to make second
>> one really small (a bit per page) - so that it could be most time
>> in-memory.
>> 2) When you need to to count(*) or index check - first check if there
>> were no writes to the page. If not - you can use count information from
>> page info/index data without going to the page itself
>> 3) Let vacuum clear the bit after frozing all the tuples in the page (am
>> I using terminology correctly?).
>
> Part of this already exists. It's called the visibility map, and is present
> in 8.4 and above. It's not currently used for queries, but can potentially be
> used to aid some kinds of query.
>
> http://www.postgresql.org/docs/8.4/static/storage-vm.html
>
>> In this case all read-only (archive) data will be this bit off and
>> index/count(*) will be really fast.
>
> A count with any joins or filter criteria would still have to scan all pages
> with visible tuples in them. So the visibility map helps speed up scanning of
> bloated tables, but doesn't provide a magical "fast count" except in the
> utterly trivial "select count(*) from tablename;" case, and can probably only
> be used for accurate results when there are no read/write transactions
> currently open. Even if you kept a count of tuples in each page along with
> the mvcc transaction ID information required to determine for which
> transactions that count is valid, it'd only be useful if you didn't have to
> do any condition checks, and it'd be yet another thing to update with every
> insert/delete/update.
>
> Perhaps for some users that'd be worth having, but it seems to me like it'd
> have pretty narrow utility. I'm not sure that's the answer.

from a PR point of view, speeding up the trivil count(*) case could be
worth it, just to avoid people complaining about it not being fast.

in the case where you are doing a count(*) where query and the where is on
an indexed column, could the search just look at the index + the
visibility mapping rather than doing an sequential search through the
table?

as for your worries about the accuracy of a visibility based count in the
face of other transactions, wouldn't you run into the same issues if you
are doing a sequential scan with the same transactions in process?

David Lang

Re: Slow count(*) again...

From
Vitalii Tymchyshyn
Date:
12.10.10 11:14, Craig Ringer wrote:
> On 10/12/2010 03:56 PM, Vitalii Tymchyshyn wrote:
>
>> BTW: There is a lot of talk about MVCC, but is next solution possible:
>> 1) Create a page information map that for each page in the table will
>> tell you how may rows are within and if any write (either successful or
>> not) were done to this page. This even can be two maps to make second
>> one really small (a bit per page) - so that it could be most time
>> in-memory.
>> 2) When you need to to count(*) or index check - first check if there
>> were no writes to the page. If not - you can use count information from
>> page info/index data without going to the page itself
>> 3) Let vacuum clear the bit after frozing all the tuples in the page (am
>> I using terminology correctly?).
>
> Part of this already exists. It's called the visibility map, and is
> present in 8.4 and above. It's not currently used for queries, but can
> potentially be used to aid some kinds of query.
>
> http://www.postgresql.org/docs/8.4/static/storage-vm.html
>
>> In this case all read-only (archive) data will be this bit off and
>> index/count(*) will be really fast.
>
> A count with any joins or filter criteria would still have to scan all
> pages with visible tuples in them.
Only if one doesn't use partitioning. With proper partitioning, the filter can
simply select partitions.

Also, filtering can be mapped onto an index lookup. And if one could combine an
index scan with the visibility map, much like two indexes can be bitmap-joined
now, count could be really fast for everything but the non-frozen tuples.
> So the visibility map helps speed up scanning of bloated tables, but
> doesn't provide a magical "fast count" except in the utterly trivial
> "select count(*) from tablename;" case, and can probably only be used
> for accurate results when there are no read/write transactions
> currently open.
Why so? You simply have to recount, in the usual way, only the pages that are
marked dirty. But the count problem usually occurs when there is a lot of
archive data (you need to count over 100K records) that is not modified.

Best regards, Vitalii Tymchyshyn

Re: Slow count(*) again...

From
Craig Ringer
Date:
On 10/12/2010 04:22 PM, david@lang.hm wrote:

> from a PR point of view, speeding up the trivil count(*) case could be
> worth it, just to avoid people complaining about it not being fast.

At the cost of a fair bit more complexity, though, and slowing
everything else down.

The proper solution here remains, IMO, support for visibility
information in indexes, whether by storing it once in the index and once
in the heap (ouch!), storing it out-of-line, or using a covering index
where one or more columns are stored wholly in the index not in the
table heap at all.

Here are a few of the many past discussions about this that have already
covered some of the same ground:

http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr

http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html
(and the rest of the thread)

A decent look with Google will find many, many more.

> in the case where you are doing a count(*) where query and the where is
> on an indexed column, could the search just look at the index + the
> visibility mapping rather than doing an sequential search through the
> table?

Nope, because the visibility map, which is IIRC only one bit per page,
doesn't record how many tuples there are on the page, or enough
information about them to determine how many of them are visible to the
current transaction*.

> as for your worries about the accuracy of a visibility based count in
> the face of other transactions, wouldn't you run into the same issues if
> you are doing a sequential scan with the same transactions in process?

No. Every tuple in a table heap in postgresql has hidden fields, some of
which are used to determine whether the current transaction* can "see"
the tuple - it may have been inserted after this transaction started, or
deleted before this transaction started, so it's not visible to this
transaction but may still be to others.

http://www.postgresql.org/docs/current/static/ddl-system-columns.html

This information isn't available in the visibility map, or in indexes.
That's why PostgreSQL has to hit the heap to find it.
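
Those hidden fields are easy to look at if you're curious; the table name below is just a placeholder:

    -- xmin/xmax are the inserting and deleting transaction IDs,
    -- ctid is the tuple's physical location (block number, item number)
    SELECT xmin, xmax, ctid, * FROM mytable LIMIT 5;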

* current transaction should really be "current snapshot". The snapshot
is taken at the start of the whole transaction for SERIALIZABLE
isolation, and at the start of each statement for READ COMMITTED isolation.

--
Craig Ringer

Re: Slow count(*) again...

From
Mladen Gogala
Date:
david@lang.hm wrote:
> from a PR point of view, speeding up the trivil count(*) case could be
> worth it, just to avoid people complaining about it not being fast.
>
>
Fixing PR stuff is not the approach that I would take. People are
complaining about select count(*) because they're using it in all the
wrong places.  My assessment that there is a problem with sequential
scan was wrong. Now, let's again take Oracle as the measure.
Someone asked me about caching the data.  Here it is:

SQL> connect system/*********
Connected.
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:12.68
SQL> connect adbase/*********
Connected.
SQL> alter session set db_file_multiblock_read_Count=128;

Session altered.

Elapsed: 00:00:00.41
SQL> select count(*) from ni_occurrence;

  COUNT(*)
----------
 402062638

Elapsed: 00:02:37.77

SQL> select bytes/1048576 MB from user_segments
  2  where segment_name='NI_OCCURRENCE';

        MB
----------
     35329

Elapsed: 00:00:00.20
SQL>


So, the results weren't cached the first time around. The explanation is
that Oracle, as of version 10.2.0, reads the table into private process
memory, not into the shared buffers.  This table alone is 35GB in size;
Oracle took 2 minutes 37 seconds to read it using a full table scan. If I
do the same thing with PostgreSQL and a comparable table, Postgres is, in
fact, faster:

psql (9.0.1)
Type "help" for help.

news=> \timing
Timing is on.
news=> select count(*) from moreover_documents_y2010m09;
  count
----------
 17242655
(1 row)

Time: 113135.114 ms
news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09'));
 pg_size_pretty
----------------
 27 GB
(1 row)

Time: 100.849 ms
news=>

The number of rows is significantly smaller, but the table contains a
rather significant "text" field which consumes quite a bit of TOAST
storage, and the sizes are comparable. Postgres read through 27GB in 113
seconds, less than 2 minutes, and Oracle took 2 minutes 37 seconds to
read through 35GB.  I stand corrected: there is nothing wrong with the
speed of the Postgres sequential scan.


--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Jon Nelson
Date:
On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala
<mladen.gogala@vmsinfo.com> wrote:
>
> So, the results weren't cached the first time around. The explanation is the
> fact that Oracle, as of the version 10.2.0, reads the table in the private
> process memory, not in the shared buffers.  This table alone is  35GB in
> size,  Oracle took 2 minutes 47 seconds to read it using the full table
> scan. If I do the same thing with PostgreSQL and a comparable table,
> Postgres is, in fact, faster:


Well, I didn't quite mean that - having no familiarity with Oracle I
don't know what the alter system statement does, but I was talking
specifically about the linux buffer and page cache. The easiest way to
drop the linux caches in one fell swoop is:

echo 3 > /proc/sys/vm/drop_caches

Is there a command to tell postgresql to drop/clear/reset its buffer_cache?

Clearing/dropping both the system (Linux) and the DB caches is
important when doing benchmarks that involve I/O.



--
Jon

Re: Slow count(*) again...

From
Greg Smith
Date:
Jon Nelson wrote:
> Is there a command to tell postgresql to drop/clear/reset it's buffer_cache?
>

No.  Usually the sequence used to remove all cached data from RAM before
a benchmark is:

pg_ctl stop
sync
echo 3 > /proc/sys/vm/drop_caches
pg_ctl start

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Luca Tettamanti
Date:
On Tue, Oct 12, 2010 at 3:07 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala
> <mladen.gogala@vmsinfo.com> wrote:
>>
>> So, the results weren't cached the first time around. The explanation is the
>> fact that Oracle, as of the version 10.2.0, reads the table in the private
>> process memory, not in the shared buffers.  This table alone is  35GB in
>> size,  Oracle took 2 minutes 47 seconds to read it using the full table
>> scan. If I do the same thing with PostgreSQL and a comparable table,
>> Postgres is, in fact, faster:
>
> Well, I didn't quite mean that - having no familiarity with Oracle I
> don't know what the alter system statement does, but I was talking
> specifically about the linux buffer and page cache. The easiest way to
> drop the linux caches in one fell swoop is:
>
> echo 3 > /proc/sys/vm/drop_caches

AFAIK this won't affect Oracle when using direct IO (which bypasses
the page cache).

Luca

Re: Slow count(*) again...

From
Jon Nelson
Date:
On Tue, Oct 12, 2010 at 8:18 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> No.  Usually the sequence used to remove all cached data from RAM before a
> benchmark is:

All cached data (as cached in postgresql - *not* the Linux system
caches)..., right?


--
Jon

Re: Slow count(*) again...

From
Tom Lane
Date:
Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
> The number of rows is significantly smaller, but the table contains
> rather significant "text" field which consumes quite a bit of TOAST
> storage and the sizes are comparable. Postgres read through 27GB in 113
> seconds, less than 2 minutes and oracle took 2 minutes 37 seconds to
> read through 35GB.  I stand corrected: there is nothing wrong with the
> speed of the Postgres sequential scan.

Um ... the whole point of TOAST is that the data isn't in-line.
So what Postgres was actually reading through was probably quite a
lot less than 27Gb.  It's probably hard to make a completely
apples-to-apples comparison because the two databases are so different,
but I don't think this one proves that PG is faster than Oracle.

            regards, tom lane

Re: Slow count(*) again...

From
"Kevin Grittner"
Date:
Neil Whelchel <neil.whelchel@gmail.com> wrote:

> What is the best method to make a page of results and a list of
> links to other pages of results?

For our most heavily used web app we decided to have the renderer
just read the list of cases and render the pages to disk, and then
present the first one.  We set a limit of 500 entries on the list;
if we get past 500 we put up a page telling them to refine their
search criteria.  That won't work for all circumstances, but it
works well for our web app.

-Kevin
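
A minimal sketch of that cap-and-refine pattern in plain SQL, with a hypothetical cases table and filter: ask for one row more than the cap, and if the extra row comes back, show the "refine your search criteria" page instead of building a pager.

    -- getting 501 rows back means "too many results, please refine"
    SELECT case_id, caption, filed_date
      FROM cases
     WHERE county = 'Dane' AND status = 'open'
     ORDER BY filed_date DESC
     LIMIT 501;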

Re: Slow count(*) again...

From
Mladen Gogala
Date:
Tom Lane wrote:
> Mladen Gogala <mladen.gogala@vmsinfo.com> writes:
>
>> The number of rows is significantly smaller, but the table contains
>> rather significant "text" field which consumes quite a bit of TOAST
>> storage and the sizes are comparable. Postgres read through 27GB in 113
>> seconds, less than 2 minutes and oracle took 2 minutes 37 seconds to
>> read through 35GB.  I stand corrected: there is nothing wrong with the
>> speed of the Postgres sequential scan.
>>
>
> Um ... the whole point of TOAST is that the data isn't in-line.
> So what Postgres was actually reading through was probably quite a
> lot less than 27Gb.  It's probably hard to make a completely
> apples-to-apples comparison because the two databases are so different,
> but I don't think this one proves that PG is faster than Oracle.
>
>             regards, tom lane
>

As is usually the case, you're right. I will try copying the table to
Postgres over the weekend; my management would not look kindly upon my
copying 35GB of production data during working hours for scientific
purposes. I have the storage and I can test, and I will post the result.
I developed quite an efficient Perl script which does the copying without
an intervening CSV file, so the copy should not take more than 2 hours.
I will be able to impose a shared lock on the table over the weekend, so
that I don't blow away the UNDO segments.

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Slow count(*) again...

From
Joe Uhl
Date:
>> The biggest single problem with "select count(*)" is that it is
>> seriously overused. People use that idiom to establish existence, which
>> usually leads to a performance disaster in the application using it,
>> unless the table has no more than few hundred records. SQL language, of
>> which PostgreSQL offers an excellent implementation,  offers [NOT]
>> EXISTS clause since its inception in the Jurassic era. The problem is
>> with the sequential scan, not with counting. I'd even go as far as to
>> suggest that 99% instances of the "select count(*)" idiom are probably
>> bad use of the SQL language.
>
> I agree, I have seen many very bad examples of using count(*). I will go so
> far as to question the use of count(*) in my examples here. Is there a better
> way to come up with a page list than using count(*)? What is the best method
> to make a page of results and a list of links to other pages of results? Am I
> barking up the wrong tree here?
One way I have dealt with this on very large tables is to cache the count(*) at the application level (using memcached, terracotta, or something along those lines) and then increment that cache whenever you add a row to the relevant table.  On application restart that cache is re-initialized with a regular old count(*).  This approach works really well and all large systems in my experience need caching in front of the DB eventually.  If you have a simpler system with say a single application/web server you can simply store the value in a variable, the specifics would depend on the language and framework you are using.

Another more all-DB approach is to create a statistics tables into which you place aggregated statistics rows (num deleted, num inserted, totals, etc) at an appropriate time interval in your code.  So you have rows containing aggregated statistics information for the past and some tiny portion of the new data happening right now that hasn't yet been aggregated.  Queries then look like a summation of the aggregated values in the statistics table plus a count(*) over just the newest portion of the data table and are generally very fast.
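
A sketch of that statistics-table pattern, with made-up table and column names: a periodic job rolls completed intervals into an aggregate table, and a count becomes one cheap sum over the aggregates plus an exact count(*) over the small unaggregated tail.

    -- rolled-up counts, maintained by a periodic job
    CREATE TABLE event_stats (
        bucket_start timestamptz PRIMARY KEY,
        row_count    bigint NOT NULL
    );

    -- total = aggregated history + exact count of rows newer than the last bucket
    SELECT (SELECT coalesce(sum(row_count), 0) FROM event_stats)
         + (SELECT count(*) FROM events WHERE created_at >= '2010-10-12 00:00')
           AS total_rows;

In practice the cutoff timestamp would come from the newest bucket_start rather than a literal.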

Overall I have found that once things get big the layers of your app stack start to blend together and have to be combined in clever ways to keep speed up.  Postgres is a beast but when you run into things it can't do well just find a way to cache it or make it work together with some other persistence tech to handle those cases.


Re: Slow count(*) again...

From
bricklen
Date:
On Sat, Oct 9, 2010 at 4:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
> Maybe an
> estimate(*) that works like count but gives an answer from the index without
> checking visibility? I am sure that this would be good enough to make a page
> list, it is really no big deal if it errors on the positive side, maybe the
> list of pages has an extra page off the end. I can live with that. What I
> can't live with is taking 13 seconds to get a page of results from 850,000
> rows in a table.
> -Neil-
>

FWIW, Michael Fuhr wrote a small function to parse the EXPLAIN plan a
few years ago and it works pretty well assuming your stats are up to
date.

http://markmail.org/message/gknqthlwry2eoqey
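
For reference, a function in the same spirit (a sketch, not Michael Fuhr's exact code) just pulls the planner's row estimate out of EXPLAIN:

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
    DECLARE
        rec  record;
        rows bigint;
    BEGIN
        FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
            -- the top plan node line looks like "...  (cost=... rows=12345 width=...)"
            rows := substring(rec."QUERY PLAN" from ' rows=([[:digit:]]+)');
            EXIT WHEN rows IS NOT NULL;
        END LOOP;
        RETURN rows;
    END;
    $$ LANGUAGE plpgsql;

    -- e.g.: SELECT count_estimate('SELECT * FROM mytable WHERE field = 42');

The estimate is only as good as the statistics, and it can be badly off for complex WHERE clauses, but it costs milliseconds instead of a full scan.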

Re: Slow count(*) again...

From
"Kevin Grittner"
Date:
Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Greg Smith <greg@2ndquadrant.com> wrote:

>> Usually the sequence used to remove all cached data from RAM
>> before a benchmark is:
>
> All cached data (as cached in postgresql - *not* the Linux system
> caches)..., right?

No.  The stop and start of PostgreSQL causes empty PostgreSQL
caches.  These lines, in between the stop and start, force the Linux
cache to be empty (on recent kernel versions):

sync
echo 3 > /proc/sys/vm/drop_caches

-Kevin

Re: Slow count(*) again...

From
Dan Harris
Date:
On 10/11/10 8:02 PM, Scott Carey wrote:
> would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
> lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random.  On-line
> file defrag helps tremendously.
>
We just had a corrupt table caused by an XFS online defrag.  I'm scared
to use this again while the db is live.  Has anyone else found this to
be safe?  But, I can vouch for the fragmentation issue, it happens very
quickly in our system.

-Dan

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Tue, 12 Oct 2010, Joe Uhl wrote:

>>> The biggest single problem with "select count(*)" is that it is
>>> seriously overused. People use that idiom to establish existence, which
>>> usually leads to a performance disaster in the application using it,
>>> unless the table has no more than few hundred records. SQL language, of
>>> which PostgreSQL offers an excellent implementation,  offers [NOT]
>>> EXISTS clause since its inception in the Jurassic era. The problem is
>>> with the sequential scan, not with counting. I'd even go as far as to
>>> suggest that 99% instances of the "select count(*)" idiom are probably
>>> bad use of the SQL language.
>>
>> I agree, I have seen many very bad examples of using count(*). I will go so
>> far as to question the use of count(*) in my examples here. It there a
>> better
>> way to come up with a page list than using count(*)? What is the best
>> method
>> to make a page of results and a list of links to other pages of results? Am
>> I
>> barking up the wrong tree here?
> One way I have dealt with this on very large tables is to cache the count(*)
> at the application level (using memcached, terracotta, or something along
> those lines) and then increment that cache whenever you add a row to the
> relevant table.  On application restart that cache is re-initialized with a
> regular old count(*).  This approach works really well and all large systems
> in my experience need caching in front of the DB eventually.  If you have a
> simpler system with say a single application/web server you can simply store
> the value in a variable, the specifics would depend on the language and
> framework you are using.

this works if you know ahead of time what the criteria of the search are
going to be.

so it will work for

select count(*) from table;

what this won't work for is cases where the criteria of the search are
unpredictable, i.e.

ask the user for input

select count(*) from table where field=$input;

David Lang

> Another more all-DB approach is to create a statistics tables into which you
> place aggregated statistics rows (num deleted, num inserted, totals, etc) at
> an appropriate time interval in your code.  So you have rows containing
> aggregated statistics information for the past and some tiny portion of the
> new data happening right now that hasn't yet been aggregated.  Queries then
> look like a summation of the aggregated values in the statistics table plus a
> count(*) over just the newest portion of the data table and are generally
> very fast.
>
> Overall I have found that once things get big the layers of your app stack
> start to blend together and have to be combined in clever ways to keep speed
> up.  Postgres is a beast but when you run into things it can't do well just
> find a way to cache it or make it work together with some other persistence
> tech to handle those cases.
>
>
>

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Tue, 12 Oct 2010, Mladen Gogala wrote:

> david@lang.hm wrote:
>> from a PR point of view, speeding up the trivil count(*) case could be
>> worth it, just to avoid people complaining about it not being fast.
>>
>>
> Fixing PR stuff is not the approach that I would take. People are complaining
> about select count(*) because they're using it in all the wrong places.

that may be the case, but if it's possible to make it less painful it will
mean more people use postgres, both because it works better for them when
they are using suboptimal programs, and because when people do their
trivial testing of databases to decide which one they will use, they
won't rule out postgres because "it's so slow"

the fact of the matter is that people do use count(*), and even though
there are usually ways to avoid doing so, making the programmer do
something different for postgres than they do for other databases
raises a barrier against postgres utilization in anything.

David Lang

>  My
> assessment that there is a problem with sequential scan was wrong. Now, let's
> again take Oracle as the measure.
> Someone asked me about caching the data.  Here it is:
>
> SQL> connect system/*********
> Connected.
> SQL> alter system flush buffer_cache;
>
> System altered.
>
> Elapsed: 00:00:12.68
> SQL> connect adbase/*********
> Connected.
> SQL> alter session set db_file_multiblock_read_Count=128;
>
> Session altered.
>
> Elapsed: 00:00:00.41
> SQL> select count(*) from ni_occurrence;
>
> COUNT(*)
> ----------
> 402062638
>
> Elapsed: 00:02:37.77
>
> SQL> select bytes/1048576 MB from user_segments
> 2  where segment_name='NI_OCCURRENCE';
>
>       MB
> ----------
>    35329
>
> Elapsed: 00:00:00.20
> SQL>
>
>
> So, the results weren't cached the first time around. The explanation is the
> fact that Oracle, as of the version 10.2.0, reads the table in the private
> process memory, not in the shared buffers.  This table alone is  35GB in
> size,  Oracle took 2 minutes 47 seconds to read it using the full table scan.
> If I do the same thing with PostgreSQL and a comparable table, Postgres is,
> in fact, faster:
>
> psql (9.0.1)
> Type "help" for help.
>
> news=> \timing
> Timing is on.
> news=> select count(*) from moreover_documents_y2010m09;
> count  ----------
> 17242655
> (1 row)
>
> Time: 113135.114 ms
> news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09'));
> pg_size_pretty
> ----------------
> 27 GB
> (1 row)
>
> Time: 100.849 ms
> news=>
>
> The number of rows is significantly smaller, but the table contains rather
> significant "text" field which consumes quite a bit of TOAST storage and the
> sizes are comparable. Postgres read through 27GB in 113 seconds, less than 2
> minutes and oracle took 2 minutes 37 seconds to read through 35GB.  I stand
> corrected: there is nothing wrong with the speed of the Postgres sequential
> scan.
>
>
>

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Tue, 12 Oct 2010, Craig Ringer wrote:

> On 10/12/2010 04:22 PM, david@lang.hm wrote:
>
>> from a PR point of view, speeding up the trivil count(*) case could be
>> worth it, just to avoid people complaining about it not being fast.
>
> At the cost of a fair bit more complexity, though, and slowing everything
> else down.

complexity probably, although given how complex the planner already is, is
this significant?

as far as slowing everything else down, why would it do that? (beyond the
simple fact that any new thing the planner can do makes the planner take a
little longer)

David Lang

> The proper solution here remains, IMO, support for visibility information in
> indexes, whether by storing it once in the index and once in the heap
> (ouch!), storing it out-of-line, or using a covering index where one or more
> columns are stored wholly in the index not in the table heap at all.
>
> Here are a few of the many past discussions about this that have already
> covered some of the same ground:
>
>
http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr
>
> http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html
> (and the rest of the thread)
>
> A decent look with Google will find many, many more.
>
>> in the case where you are doing a count(*) where query and the where is
>> on an indexed column, could the search just look at the index + the
>> visibility mapping rather than doing an sequential search through the
>> table?
>
> Nope, because the visibility map, which is IIRC only one bit per page,
> doesn't record how many tuples there are on the page, or enough information
> about them to determine how many of them are visible to the current
> transaction*.
>
>> as for your worries about the accuracy of a visibility based count in
>> the face of other transactions, wouldn't you run into the same issues if
>> you are doing a sequential scan with the same transactions in process?
>
> No. Every tuple in a table heap in postgresql has hidden fields, some of
> which are used to determine whether the current transaction* can "see" the
> tuple - it may have been inserted after this transaction started, or deleted
> before this transaction started, so it's not visible to this transaction but
> may still be to others.
>
> http://www.postgresql.org/docs/current/static/ddl-system-columns.html
>
> This information isn't available in the visibility map, or in indexes. That's
> why PostgreSQL has to hit the heap to find it.
>
> * current transaction should really be "current snapshot". The snapshot is
> taken at the start of the whole transaction for SERIALIZABLE isolation, and
> at the start of each statement for READ COMMITTED isolation.
>
> --
> Craig Ringer
>

Re: Slow count(*) again...

From
Scott Carey
Date:

On Oct 11, 2010, at 9:21 PM, Samuel Gendler wrote:



On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com> wrote:
I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O.  For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).   


Geez.  I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs.  I couldn't find anything that made a strong filesystem recommendation.  How does xfs compare to ext4?  I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs.  If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.

As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows).  We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed the queries that touch less frequently used partitions, that would be good.  I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists.  I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.


ext4 is a very fast file system.  It's faster than ext2, but has many more features and has the all-important journaling.

However, for large reporting queries and sequential scans, XFS will win in the long run if you use the online defragmenter.  Otherwise, your sequential scans won't be all that sequential on any file system over time if your tables aren't written once, forever, serially.   Parallel restore will result in a system that is fragmented -- ext4 will do best at limiting this on the restore, but only xfs has online defragmentation.  We schedule ours daily and it noticeably improves sequential scan I/O.

Supposedly, an online defragmenter is in the works for ext4, but it may be years before it's available.
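
For anyone wanting to copy that schedule: the XFS online defragmenter is xfs_fsr, and a nightly run can be as simple as a cron entry like the one below (the path and time limit are just examples; test it against your own setup first).

    # /etc/cron.d/xfs-defrag: reorganize the postgres filesystem nightly,
    # giving up after at most two hours (-t is in seconds)
    30 2 * * *  root  /usr/sbin/xfs_fsr -t 7200 /var/lib/pgsql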

Re: Slow count(*) again...

From
Samuel Gendler
Date:
On Tue, Oct 12, 2010 at 9:02 AM, Scott Carey <scott@richrelevance.com> wrote:

However, for large reporting queries and sequential scans, XFS will win in the long run if you use the online defragmenter.  Otherwise, your sequential scans won't be all that sequential on any file system over time if your tables aren't written once, forever, serially.   Parallel restore will result in a system that is fragmented -- ext4 will do best at limiting this on the restore, but only xfs has online defragmentation.  We schedule ours daily and it noticeably improves sequential scan I/O.


Our reporting tables are written sequentially and left unmodified until entire partitions are dropped.  However, equivalent partitions tend to get a little bit larger over time, so newer partitions won't necessarily fit into the gaps left by prior partition drops, so it is possible that partitions will be split into two sections, but should still be very sequential, if not perfectly so.  It would seem that we stumbled into an ideal architecture for doing this kind of work - mostly by virtue of starting with 8.2.x and having huge problems with autovacuum and vacuum taking forever and dragging the db to halt, which caused us to move to an architecture which aggregates and then drops older data in entire partitions instead of updating aggregates individually and then deleting rows. Partitions are sized such that most reporting queries run over entire partitions, too (which was completely accidental since I had not yet delved into individual query optimization at the time), so even though we are doing sequential scans, we at least run as few of them as possible and are able to keep hot data in memory.

--sam


 

Re: Slow count(*) again...

From
Chris Browne
Date:
sgendler@ideasculptor.com (Samuel Gendler) writes:
> Geez.  I wish someone would have written something quite so bold as
> 'xfs is always faster than ext3' in the standard tuning docs.  I
> couldn't find anything that made a strong filesystem
> recommendation.  How does xfs compare to ext4?  I wound up on ext4 on
> a dell perc6 raid card when an unexpected hardware failure on a
> production system caused my test system to get thrown into production
> before I could do any serious testing of xfs.  If there is a strong
> consensus that xfs is simply better, I could afford the downtime to
> switch.

It's news to me (in this thread!) that XFS is actually "getting some
developer love," which is a pretty crucial factor to considering it
relevant.

XFS was an SGI creation, and, with:

 a) the not-scintillating performance of the company,

 b) the lack of a lot of visible work going into the filesystem,

 c) the paucity of support by Linux vendors (for a long time, if you
    told RHAT you were having problems, and were using XFS, the next
    step would be to park the ticket awaiting your installing a
    "supported filesystem")

it didn't look like XFS was a terribly good bet.  Those issues were
certainly causing concern a couple of years ago.

Faster "raw performance" isn't much good if it comes with a risk of:
 - Losing data
 - Losing support from vendors

If XFS now *is* getting support from both the development and support
perspectives, then the above concerns may have been invalidated.  It
would be very encouraging, if so.
--
output = ("cbbrowne" "@" "gmail.com")
Rules of  the Evil Overlord  #228.  "If the  hero claims he  wishes to
confess  in public  or to  me  personally, I  will remind  him that  a
notarized deposition will serve just as well."

Re: Slow count(*) again...

From
Scott Carey
Date:
>>
>
> A count with any joins or filter criteria would still have to scan all
> pages with visible tuples in them. So the visibility map helps speed up
> scanning of bloated tables, but doesn't provide a magical "fast count"
> except in the utterly trivial "select count(*) from tablename;" case,
> and can probably only be used for accurate results when there are no
> read/write transactions currently open.

select count(*) from tablename where [condition or filter that can use an index] [group by on columns in the index]

will also work, I think.

Additionally, I think it can work if other open transactions exist, provided they haven't written to the table being
scanned. If they have, then only those pages that have been altered and marked in the visibility map need to be cracked
open the normal way.

> Even if you kept a count of
> tuples in each page along with the mvcc transaction ID information
> required to determine for which transactions that count is valid, it'd
> only be useful if you didn't have to do any condition checks, and it'd
> be yet another thing to update with every insert/delete/update.
>

Yes, lots of drawbacks and added complexity.

> Perhaps for some users that'd be worth having, but it seems to me like
> it'd have pretty narrow utility. I'm not sure that's the answer.
>
> --
> Craig Ringer
>


Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-12 18:02, Scott Carey wrote:
> However, for large reporting queries and sequential scans, XFS will
> win in the long run if you use the online defragmenter.  Otherwise,
> your sequential scans won't be all that sequential on any file system
> over time if your tables aren't written once, forever, serially.
> Parallel restore will result in a system that is fragmented -- ext4
> will do best at limiting this on the restore, but only xfs has online
> defragmentation.  We schedule ours daily and it noticeably improves
> sequential scan I/O.
>
> Supposedly, an online defragmenter is in the works for ext4 but it
> may be years before its available.

If some clever postgres hacker could teach postgres to allocate blocks
using posix_fallocate in quite large batches, say .. something like:
fallocate(min(current_relation_size *0.1,1073741824));

So if you have a relation already filling 10GB, then the next "file" for the
relation is just fully allocated from the first byte by the filesystem. That
would ensure that a large table sits efficiently at the filesystem level with
a minimum of fragmentation on ext4 (and other FSs supporting posix_fallocate),
and for small systems it would only fill 10% more disk space...

Last night I spent an hour looking for where it's done but couldn't find the
source file where extension of an existing relation takes place. Can
someone give directions?

--
Jesper




Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 12, 2010, at 8:39 AM, Dan Harris wrote:

>  On 10/11/10 8:02 PM, Scott Carey wrote:
>> would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
>> lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random.  On-line
>> file defrag helps tremendously.
>>
> We just had a corrupt table caused by an XFS online defrag.  I'm scared
> to use this again while the db is live.  Has anyone else found this to
> be safe?  But, I can vouch for the fragmentation issue, it happens very
> quickly in our system.
>

What version?  I'm using the latest CentOS extras build.

We've been doing online defrag for a while now on a very busy database with > 8TB of data.  Not that that means there
are no bugs...

It is a relatively simple thing in xfs -- it writes a new file to temp in a way that allocates contiguous space if
available, then if the file has not been modified since it was re-written it is essentially moved on top of the other
one.  This should be safe provided the journaling and storage is safe, etc.


> -Dan
>


Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 12, 2010, at 8:54 AM, <david@lang.hm> wrote:

> On Tue, 12 Oct 2010, Craig Ringer wrote:
>
>> On 10/12/2010 04:22 PM, david@lang.hm wrote:
>>
>>> from a PR point of view, speeding up the trivil count(*) case could be
>>> worth it, just to avoid people complaining about it not being fast.
>>
>> At the cost of a fair bit more complexity, though, and slowing everything
>> else down.
>
> complexity probably, although given how complex the planner is already is
> this significant?
>
> as far as slowing everything else down, why would it do that? (beyond the
> simple fact that any new thing the planner can do makes the planner take a
> little longer)
>
> David Lang
>
I wouldn't even expect the planner to do more work.  An Index Scan can simply avoid going to the tuples for visibility
under some circumstances.


Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 12, 2010, at 9:46 AM, Scott Carey wrote:

>
> On Oct 12, 2010, at 8:54 AM, <david@lang.hm> wrote:
>
>> On Tue, 12 Oct 2010, Craig Ringer wrote:
>>
>>> On 10/12/2010 04:22 PM, david@lang.hm wrote:
>>>
>>>> from a PR point of view, speeding up the trivil count(*) case could be
>>>> worth it, just to avoid people complaining about it not being fast.
>>>
>>> At the cost of a fair bit more complexity, though, and slowing everything
>>> else down.
>>
>> complexity probably, although given how complex the planner is already is
>> this significant?
>>
>> as far as slowing everything else down, why would it do that? (beyond the
>> simple fact that any new thing the planner can do makes the planner take a
>> little longer)
>>
>> David Lang
>>
> I wouldn't even expect the planner to do more work.  An Index Scan can simply avoid going to the tuples for
> visibility under some circumstances.
>
>
Of course, the planner has to ....  Otherwise it won't choose the Index Scan over the sequential scan.  So the cost of
index scans when all the info other than visibility is in the index would need to be lowered.




Re: Slow count(*) again...

From
Dan Harris
Date:
On 10/12/10 10:44 AM, Scott Carey wrote:
> On Oct 12, 2010, at 8:39 AM, Dan Harris wrote:
>
>>   On 10/11/10 8:02 PM, Scott Carey wrote:
>>> would give you a 1MB read-ahead.  Also, consider XFS and its built-in defragmentation.  I have found that a longer
>>> lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random.  On-line
>>> file defrag helps tremendously.
>>>
>> We just had a corrupt table caused by an XFS online defrag.  I'm scared
>> to use this again while the db is live.  Has anyone else found this to
>> be safe?  But, I can vouch for the fragmentation issue, it happens very
>> quickly in our system.
>>
> What version?  I'm using the latest CentoOS extras build.
>
> We've been doing online defrag for a while now on a very busy database with > 8TB of data.  Not that that means there
> are no bugs...
>
> It is a relatively simple thing in xfs -- it writes a new file to temp in a way that allocates contiguous space if
> available, then if the file has not been modified since it was re-written it is essentially moved on top of the other
> one.  This should be safe provided the journaling and storage is safe, etc.
>
I'm not sure how to figure out what version of XFS we're on, but it's
Linux kernel 2.6.24-24 x86_64 on Ubuntu Server 8.04.3.  Postgres version 8.3.

We're due for an upgrade on that server soon so we'll do some more
testing once we upgrade.  Right now we are just living with the
fragmentation.  I'm glad to hear the regular on-line defrag is working
successfully, at least that gives me hope we can rely on it in the future.

-Dan

Re: Slow count(*) again...

From
Tom Lane
Date:
So I spent a bit of quality time with oprofile this morning, and found
once again that there's no substitute for having actual data before
theorizing.

Test case software: current Git HEAD (plus one code change explained
below), compiled with --enable-debug to support oprofile, cassert off;
no other special configure options.  Running on current Fedora 13 (gcc
4.4.4 in particular).  All postgresql.conf options are out-of-the-box.

Test case hardware: recently purchased mid-grade desktop, dual Xeon
E5503 processors (Nehalem cores, 2GHZ), 4GB DDR3-800 RAM, no-name
SATA disk.

Test query: "select count(*) from t" where t has 4 nonnull integer
columns and 81920000 rows, occupying 3459MB.  I chose that size
specifically to fit into available RAM, so that on repeated executions
no physical I/O will occur.
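
Tom doesn't show how the table was populated; a table of roughly that shape can be built with something like this (a sketch, not his actual script):

    CREATE TABLE t AS
        SELECT g AS c1, g AS c2, g AS c3, g AS c4
        FROM generate_series(1, 81920000) g;
    VACUUM ANALYZE t;   -- sets hint bits and the all-visible page flags discussed below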

On this setup I find that "select count(*)" runs in about 7.5sec when
the data is fully cached in RAM, for a scanning speed of 460MB/sec.
This is well in excess of what the machine's disk hardware can do:
bonnie++ rates the machine's disk read speed at 152MB/sec.  So in theory
PG should be able to completely saturate the disk when processing a table
bigger than RAM.  In reality the test case run time if I've just flushed
cache is about 28sec, working out to a scan rate of 123MB/sec.  I expect
if I'd bothered to tune the kernel readahead parameters as outlined
earlier in this thread, I could get to 150MB/sec.

Now of course this disk setup is far from industrial strength, but the
processor isn't what you'd put in a serious database server either (in
particular, its available memory bandwidth is well behind the curve).
Also, the table is pretty narrow (only 16 payload bytes per row), and
any wider test table would show a pretty much linear scaling of achievable
scan rate versus table width.  So I don't see much support here at all
for the notion that we scan slower than available disk bandwidth.

Further details from poking at it with oprofile: in the fully-cached
case the CPU time is about 80% Postgres and 20% kernel.  That kernel
time is of course all to do with moving pages from kernel disk buffers
into Postgres shared memory.  Although I've not bothered to increase
shared_buffers from the default 32MB, it wouldn't matter on this benchmark
unless I were able to make shared_buffers hold the entire table ... and
even then I'd only save 20%.

oprofile further shows that (with stock Postgres sources) the userspace
runtime breaks down like this:

samples  %        symbol name
141267   13.0810  heapgettup_pagemode
85947     7.9585  advance_aggregates
83031     7.6885  ExecProject
78975     7.3129  advance_transition_function
75060     6.9504  heapgetpage
73540     6.8096  ExecClearTuple
69355     6.4221  ExecProcNode
59288     5.4899  heap_getnext
57745     5.3470  ExecScan
55618     5.1501  HeapTupleSatisfiesMVCC
47057     4.3574  MemoryContextReset
41904     3.8802  ExecStoreTuple
37146     3.4396  SeqNext
32206     2.9822  ExecAgg
22135     2.0496  XidInMVCCSnapshot
21142     1.9577  int8inc
19280     1.7853  AllocSetReset
18211     1.6863  hash_search_with_hash_value
16285     1.5079  TransactionIdPrecedes

I also looked at the source-line-level breakdown, though that's too bulky
to post here.  The most interesting fact here is that tuple visibility
testing (MVCC) overhead is simply nonexistent: it'd be in heapgetpage()
if it were being done, which it isn't because all the pages of the table
have the PageIsAllVisible bit set.  In a previous run where those bits
weren't set but the per-tuple hint bits were, visibility testing still
only ate a percent or two of the runtime.  So the theory some people have
espoused in this thread that visibility testing is the bottleneck doesn't
hold water either.  If you go back and look at previous pgsql-hackers
discussions about that, what people have been worried about is not the CPU
cost of visibility testing but the need for indexscan queries to visit
the heap for no other purpose than to check the visibility flags.  In a
seqscan it's not going to matter.

I looked a bit more closely at the heapgettup_pagemode timing.  The
lines shown by opannotate as more than 0.1 percent of the runtime are

 22545  2.2074 :{ /* heapgettup_pagemode total: 153737 15.0528 */
  5685  0.5566 :    bool        backward = ScanDirectionIsBackward(dir);
  5789  0.5668 :        if (!scan->rs_inited)
  5693  0.5574 :            lineindex = scan->rs_cindex + 1;
 11429  1.1190 :        dp = (Page) BufferGetPage(scan->rs_cbuf);
  5693  0.5574 :        linesleft = lines - lineindex;
  5766  0.5646 :        while (linesleft > 0)
  5129  0.5022 :            lineoff = scan->rs_vistuples[lineindex];
 44461  4.3533 :            tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 11135  1.0903 :            tuple->t_len = ItemIdGetLength(lpp);
  5692  0.5573 :            if (key != NULL)
  5773  0.5653 :                HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
  5674  0.5556 :                    scan->rs_cindex = lineindex;
 11406  1.1168 :}

There doesn't seem to be a whole lot of room for improvement there.
Maybe we could shave a couple percent with some tenser coding (I'm
wondering why HeapKeyTest is being reached, in particular, when there's
no WHERE clause).  But any local changes here will be marginal at best.

One thing I did find is that the time spent in ExecProject/ExecClearTuple,
amounting to nearly 15% of the runtime, is just for evaluating the
arguments of the aggregate ... and count(*) hasn't got any arguments.
So a patch like this improves the run speed by about 15%:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a7dafeb..051e70c 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** advance_aggregates(AggState *aggstate, A
*** 480,486 ****
          TupleTableSlot *slot;

          /* Evaluate the current input expressions for this aggregate */
!         slot = ExecProject(peraggstate->evalproj, NULL);

          if (peraggstate->numSortCols > 0)
          {
--- 480,489 ----
          TupleTableSlot *slot;

          /* Evaluate the current input expressions for this aggregate */
!         if (peraggstate->evalproj)
!             slot = ExecProject(peraggstate->evalproj, NULL);
!         else
!             slot = peraggstate->evalslot;

          if (peraggstate->numSortCols > 0)
          {
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1728,1738 ****
          peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
          ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);

!         /* Set up projection info for evaluation */
!         peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
!                                                         aggstate->tmpcontext,
!                                                         peraggstate->evalslot,
!                                                         NULL);

          /*
           * If we're doing either DISTINCT or ORDER BY, then we have a list of
--- 1731,1744 ----
          peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
          ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);

!         /* Set up projection info for evaluation, if agg has any args */
!         if (aggrefstate->args)
!             peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
!                                                             aggstate->tmpcontext,
!                                                             peraggstate->evalslot,
!                                                             NULL);
!         else
!             peraggstate->evalproj = NULL;

          /*
           * If we're doing either DISTINCT or ORDER BY, then we have a list of

bringing the oprofile results to

samples  %        symbol name
181660   17.9017  heapgettup_pagemode
138049   13.6040  advance_transition_function
102865   10.1368  advance_aggregates
80948     7.9770  ExecProcNode
79943     7.8780  heap_getnext
73384     7.2316  ExecScan
60607     5.9725  MemoryContextReset
53889     5.3105  ExecStoreTuple
46666     4.5987  SeqNext
40535     3.9945  ExecAgg
33481     3.2994  int8inc
32202     3.1733  heapgetpage
26068     2.5689  AllocSetReset
18493     1.8224  hash_search_with_hash_value
8679      0.8553  LWLockAcquire
6615      0.6519  ExecSeqScan
6583      0.6487  LWLockRelease
3928      0.3871  hash_any
3715      0.3661  ReadBuffer_common

(note that this, not the stock code, is what corresponds to the 7.5sec
runtime I quoted above --- it's about 8.5sec without that change).

At this point what we've got is 25% of the runtime in nodeAgg.c overhead,
and it's difficult to see how to get any real improvement without tackling
that.  Rather than apply the patch shown above, I'm tempted to think about
hard-wiring COUNT(*) as a special case in nodeAgg.c such that we don't go
through advance_aggregates/advance_transition_function at all, but just
increment a counter directly.  However, that would very clearly be
optimizing COUNT(*) and nothing else.  Given the opinions expressed
elsewhere in this thread that heavy reliance on COUNT(*) represents
bad application design, I'm not sure that such a patch would meet with
general approval.

Actually the patch shown above is optimizing COUNT(*) and nothing else,
too, since it's hard to conceive of any other zero-argument aggregate.

Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
I don't think any of the previous discussion in this thread is on-point
at all, except for the parts where people suggested avoiding it.

            regards, tom lane

Re: Slow count(*) again...

From
Mladen Gogala
Date:
david@lang.hm wrote:
> On Tue, 12 Oct 2010, Mladen Gogala wrote:
>
>
>> david@lang.hm wrote:
>>
>>> from a PR point of view, speeding up the trivil count(*) case could be
>>> worth it, just to avoid people complaining about it not being fast.
>>>
>>>
>>>
>> Fixing PR stuff is not the approach that I would take. People are complaining
>> about select count(*) because they're using it in all the wrong places.
>>
>
> that may be the case, but if it's possible to make it less painful it will
> mean more people use postgres, both because it works better for them when
> they are using the suboptimal programs, but also because when people do
> their trivial testing of databases to decide which one they will use, they
> won't rule out postgres because "it's so slow"
>
>

There is no free lunch. If the count field is maintained somewhere, the
concurrency will suffer. I find the idea of fixing the "count delusion"
ridiculous, may Richard Dawkins forgive me for this pun. Saying that
something is slow without testing it and giving it proper consideration
is ridiculous. As a DBA, I usually get complaints like "the database is
slow today" 3 times before lunch, every day. The database is never slow;
the database is a warehouse where you keep your data. What is slow is the
access to the data, and that is done by, guess what, the application
program. Almost always, it's the application that's slow, not the
database.

As for the "select count(*)" idiom, what are you trying to do? Where are
you using it? If you are using it for pagination, consider the
possibility of not specifying the number of pages on the website, just
the links "next -->" and "prev <--". Alternatively, you can fetch a small
amount into the web page and direct the users who would like to see the
complete information to a background reporting tool. Mixing batch reports
and online reports in this way is a very easy thing to do. If you are
using it to establish existence, you're doing it wrong.

I had a problem like that this morning. A developer came to me with the
usual phrase that the "database is slow". It was a PHP form which should
write an output file and let the user know where the file is. The
function looks like this:

function put_xls($sth) {
    global $FNAME;
    $FNAME=$FNAME.".xls";
    $lineno=0;
    $ncols=$sth->FieldCount();
    for ($i = 0; $i < $ncols; $i++) {
        $cols[$i] = $sth->FetchField($i);
        $colnames[$i]=$cols[$i]->name;
    }
    $workbook = new Spreadsheet_Excel_Writer("/software$FNAME");
    $format_bold =& $workbook->addFormat();
    $format_bold->setBold();
    $format_bold->setAlign('left');
    $format_left =& $workbook->addFormat();
    $format_left->setAlign('left');
    $worksheet = & $workbook->addWorksheet('Moreover Search');
    $worksheet->writeRow($lineno++,0,$colnames,$format_bold);
    while($row=$sth->FetchRow()) {
        $worksheet->writeRow($lineno++,0,$row,$format_left);
    }
    $workbook->close();
    $cnt=$sth->Recordcount();
    return($cnt);
}

The relevant includes are here:

require ('Date.php');
require ('adodb5/tohtml.inc.php');
require_once ('adodb5/adodb.inc.php');
require_once ('adodb5/adodb-exceptions.inc.php');
require_once 'Spreadsheet/Excel/Writer.php';
$ADODB_FETCH_MODE = ADODB_FETCH_NUM;

So, what is the problem here? Why was the "database slow"? As it turns
out, the PEAR module for writing Excel spreadsheets, which is the tool
used here, creates the entire spreadsheet in memory and writes it out
on the "close" command. What was spinning was the "httpd" process; the
database was completely and utterly idle, twiddling its thumbs and
awaiting orders. Using "fputcsv" instead made the function fly. The only
thing that was lost was the bold column titles. Changing little things
can result in big performance gains. Making "select count(*)"
unnaturally fast would just be catering to bad programming practices. I
am not sure that this is a desirable development. You can't expect people
to adjust the database software to your application; applications are
always database specific. Writing an application that will access a
PostgreSQL database is not the same as writing an application that will
access an Oracle database.
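
To come back to the existence check for a moment: [NOT] EXISTS lets the
planner stop at the first matching row instead of counting them all. A
minimal sketch (the table and column names here are made up):

-- counts every matching row just to learn "is there at least one?"
SELECT count(*) FROM orders WHERE customer_id = 42;

-- stops at the first matching row
SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 42);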

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-12 19:07, Tom Lane wrote:
> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
> I don't think any of the previous discussion in this thread is on-point
> at all, except for the parts where people suggested avoiding it.
>

I would have to say that although it is nice to get count(*) faster, I
think your testing is way too simple.

It pretty much proves that, in terms of the code involved in the
count(*) process, there is not much to be achieved. But your table
has way too little payload. As PG currently works, it will start pushing
data off to TOAST when the tuple size reaches about 1KB,
and the speed of count(*) is very much dominated by the amount
of "dead weight" it has to drag in along with the heap access for the
row when scanning the table. Creating a case where the table is this
slim is (in my view) very much at the extreme small end.

Just having 32 bytes of "payload" would more or less double
your time to count, if I read your test results correctly, and in the
situation where disk access is needed, way more.

Dividing pg_relation_size by the number of tuples in our production
system, I end up with no average tuple size less than 100 bytes.
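
That is, roughly, per table (the table name is a placeholder, and
reltuples is only as fresh as the last VACUUM/ANALYZE):

SELECT pg_relation_size('some_table')
       / NULLIF((SELECT reltuples FROM pg_class
                 WHERE relname = 'some_table'), 0) AS avg_bytes_per_row;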

Without having complete insight: a visibility map that could be used in
conjunction with indexes would solve that. What the cost of maintaining
it would be is also a factor.

Jesper

--
Jesper

Re: Slow count(*) again...

From
Tom Lane
Date:
Jesper Krogh <jesper@krogh.cc> writes:
> On 2010-10-12 19:07, Tom Lane wrote:
>> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.

> Just having 32 bytes of "payload" would more or less double
> your time to count, if I read your test results correctly, and in the
> situation where disk access is needed, way more.

> Dividing pg_relation_size by the number of tuples in our production
> system, I end up with no average tuple size less than 100 bytes.

Well, yeah.  I deliberately tested with a very narrow table so as to
stress the per-row CPU costs as much as possible.  With any wider table
you're just going to be I/O bound.

> .. without having complete insigt.. a visibillity map that could be used in
> conjunction with indices would solve that. What the cost would be
> of maintaining it is also a factor.

I'm less than convinced that that approach will result in a significant
win.  It's certainly not going to do anything to convert COUNT(*) into
an O(1) operation, which frankly is what the complainants are expecting.
There's basically no hope of solving the "PR problem" without somehow
turning COUNT(*) into a materialized-view reference.  We've discussed
that in the past, and know how to do it in principle, but the complexity
and distributed overhead are daunting.

            regards, tom lane

Re: Slow count(*) again...

From
"Pierre C"
Date:
> suggest that 99% instances of the "select count(*)" idiom are probably
> bad use of the SQL language.

Well, suppose you paginate results. If the user sees that the search query
returns 500 pages, there are two options:

- you're Google, and your sorting algorithms are so good that the answer
the user wants is in the first page
- or the user will refine his search by entering more keywords to get a
manageable result set

So, in both cases, the count(*) was useless anyway. And the slowest ones
are the most useless, since the user will immediately discard the result
and refine his query.

If your full text search is slow, try Xapian or Lucene.
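
And for plain next/prev links no count is needed at all: fetch one row
more than the page size and show "next" only if that extra row comes
back. A rough sketch (table, ordering and page size are made up):

SELECT *
FROM   search_results
ORDER  BY rank DESC, id
LIMIT  51 OFFSET 200;   -- 50-row pages; a 51st row back means there is a next page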

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Tuesday 12 October 2010 07:19:57 you wrote:
> >> The biggest single problem with "select count(*)" is that it is
> >> seriously overused. People use that idiom to establish existence, which
> >> usually leads to a performance disaster in the application using it,
> >> unless the table has no more than few hundred records. SQL language, of
> >> which PostgreSQL offers an excellent implementation,  offers [NOT]
> >> EXISTS clause since its inception in the Jurassic era. The problem is
> >> with the sequential scan, not with counting. I'd even go as far as to
> >> suggest that 99% instances of the "select count(*)" idiom are probably
> >> bad use of the SQL language.
> >
> > I agree, I have seen many very bad examples of using count(*). I will go
> so far as to question the use of count(*) in my examples here. Is there
> > a better way to come up with a page list than using count(*)? What is
> > the best method to make a page of results and a list of links to other
> > pages of results? Am I barking up the wrong tree here?
>
> One way I have dealt with this on very large tables is to cache the
> count(*) at the application level (using memcached, terracotta, or
> something along those lines) and then increment that cache whenever you
> add a row to the relevant table.  On application restart that cache is
> re-initialized with a regular old count(*).  This approach works really
> well and all large systems in my experience need caching in front of the
> DB eventually.  If you have a simpler system with say a single
> application/web server you can simply store the value in a variable, the
> specifics would depend on the language and framework you are using.

I use this method whenever possible. I talked about it in my first post.
I generally keep a table around that I call counts. It has many rows that
store count numbers from frequently used views.
The one case that I can't do anything about is where you have no control
over the WHERE clause (or where there may simply be too many options to
count everything ahead of time without making things even slower). That
is the point of this entire thread, or was... ;)
-Neil-


>
> Another more all-DB approach is to create a statistics tables into which
> you place aggregated statistics rows (num deleted, num inserted, totals,
> etc) at an appropriate time interval in your code.  So you have rows
> containing aggregated statistics information for the past and some tiny
> portion of the new data happening right now that hasn't yet been
> aggregated.  Queries then look like a summation of the aggregated values
> in the statistics table plus a count(*) over just the newest portion of
> the data table and are generally very fast.
>
> Overall I have found that once things get big the layers of your app
> stack start to blend together and have to be combined in clever ways to
> keep speed up.  Postgres is a beast but when you run into things it
> can't do well just find a way to cache it or make it work together with
> some other persistence tech to handle those cases.

Re: Slow count(*) again...

From
Mladen Gogala
Date:
Pierre C wrote:
>>
>
> Well, suppose you paginate results. If the user sees that the search query
> returns 500 pages, there are two options :
>
With Google, I usually lose patience on page 3. All that I, as an
end user, need to know is whether there are more than 10 pages. The
fact that there are 1776 pages in the result set is not particularly
useful to me. I couldn't care less whether the number of returned pages
is 1492, 1776 or 1861; I'm going to look at, at most, the first 5 of them.

> - you're google, and your sorting algorithms are so good that the answer
> the user wants is in the first page
> - or the user will refine his search by entering more keywords to get a
> manageable result set
>
> So, in both cases, the count(*) was useless anyway. And the slowest ones
> are the most useless, since the user will immediately discard the result
> and refine his query.
>
> If your full text search is slow, try Xapian or Lucene.
>
>
May I also recommend Sphinx? It's a very nice text search engine, with
the price equal to that of Lucene.


--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Tuesday 12 October 2010 08:39:19 Dan Harris wrote:
>   On 10/11/10 8:02 PM, Scott Carey wrote:
> > would give you a 1MB read-ahead.  Also, consider XFS and its built-in
> > defragmentation.  I have found that a longer lived postgres DB will get
> > extreme file fragmentation over time and sequential scans end up mostly
> > random.  On-line file defrag helps tremendously.
>
> We just had a corrupt table caused by an XFS online defrag.  I'm scared
> to use this again while the db is live.  Has anyone else found this to
> be safe?  But, I can vouch for the fragmentation issue, it happens very
> quickly in our system.
>
> -Dan

I would like to know the details of what was going on that caused your
problem. I have been using XFS for over 9 years, and it has never caused any
trouble at all in a production environment. Sure, I had many problems with it
on the test bench, but in most cases the issues were very clear and easy to
avoid in production. There were some (older) XFS tools that caused some
problems, but that is in the past, and as time goes on, it seems to take
less and less planning to make it work properly.
-Neil-

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Tuesday 12 October 2010 14:35:01 you wrote:
> > suggest that 99% instances of the "select count(*)" idiom are probably
> >
> >> bad use of the SQL language.
>
> Well, suppose you paginate results. If the user sees that the search query
> returns 500 pages, there are two options :
>
> - you're google, and your sorting algorithms are so good that the answer
> the user wants is in the first page
> - or the user will refine his search by entering more keywords to get a
> manageable result set
>
> So, in both cases, the count(*) was useless anyway. And the slowest ones
> are the most useless, since the user will immediately discard the result
> and refine his query.
>
> If your full text search is slow, try Xapian or Lucene.

I guess I have to comment here again and point out that while I am having
this issue with text searches, I avoid using count(*) in such cases; I
just use next and previous links. The real problem (for me) is when
someone searches a date or time range. My application keeps track of huge
amounts of realtime transactional data, so an administrator might want a
report as to what some data point did yesterday between 3 and 4 PM. Under
normal conditions the range of records that match can be between 0 and
over 5,000. This is really killing me, especially when the reporting
people want a list of how many transactions each of the points in a given
zipcode had this morning between 8 and 9 AM; it takes about 5 minutes to
run on a server that has enough ram to hold the entire table!

Pseudo query:
Show how many transactions per node in zipcode 92252 between 8:00 and 9:00
today:

 point_number | number_of_transactions
--------------+------------------------
        65889 |                     31
        34814 |                   4865
        28349 |                      0
         3358 |                    364
 ...

24 total rows, > 5 minutes.

Then they want every node to be a link to a list of the actual data
within the specified timeframe.
This is where I have to do the same query twice: first to find out how
many rows there are for the page links, then again to get a page of
results.
Sure, I could keep tables around that have numbers by the hour, minute,
day or whatever to cache results for speed, but then the problem is that
when the data is put into the server there are so many statistics tables
to update that the front end becomes a huge problem. Also, it makes for a
huge mess of tables to think about when I need to make a report.

-Neil-



Re: Slow count(*) again...

From
Dan Harris
Date:
  On 10/12/10 4:33 PM, Neil Whelchel wrote:
> On Tuesday 12 October 2010 08:39:19 Dan Harris wrote:
>>    On 10/11/10 8:02 PM, Scott Carey wrote:
>>> would give you a 1MB read-ahead.  Also, consider XFS and its built-in
>>> defragmentation.  I have found that a longer lived postgres DB will get
>>> extreme file fragmentation over time and sequential scans end up mostly
>>> random.  On-line file defrag helps tremendously.
>> We just had a corrupt table caused by an XFS online defrag.  I'm scared
>> to use this again while the db is live.  Has anyone else found this to
>> be safe?  But, I can vouch for the fragmentation issue, it happens very
>> quickly in our system.
>>
>> -Dan
> I would like to know the details of what was going on that caused your
> problem. I have been using XFS for over 9 years, and it has never caused any
> trouble at all in a production environment. Sure, I had many problems with it
> on the test bench, but in most cases the issues were very clear and easy to
> avoid in production. There were some (older) XFS tools that caused some
> problems, but that is in the past, and as time goes on, it seems to take less and
> less planning to make it work properly.
> -Neil-
>
There were roughly 50 transactions/sec going on at the time I ran it.
xfs_db reported 99% fragmentation before it ran ( we haven't been
running it via cron ).  The operation completed in about 15 minutes (
360GB of used data on the file system ) with no errors.  Everything
seemed fine until the next morning when a user went to query a table and
we got a message about a "missing" file inside the pg cluster.  We were
unable to query the table at all via psql.  It was a bit of a panic
situation so we restored that table from backup immediately and the
problem was solved without doing more research.

This database has been running for years with no problem ( and none
since ), that was the first time I tried to do an on-line defrag and
that was the only unusual variable introduced into the system at that
time, so it was a strong enough correlation for me to believe that it
caused the problem.  Hopefully this was just a corner case..

-Dan


Re: Slow count(*) again...

From
Robert Haas
Date:
On Tue, Oct 12, 2010 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
> I don't think any of the previous discussion in this thread is on-point
> at all, except for the parts where people suggested avoiding it.

I kind of hope that index-only scans help with this, too.  If you have
a wide table and a narrow (but not partial) index, and if the
visibility map bits are mostly set, it ought to be cheaper to read the
index than the table - certainly in the case where any disk I/O is
involved, and maybe even if it isn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Sunday 10 October 2010 21:15:56 Neil Whelchel wrote:

> Right now, I am building a test machine with two dual core Intel processors
> and two 15KRPM mirrored hard drives, 1 GB ram. I am using a small amount of
> ram because I will be using small test tables. I may do testing in the
> future with more ram and bigger tables, but I think I can accomplish what
> we are all after with what I have. The machine will be limited to running
> the database server in test, init, bash, and ssh, no other processes will
> be running except for what is directly involved with testing. I will post
> exact specs when I post test results. I will create some test tables, and
> the same tables will be used in all tests. Suggestions for optimal
> Postgres and system  configuration are welcome. I will try any suggested
> settings that I have time to test. -Neil-
>

Ok the test machine is up and running:
A few more details: the hard drives are SCSI Ultra-320, and the CPUs are
2.8 GHz with a 533 MHz FSB. I wanted to make a more memory-cramped machine
to keep the table
to RAM ratio closer to the production machines, but for now, all I have are
1GB DDRs, and the machine requires pairs, so total memory is 2GB. Swap is
turned off.

The data I will be using is a couple of days of raw data from a production
system. The columns of interest are numeric and timestamp. I will use the
exact same data for all tests.

                           Table "public.log"
      Column      |            Type             |       Modifiers
------------------+-----------------------------+------------------------
 batch_id         | integer                     |
 t_stamp          | timestamp without time zone | not null default now()
 raw_data         | numeric                     |
 data_value       | numeric                     |
 data_value_delta | numeric                     |
 journal_value    | numeric                     |
 journal_data     | numeric                     |
 machine_id       | integer                     | not null
 group_number     | integer                     |
Indexes:
    "log_idx" btree (group_number, batch_id)
    "log_oid_idx" btree (oid)
    "log_t_stamp" btree (t_stamp)

The initial test is with XFS with write barriers turned on; this makes for
very slow writes. The point of the first test is to get a baseline of
everything out-of-the-box. So, here are the numbers:

Insert the data into one table:
crash:~# time psql -U test test -q < log.sql
real    679m43.678s
user    1m4.948s
sys     13m1.893s

crash:~# echo 3 > /proc/sys/vm/drop_caches
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    0m11.812s
user    0m0.000s
sys     0m0.004s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    0m3.737s
user    0m0.000s
sys     0m0.000s

As can be seen here, the cache helps..
And the numbers are not all that bad, so let's throw a sabot into the gears:
crash:~# time psql -U test test -c "UPDATE log SET raw_data=raw_data+1"
UPDATE 10050886

real    14m13.802s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    3m32.757s
user    0m0.000s
sys     0m0.000s

Just to be sure:
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    2m38.631s
user    0m0.000s
sys     0m0.000s

It looks like cache knocked about a minute off, still quite sad.
So, I shutdown Postgres, ran xfs_fsr, and started Postgres:
crash:~# echo 3 > /proc/sys/vm/drop_caches
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    1m36.304s
user    0m0.000s
sys     0m0.000s

So it seems that defragmentation knocked another minute off:
Let's see how much cache helps now:
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    1m34.873s
user    0m0.000s
sys     0m0.000s

Not much... And we are a long way from the 3.7 seconds with a freshly inserted
table. Maybe the maid can help here.
crash:~# time psql -U test test -c "VACUUM log;"
VACUUM

real    22m31.931s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    1m30.927s
user    0m0.000s
sys     0m0.000s

Nope...
So, possible conclusions are:
1. Even with VACUUM database table speed degrades as tables are updated.
2. Time testing on a freshly INSERTed table gives results that are not real-
world.
3. Filesystem defragmentation helps (some).
4. Cache only makes a small difference once a table has been UPDATEd.

I am going to leave this configuration running for the next day or so. This
way I can try any suggestions and play with any more ideas that I have.
I will try these same tests on ext4 later, along with any good suggested
tests.
I will try MySQL with the same data with both XFS and ext4.
-Neil-

Re: Slow count(*) again...

From
Mark Kirkwood
Date:
On 13/10/10 19:47, Neil Whelchel wrote:
>
> Nope...
> So, possible conclusions are:
> 1. Even with VACUUM database table speed degrades as tables are updated.
> 2. Time testing on a freshly INSERTed table gives results that are not real-
> world.
> 3. Filesystem defragmentation helps (some).
> 4. Cache only makes a small difference once a table has been UPDATEd.
>
> I am going to leave this configuration running for the next day or so. This
> way I can try any suggestions and play with any more ideas that I have.
> I will try these same tests on ext4 later, along with any good suggested
> tests.
> I will try MySQL with the same data with both XFS and ext4.
> -Neil-
>
>

I think the major effect you are seeing here is that the UPDATE has
made the table twice as big on disk (even after VACUUM etc), and it has
gone from fitting in ram to not fitting in ram - so it cannot be
effectively cached anymore.

This would not normally happen in real life (assuming UPDATEs only
modify a small part of a table per transaction). However, administrative
updates (e.g. 'oh! - ref 1 should now be ref 2, please update
everything') *will* cause the table size to double.

This is an artifact of Postgres's non-overwriting storage manager;
MySQL will update in place and you will not see this.

Try VACUUM FULL on the table and retest.
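
To confirm the bloat, comparing the relation size before and after the
UPDATE should show roughly a doubling, e.g.:

SELECT pg_size_pretty(pg_relation_size('log'));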

regards

Mark

Re: Slow count(*) again...

From
"Pierre C"
Date:
> I guess I have to comment here again and point out that while I am
> having this
> issue with text searches, I avoid using count(*) in such cases, I just
> use
> next and previous links.

Unfortunately sometimes you got to do an ORDER BY on search results, and
then all the rows got to be read...

> Where the real problem (for me) is that when someone
> searches a date or time range. My application keeps track of huge

Have you tried CLUSTER ?

Also, it is sad to say, but if you need an engine able to use index-only
scans which would fit this type of query, replicate the table to MyISAM.
Unfortunately, the MySQL optimizer is really not so smart about complex
reporting queries (no hash joins, no hash aggregates) so if you don't have
a multicolumn index covering that you can use for index-only scan in your
query, you'll get either a really huge sort or a really nasty nested loop
index scan...
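
For the CLUSTER suggestion above, using the timestamp index from the
schema posted earlier, it would be something like:

CLUSTER log USING log_t_stamp;

Note that CLUSTER takes an exclusive lock and rewrites the table, so it
has to be scheduled around the realtime inserts.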

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Wednesday 13 October 2010 00:19:26 Mark Kirkwood wrote:
> On 13/10/10 19:47, Neil Whelchel wrote:
> > Nope...
> > So, possible conclusions are:
> > 1. Even with VACUUM database table speed degrades as tables are updated.
> > 2. Time testing on a freshly INSERTed table gives results that are not
> > real- world.
> > 3. Filesystem defragmentation helps (some).
> > 4. Cache only makes a small difference once a table has been UPDATEd.
> >
> > I am going to leave this configuration running for the next day or so.
> > This way I can try any suggestions and play with any more ideas that I
> > have. I will try these same tests on ext4 later, along with any good
> > suggested tests.
> > I will try MySQL with the same data with both XFS and ext4.
> > -Neil-
>
> I think that major effect you are seeing here is that the UPDATE has
> made the table twice as big on disk (even after VACUUM etc), and it has
> gone from fitting in ram to not fitting in ram - so cannot be
> effectively cached anymore.
>
> This would not normally happen in real life (assuming UPDATEs only
> modify a small part of a table  per transaction). However administration
> updates (e.g 'oh! -  ref 1 should now be ref 2 please update
> everything') *will* cause the table size to double.
>
> This is an artifact of Postgres's non overwriting storage manager -
> Mysql will update in place and you will not see this.
>
> Try VACUUM FULL on the table and retest.
>
> regards
>
> Mark

There seems to be a lot of discussion about VACUUM FULL and its problems.
The overall buzz seems to be that VACUUM FULL is a bad idea (I could be
wrong here). It has been some time since I have read the changelogs, but
I seem to remember that there have been some major changes to VACUUM FULL
recently. Maybe this needs to be revisited in the documentation.

crash:~# time psql -U test test -c "VACUUM FULL log;"
VACUUM

real    4m49.055s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    0m9.665s
user    0m0.000s
sys     0m0.004s

A huge improvement from the minute and a half before the VACUUM FULL.
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)

real    0m3.786s
user    0m0.000s
sys     0m0.000s

And the cache helps...
So, we are right back to within about 50ms of where we started after
INSERTing the data, but it took a VACUUM FULL to accomplish this (by
making the table fit in RAM).
This is a big problem on a production machine as the VACUUM FULL is likely to
get in the way of INSERTing realtime data into the table.

So to add to the conclusion pile:
5. When you have no control over the WHERE clause, which may send count(*)
through more rows of a table than would fit in RAM, your performance will
be too slow; count(*) is missing a LIMIT feature to avoid this.
6. Keep tables that are to be updated frequently as narrow as possible: Link
them to wider tables to store the columns that are less frequently updated.

So with our conclusion pile so far we can deduce that if we were to keep all
of our data in two column tables (one to link them together, and the other to
store one column of data), we stand a much better chance of making the entire
table to be counted fit in RAM, so we simply apply the WHERE clause to a
specific table as opposed to a column within a wider table... This seems to
defeat the entire goal of the relational database...

-Neil-

Re: Slow count(*) again...

From
Mladen Gogala
Date:
  On 10/13/2010 2:47 AM, Neil Whelchel wrote:
> Even with VACUUM database table speed degrades

What the heck is the "database table speed"? Tables don't do anything.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Mladen Gogala
Date:
  On 10/13/2010 3:19 AM, Mark Kirkwood wrote:
> I think that major effect you are seeing here is that the UPDATE has
> made the table twice as big on disk (even after VACUUM etc), and it has
> gone from fitting in ram to not fitting in ram - so cannot be
> effectively cached anymore.
>
In the real world, tables are larger than the available memory. I have
tables of several hundred gigabytes in size. Tables shouldn't be
"effectively cached", the next step would be to measure "buffer cache
hit ratio", tables should be effectively used.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Mark Kirkwood
Date:
On 13/10/10 21:38, Neil Whelchel wrote:
>
> So with our conclusion pile so far we can deduce that if we were to keep all
> of our data in two column tables (one to link them together, and the other to
> store one column of data), we stand a much better chance of making the entire
> table to be counted fit in RAM, so we simply apply the WHERE clause to a
> specific table as opposed to a column within a wider table... This seems to
> defeat the entire goal of the relational database...
>
>

That is a bit excessive I think - a more reasonable conclusion to draw
is that tables bigger than ram will drop to IO max speed to scan, rather
than DIMM max speed...

There are things you can do to radically improve IO throughput - e.g. a
pair of AMC or ARECA 12-slot RAID cards set up as RAID 10 and tuned
properly should give you a max sequential throughput of something like
12*100 MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be
able to be counted by Postgres in about 3-4 seconds...

This assumes a more capable machine than you are testing on I suspect.

Cheers

Mark

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Wednesday 13 October 2010 01:50:23 Mark Kirkwood wrote:
> On 13/10/10 21:38, Neil Whelchel wrote:
> > So with our conclusion pile so far we can deduce that if we were to keep
> > all of our data in two column tables (one to link them together, and the
> > other to store one column of data), we stand a much better chance of
> > making the entire table to be counted fit in RAM, so we simply apply the
> > WHERE clause to a specific table as opposed to a column within a wider
> > table... This seems to defeat the entire goal of the relational
> > database...
>
> That is a bit excessive I think - a more reasonable conclusion to draw
> is that tables bigger than ram will drop to IO max speed to scan, rather
> than DIMM max speed...
>
> There are things you can do to radically improve IO throughput - e.g a
> pair of AMC or ARECA 12 slot RAID cards setup RAID 10 and tuned properly
> should give you a max sequential throughput of something like 12*100
> MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be able to
> be counted by Postgres in about 3-4 seconds...
>
> This assumes a more capable machine than you are testing on I suspect.
>
> Cheers
>
> Mark
The good ol' brute force approach! I knew I'd see this one sooner or
later, though I was not sure if I was going to see the 16TB of RAM
suggestion first.
Seriously though, as the title of this thread suggests, everything is
relative. Sure, count(*) and everything else will work faster with more
system power. It just seems to me that count(*) is slower than it could
be, given a set of conditions. I started this thread because I think that
there must be a better way to count matches from an INDEXed column than
shoving the entire table through RAM (including columns that you are not
interested in at the minute). And even worse, when you have no
(reasonable) control of the WHERE clause, there is nothing preventing
your system from thrashing for the next week because somebody put in
criteria that matched a few TB of records, and there is no way to LIMIT
count(*) other than externally timing the query and aborting it if it
takes too long. What is needed is a way to determine how many rows are
likely to match a given WHERE clause so we can cut off useless queries,
but we need a fast count(*) for that, or a limit on the existing one... I
seem to remember saying something about an index driven estimate(*) at
one point...
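
(The closest thing that exists today seems to be the planner's own
numbers, e.g. for the whole table:

SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 'log';

and for an arbitrary WHERE clause, the row estimate printed by EXPLAIN,
with all the accuracy caveats that implies.)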

I might go as far as to rattle the cage of the developers to see if it makes
any sense to add some column oriented storage capability to Postgres. That
would be the hot ticket to be able to specify an attribute on a column so that
the back end could shadow or store a column in a column oriented table so
aggregate functions could work on them with good efficiency, or is that an
INDEX?

Since the thread has started, I have had people ask about different system
configurations, especially the filesystem (XFS, ext4...). I have never tested
ext4, and since we are all involved here, I thought that I could do so and
share my results for others, that is why I got into time testing stuff.
Time testing count(*) in my later postings is really not the point, as
count is simply dragging the entire table off of the RAID through RAM; I
could use any other function, like max()... No, that can narrow down its
scan with an INDEX...
Ok, sum(), there we go!

-Neil-


Re: Slow count(*) again...

From
Vitalii Tymchyshyn
Date:
12.10.10 14:44, Craig Ringer написав(ла):
>
>> in the case where you are doing a count(*) where query and the where is
>> on an indexed column, could the search just look at the index + the
>> visibility mapping rather than doing an sequential search through the
>> table?
>
> Nope, because the visibility map, which is IIRC only one bit per page,
> doesn't record how many tuples there are on the page, or enough
> information about them to determine how many of them are visible to
> the current transaction*.
I'd say it can tell you that you don't need to recheck a given tuple,
can't it? You still have to count all the index tuples and recheck the
ones that are uncertain. Does it work that way? This could help a lot for
wide tuples in the table, with a narrow index and mostly read-only data.

Best regards, Vitalii Tymchyshyn

Re: Slow count(*) again...

From
Vitalii Tymchyshyn
Date:
12.10.10 21:58, Tom Lane написав(ла):
>
> I'm less than convinced that that approach will result in a significant
> win.  It's certainly not going to do anything to convert COUNT(*) into
> an O(1) operation, which frankly is what the complainants are expecting.
> There's basically no hope of solving the "PR problem" without somehow
> turning COUNT(*) into a materialized-view reference.  We've discussed
> that in the past, and know how to do it in principle, but the complexity
> and distributed overhead are daunting.
>
>
I've thought about "aggregate" indexes, something like
create index index_name on table_name(count(*) group by column1, column2);
OR
create index index_name on table_name(count(*));
for a table-wide count.

To make it usable one would need:
1) A third aggregate support function, say SMERGE, that can merge one
aggregate state into another.
2) The index would be a regular index (e.g. btree) on column1, column2
that, for each pair, keeps the list of pages its data may belong to (for
past/currently running transactions), plus an aggregate state for each
page that was frozen previously.
When the index is used, it can take the precalculated values for "pages
with all tuples vacuumed" (I suspect this is the information in the
visibility map) and do the regular calculation, with visibility checks
and everything that's needed, for the non-frozen pages.
When vacuum processes a page, it should (synchronously or asynchronously)
calculate the aggregate values for that page.

IMHO such indexes would make materialized views/triggers/high-level
caches unneeded in most cases.

Best regards, Vitalii Tymchyshyn


Re: Slow count(*) again...

From
Craig Ringer
Date:
On 13/10/2010 12:38 AM, Jesper Krogh wrote:

> If some clever postgres hacker could teach postgres to allocate blocks
> using posix_fallocate in quite large batches, say .. something like:
> fallocate(min(current_relation_size *0.1,1073741824))

There doesn't seem to be any use of posix_fallocate in the sources, at
least according to git grep. The patch that introduced posix_fadvise use
apparently had posix_fallocate in it, but that use appears to have been
removed down the track.

It's worth noting that posix_fallocate sucks if your file system doesn't
have intelligent support for it. IIRC it's horrible on ext3, where it can
take a while to return while it allocates (and IIRC zeroes!) all those
blocks. This may be part of why it's not used. In past testing with
posix_fallocate for other tools I've also found rather mixed performance
results - it can slow things down rather than speed them up, depending
on the file system in use and all sorts of other factors.

If Pg was to use posix_fallocate, it'd probably need control over it on
a per-tablespace basis.

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

Re: Slow count(*) again...

From
Greg Smith
Date:
Dan Harris wrote:
> I'm not sure how to figure out what version of XFS we're on.. but it's
> Linux kernel 2.6.24-24 x86_64 on Ubuntu Server 8.04.3.  Postgres
> version 8.3

There's the kernel side support that matches your kernel, as well as the
xfsprogs package.  The latter is where a lot of distributions haven't
kept up with upstream changes, and where I suspect the defragmenter bug
you ran into is located.

Hardy ships with 2.9.4-2:  http://packages.ubuntu.com/hardy/xfsprogs

The work incorporating a more stable XFS into RHEL started with xfsprogs
3.0.1-6 going into Fedora 11, and 3.1.X would represent a current
release.  So your Ubuntu kernel is two major improvement releases
behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really
got going again making that code modern and solid.  Ubuntu Lucid
switched to 3.1.0, RHEL6 will probably ship 3.1.0 too.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Mladen Gogala
Date:
  On 10/13/2010 8:12 AM, Greg Smith wrote:
> The work incorporating a more stable XFS into RHEL started with xfsprogs
> 3.0.1-6 going into Fedora 11, and 3.1.X would represent a current
> release.  So your Ubuntu kernel is two major improvement releases
> behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really
> got going again making that code modern and solid.  Ubuntu Lucid
> switched to 3.1.0, RHEL6 will probably ship 3.1.0 too.
>

I am afraid that my management will not let me use anything that doesn't
exist as an RPM package in the current Red Hat distribution. No Ubuntu,
no Fedora, no manual linking.  There will always be that ominous
question: how many other companies are using XFS? From the business
perspective, questions like that make perfect sense.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com


Re: Slow count(*) again...

From
Greg Smith
Date:
Mladen Gogala wrote:
> I am afraid that my management will not let me use anything that
> doesn't exist as a RPM package in the current Red Hat distribution. No
> Ubuntu, no Fedora, no manual linking.  There will always be that
> ominous question: how many other companies are using XFS? From the
> business perspective, questions like that make perfect sense.

XFS support is available as an optional module starting in RHEL 5.5.  In
CentOS, you just grab it, so that's what I've been doing.  My
understanding is that you may have to ask your sales rep to enable
access to it under the official RedHat Network channels if you're using
a subscription from them.  I'm not sure exactly what the support
situation is with it, but it's definitely available as an RPM from RedHat.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Robert Haas
Date:
On Wed, Oct 13, 2010 at 4:38 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
> There seems to be a lot of discussion about VACUUM FULL and its problems. The
> overall buzz seems to be that VACUUM FULL is a bad idea (I could be wrong
> here). It has been some time since I have read the changelogs, but I seem to
> remember that there have been some major changes to VACUUM FULL recently.
> Maybe this needs to be re-visited in the documentation.

In 9.0, VACUUM FULL does something similar to what CLUSTER does.  This
is a much better idea than what it did in 8.4 and prior.

> crash:~# time psql -U test test -c "VACUUM FULL log;"
> VACUUM
>
> real    4m49.055s
> user    0m0.000s
> sys     0m0.000s
>
> crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
>  count
> ----------
>  10050886
> (1 row)
>
> real    0m9.665s
> user    0m0.000s
> sys     0m0.004s
>
> A huge improvement from the minute and a half before the VACUUM FULL.

This is a very surprising result that I would like to understand
better.  Let's assume that your UPDATE statement bloated the table by
2x (you could use pg_relation_size to find out exactly; the details
probably depend on fillfactor which you might want to lower if you're
going to do lots of updates).  That ought to mean that count(*) has to
grovel through twice as much data, so instead of taking 9 seconds it
ought to take 18 seconds.  Where the heck is the other 1:12 going?
This might sort of make sense if the original table was laid out
sequentially on disk and the updated table was not, but how and why
would that happen?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slow count(*) again...

From
Robert Haas
Date:
On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
> I might go as far as to rattle the cage of the developers to see if it makes
> any sense to add some column oriented storage capability to Postgres. That
> would be the hot ticket to be able to specify an attribute on a column so that
> the back end could shadow or store a column in a column oriented table so
> aggregate functions could work on them with good efficiency, or is that an
> INDEX?

I'd love to work on that, but without funding it's tough to find the
time.  It's a big project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slow count(*) again...

From
"Kevin Grittner"
Date:
Neil Whelchel <neil.whelchel@gmail.com> wrote:

> crash:~# time psql -U test test -c "UPDATE log SET
> raw_data=raw_data+1"
> UPDATE 10050886
>
> real    14m13.802s
> user    0m0.000s
> sys     0m0.000s
>
> crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
>   count
> ----------
>  10050886
> (1 row)
>
> real    3m32.757s
> user    0m0.000s
> sys     0m0.000s
>
> Just to be sure:
> crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
>   count
> ----------
>  10050886
> (1 row)
>
> real    2m38.631s
> user    0m0.000s
> sys     0m0.000s
>
> It looks like cache knocked about a minute off

That's unlikely to be caching, since you just updated the rows.
It's much more likely to be one or both of rewriting the rows as you
read them to set hint bits or competing with autovacuum.

The large increase after the update probably means you went from a
table which was fully cached to something larger than the total
cache.
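
One way to separate those effects on the test box is to let VACUUM write
the hint bits before the timed run and keep autovacuum out of it, roughly
(the per-table autovacuum reloption needs 8.4 or later):

ALTER TABLE log SET (autovacuum_enabled = false);  -- just for the test run
UPDATE log SET raw_data = raw_data + 1;
VACUUM log;                   -- sets hint bits outside the timed query
SELECT count(*) FROM log;     -- timing now mostly reflects the scan itself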

-Kevin

Re: Slow count(*) again...

From
Tom Lane
Date:
Neil Whelchel <neil.whelchel@gmail.com> writes:
> Insert the data into one table:
> crash:~# time psql -U test test -q < log.sql
> real    679m43.678s
> user    1m4.948s
> sys     13m1.893s

> crash:~# echo 3 > /proc/sys/vm/drop_caches
> crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
>   count
> ----------
>  10050886
> (1 row)

> real    0m11.812s
> user    0m0.000s
> sys     0m0.004s

> crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
>   count
> ----------
>  10050886
> (1 row)

> real    0m3.737s
> user    0m0.000s
> sys     0m0.000s

> As can be seen here, the cache helps..

That's probably got little to do with caching and everything to do with
setting hint bits on the first SELECT pass.

I concur with Mark's question about whether your UPDATE pushed the table
size across the limit of what would fit in RAM.

            regards, tom lane

Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-13 15:28, Robert Haas wrote:
> On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel<neil.whelchel@gmail.com>  wrote:
>
>> I might go as far as to rattle the cage of the developers to see if it makes
>> any sense to add some column oriented storage capability to Postgres. That
>> would be the hot ticket to be able to specify an attribute on a column so that
>> the back end could shadow or store a column in a column oriented table so
>> aggregate functions could work on them with good efficiency, or is that an
>> INDEX?
>>
> I'd love to work on that, but without funding it's tough to find the
> time.  It's a big project.
>
Is it hugely different from just getting the visibility map suitable
for doing index-only scans and extracting values from the index
directly, as Heikki has explained?

That would essentially give you a column-oriented table (the index
itself) for a specific column (or column set).

... still a huge task though.

--
Jesper

Re: Slow count(*) again...

From
Alex Hunsaker
Date:
On Wed, Oct 13, 2010 at 07:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Neil Whelchel <neil.whelchel@gmail.com> writes:

> I concur with Mark's question about whether your UPDATE pushed the table
> size across the limit of what would fit in RAM.

Yeah, you said you have ~2GB of ram; just counting the bytes and the
number of rows (not including padding or overhead) puts you around
~670MB.  Some quick testing here on a 64-bit box:

=> create table log (batch_id int, t_stamp timestamp without time zone
not null default now(), raw_data numeric, data_value numeric,
data_value_delta numeric, journal_value numeric, journal_data numeric,
machine_id integer not null, group_number integer) with oids;
CREATE TABLE
Time: 34.310 ms

=> insert into log (batch_id, data_value, data_value_delta,
journal_value, journal_data, group_number, machine_id, raw_data)
select 1, 1, 1, 1, 1, 1, 1, 1 from generate_series(1, 10050886);
INSERT 0 10050886
Time: 32818.529 ms

=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 969 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 65805.741 ms

=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
  1939 MB

=> SELECT count(*) from log;
  count
----------
 10050886
(1 row)

Time: 11181.005 ms

=> SELECT count(*) from log;
  count
----------
 10050886
(1 row)

Time: 2825.569 ms

This box has ~6GB ram.


BTW did anyone else hear the below in a Valeris voice?
> And the numbers are not all that bad, so let's throw a sabot into the gears:
> crash:~# time psql -U test test -c "UPDATE log SET raw_data=raw_data+1"

Re: Slow count(*) again...

From
Alex Hunsaker
Date:
On Wed, Oct 13, 2010 at 02:38, Neil Whelchel <neil.whelchel@gmail.com> wrote:

> And the cache helps...
> So, we are right back to within about 50ms of where we started after INSERTing the
> data, but it took a VACUUM FULL to accomplish this (by making the table fit in
> RAM).
> This is a big problem on a production machine as the VACUUM FULL is likely to
> get in the way of INSERTing realtime data into the table.

Right, but the real point is how often do you plan on mass updating
the table?  That's (hopefully) the only time a vacuum full should be
needed.  Otherwise (auto) vacuum will probably work most of the time.

> 6. Keep tables that are to be updated frequently as narrow as possible: Link
> them to wider tables to store the columns that are less frequently updated.

Again I don't think it's updated frequently so much as mass updated.  I
run some databases here that have tens to hundreds of updates every
second.  The difference is I don't update *all* 26 million rows at the
same time that often.  But if I did, I'd probably want to lower the
fillfactor.

For example:
=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 59387.021 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1939 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 70549.425 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 2909 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 78551.544 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 3879 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 74443.945 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB


Here you see basically linear growth, after some vacuuming:

=> VACUUM log;
VACUUM
Time: 193055.857 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

=> VACUUM log;
VACUUM
Time: 38281.541 ms
whopper=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

=> VACUUM log;
VACUUM
Time: 28.531 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

Hey... it's not shrinking it at all...:
=> VACUUM verbose log;
INFO:  vacuuming "public.log"
INFO:  "log": found 0 removable, 0 nonremovable row versions in 31 out
of 620425 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 2511 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.01u sec elapsed 0.01 sec.
INFO:  vacuuming "pg_toast.pg_toast_10544753"
INFO:  index "pg_toast_10544753_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_10544753": found 0 removable, 0 nonremovable row
versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 29.070 ms

-- ok lets start over and this time set fillfactor to 50;
=> alter table log set (fillfactor = 50);
=> vacuum full log;
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB

-- 2x the default size, lets see what an update does now
=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 70424.752 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB

-- hey ! same size

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 58112.895 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB
(1 row)

-- Still the same

So in short... vacuum seems to fall flat with mass updates; set a
lower fillfactor :).

> So with our conclusion pile so far we can deduce that if we were to keep all
> of our data in two column tables (one to link them together, and the other to
> store one column of data), we stand a much better chance of making the entire
> table to be counted fit in RAM,

I dunno about that... Seems like if you only had 2 tables both would
fail to fit in ram fairly quickly :)

> so we simply apply the WHERE clause to a
> specific table as opposed to a column within a wider table... This seems to
> defeat the entire goal of the relational database...

Sure... that's one answer.  See
http://wiki.postgresql.org/wiki/Slow_Counting for more.  But the basic
ideas are:
1) estimate the count
2) use triggers and keep the count somewhere else
3) keep it in ram
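
For what it's worth, rough sketches of 1) and 2) against the log table; the
side table, function, and trigger names here are just made up for
illustration:

-- 1) Read the planner's estimate instead of counting.  reltuples is only
--    as fresh as the last VACUUM/ANALYZE, so treat the number as approximate.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'log';

-- 2) Keep an exact count in a one-row side table, maintained by triggers.
--    Note that the single row becomes a serialization point for writers.
CREATE TABLE log_rowcount (n bigint NOT NULL);
INSERT INTO log_rowcount SELECT count(*) FROM log;

CREATE FUNCTION log_rowcount_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE log_rowcount SET n = n + 1;
        RETURN NEW;
    ELSE  -- DELETE
        UPDATE log_rowcount SET n = n - 1;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_rowcount_maint
    AFTER INSERT OR DELETE ON log
    FOR EACH ROW EXECUTE PROCEDURE log_rowcount_trig();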

Re: Slow count(*) again...

From
david@lang.hm
Date:
On Wed, 13 Oct 2010, Tom Lane wrote:

> Neil Whelchel <neil.whelchel@gmail.com> writes:
>
> That's probably got little to do with caching and everything to do with
> setting hint bits on the first SELECT pass.
>
> I concur with Mark's question about whether your UPDATE pushed the table
> size across the limit of what would fit in RAM.

Neil, can you just double the size of your initial test to make sure that
it's too large to fit in ram to start with?
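
One quick way to do that, assuming the log table's remaining columns have
defaults and that raw_data accepts plain integers (which may not match the
actual test schema):

-- roughly double the ~10M rows by appending another synthetic batch
INSERT INTO log (raw_data)
SELECT g
FROM generate_series(1, 10000000) AS g;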

David Lang

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Wednesday 13 October 2010 05:33:28 Mladen Gogala wrote:
>   On 10/13/2010 8:12 AM, Greg Smith wrote:
> > The work incorporating a more stable XFS into RHEL started with xfsprogs
> > 3.0.1-6 going into Fedora 11, and 3.1.X would represent a current
> > release.  So your Ubuntu kernel is two major improvement releases
> > behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really
> > got going again making that code modern and solid.  Ubuntu Lucid
> > switched to 3.1.0, RHEL6 will probably ship 3.1.0 too.
>
> I am afraid that my management will not let me use anything that doesn't
> exist as a RPM package in the current Red Hat distribution. No Ubuntu,
> no Fedora, no manual linking.  There will always be that ominous
> question: how many other companies are using XFS? From the business
> perspective, questions like that make perfect sense.

XFS sees extensive use in the billing departments of many phone and utility
companies. Maybe not the code that you see in Linux, but the on-disk format,
which I think is unchanged since its original release. (You can use the modern
XFS code in Linux to mount a filesystem from an older SGI machine that used
XFS.) The code in Linux is based on the code that SGI released some time in
2000, which worked at that time very well for the SGI machine. At the time
that SGI came up with XFS, they had realtime in mind. They added specific
features to the filesystem to guarantee IO at a specific rate; this was
intended for database and other realtime applications. I have not looked at
the Linux version to see if it contains these extensions. I will be doing this
soon, however, as my next big project will require a true realtime system.
-Neil-

Re: Slow count(*) again...

From
Neil Whelchel
Date:
On Wednesday 13 October 2010 06:27:34 you wrote:
> On Wed, Oct 13, 2010 at 4:38 AM, Neil Whelchel <neil.whelchel@gmail.com>
wrote:
> > There seems to be a lot of discussion about VACUUM FULL, and its
> > problems. The overall buzz seems to be that VACUUM FULL is a bad idea (I
> > could be wrong here). It has been some time since I have read the
> > changelogs, but I seem to remember that there have been some major
> > changes to VACUUM FULL recently. Maybe this needs to be re-visited in
> > the documentation.
>
> In 9.0, VACUUM FULL does something similar to what CLUSTER does.  This
> is a much better idea than what it did in 8.4 and prior.
>
> > crash:~# time psql -U test test -c "VACUUM FULL log;"
> > VACUUM
> >
> > real    4m49.055s
> > user    0m0.000s
> > sys     0m0.000s
> >
> > crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
> >  count
> > ----------
> >  10050886
> > (1 row)
> >
> > real    0m9.665s
> > user    0m0.000s
> > sys     0m0.004s
> >
> > A huge improvement from the minute and a half before the VACUUM FULL.
>
> This is a very surprising result that I would like to understand
> better.  Let's assume that your UPDATE statement bloated the table by
> 2x (you could use pg_relation_size to find out exactly; the details
> probably depend on fillfactor which you might want to lower if you're
> going to do lots of updates).  That ought to mean that count(*) has to
> grovel through twice as much data, so instead of taking 9 seconds it
> ought to take 18 seconds.  Where the heck is the other 1:12 going?
> This might sort of make sense if the original table was laid out
> sequentially on disk and the updated table was not, but how and why
> would that happen?
This is likely due to the table not fitting in memory before the VACUUM FULL.
I am glad that you suggested using pg_relation_size; I somehow didn't think of
it at the time. I will redo the test and publish the results of
pg_relation_size.
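For reference, a query along these lines shows both numbers at once
(pg_relation_size is the heap alone, while pg_total_relation_size also
counts indexes and TOAST):

SELECT pg_size_pretty(pg_relation_size('log'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('log')) AS heap_plus_indexes_and_toast;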
-Neil-

Re: Slow count(*) again...

From
Mark Kirkwood
Date:
On 13/10/10 21:44, Mladen Gogala wrote:
>  On 10/13/2010 3:19 AM, Mark Kirkwood wrote:
>> I think that major effect you are seeing here is that the UPDATE has
>> made the table twice as big on disk (even after VACUUM etc), and it has
>> gone from fitting in ram to not fitting in ram - so cannot be
>> effectively cached anymore.
>>
> In the real world, tables are larger than the available memory. I have
> tables of several hundred gigabytes in size. Tables shouldn't be
> "effectively cached", the next step would be to measure "buffer cache
> hit ratio", tables should be effectively used.
>
Sorry Mladen,

I didn't mean to suggest that all tables should fit into ram... but was
pointing out (one reason) why Neil would expect to see a different
sequential scan speed after the UPDATE.

I agree that in many interesting cases, tables are bigger than ram [1].

Cheers

Mark

[1] Having said that, these days 64GB of ram is not unusual for a
server... and we have many real customer databases smaller than this
where I work.

Re: Slow count(*) again...

From
Mark Kirkwood
Date:
On 13/10/10 23:16, Neil Whelchel wrote:
>
> The good ol' brute force approach! I knew I'd see this one sooner or later.
> Though I was not sure if I was going to see the 16TB of RAM suggestion first.
> Seriously though, as the title of this thread suggests, everything is
> relative. Sure count(*) and everything else will work faster with more system
> power. It just seems to me that count(*) is slower than it could be given a
> set of conditions....
>
> Since the thread has started, I have had people ask about different system
> configurations, especially the filesystem (XFS, ext4...). I have never tested
> ext4, and since we are all involved here, I thought that I could do so and
> share my results for others; that is why I got into time testing stuff.
> Time testing count(*) in my later postings is really not the point, as count is
> simply dragging the entire table off of the RAID through RAM; I could use any
> other function, like max()... No, that can narrow down its scan with an INDEX...
> Ok, sum(), there we go!
>
>
>

Well in some (quite common) use cases, the queries cannot be known in
advance, and the tables are considerably bigger than ram... this makes
the fast IO a good option - sometimes better (and in the end cheaper)
than trying to maintain every conceivable covering index.

Of course it would be great if Postgres could use the indexes alone to
execute certain queries - we may see some of that capability in the next
few releases (keep an eye on messages concerning the 'Visibility Map').

regards

Mark


Re: Slow count(*) again...

From
Robert Haas
Date:
On Wed, Oct 13, 2010 at 1:59 PM, Jesper Krogh <jesper@krogh.cc> wrote:
> On 2010-10-13 15:28, Robert Haas wrote:
>>
>> On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel<neil.whelchel@gmail.com>
>>  wrote:
>>
>>>
>>> I might go as far as to rattle the cage of the developers to see if it
>>> makes
>>> any sense to add some column oriented storage capability to Postgres.
>>> That
>>> would be the hot ticket to be able to specify an attribute on a column so
>>> that
>>> the back end could shadow or store a column in a column oriented table so
>>> aggregate functions could work on them with good efficiency, or is that
>>> an
>>> INDEX?
>>>
>>
>> I'd love to work on that, but without funding it's tough to find the
>> time.  It's a big project.
>>
>
> Is it hugely different from just getting the visibility map suitable
> for doing index-only scans and extracting values from the index
> directly as Heikki has explained?

I think that there's a lot more to a real column-oriented database
than index-only scans, although, of course, index-only scans are very
important.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slow count(*) again...

From
"mark"
Date:
Could this be an interesting test use of https://www.fossexperts.com/ ?

'Community' driven proposal - multiple people / orgs agree to pay various
portions? Maybe with multiple funders a reasonable target fund amount could
be reached.

Just throwing around ideas here.


Mark

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Wednesday, October 13, 2010 7:29 AM
To: Neil Whelchel
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Slow count(*) again...

On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel <neil.whelchel@gmail.com>
wrote:
> I might go as far as to rattle the cage of the developers to see if it makes
> any sense to add some column oriented storage capability to Postgres. That
> would be the hot ticket to be able to specify an attribute on a column so that
> the back end could shadow or store a column in a column oriented table so
> aggregate functions could work on them with good efficiency, or is that an
> INDEX?

I'd love to work on that, but without funding it's tough to find the
time.  It's a big project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-14 06:22, mark wrote:
> Could this be an interesting test use of https://www.fossexperts.com/ ?
>
> 'Community' driven proposal - multiple people / orgs agree to pay various
> portions? Maybe with multiple funders a reasonable target fund amount could
> be reached.
>
I might convince my boss to chip in... but how do we get the task
up there? Should we find someone to give an estimate first?

--
Jesper

Re: Slow count(*) again...

From
Robert Haas
Date:
On Thu, Oct 14, 2010 at 12:22 AM, mark <dvlhntr@gmail.com> wrote:
> Could this be an interesting test use of https://www.fossexperts.com/ ?
>
> 'Community' driven proposal - multiple people / orgs agree to pay various
> portions? Maybe with multiple funders a reasonable target fund amount could
> be reached.
>
> Just throwing around ideas here.

This is a bit off-topic, but as of now, they're only accepting
proposals for projects to be performed by CommandPrompt itself.  So
that doesn't help me much (note the sig).

But in theory it's a good idea.  Of course, when and if they open it
up, then what?  If more than one developer or company is interested in
a project, who determines who gets to do the work and get paid for it?
 If that determination is made by CommandPrompt itself, or if it's
just a free-for-all to see who can get their name on the patch that
ends up being committed, it's going to be hard to get other
people/companies to take it very seriously.

Another problem is that even when they do open it up, they apparently
intend to charge 7.5 - 15% of the contract value as a finder's fee.
That's a lot of money.  For a $100 project it's totally reasonable,
but for a $10,000 project it's far more expensive than the value of
the service they're providing can justify.  (Let's not even talk about
a $100,000 project.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-14 21:56, Robert Haas wrote:
> On Thu, Oct 14, 2010 at 12:22 AM, mark<dvlhntr@gmail.com>  wrote:
>
>> Could this be an interesting test use of https://www.fossexperts.com/ ?
>>
>> 'Community' driven proposal - multiple people / orgs agree to pay various
>> portions? Maybe with multiple funders a reasonable target fund amount could
>> be reached.
>>
>> Just throwing around ideas here.
>>
> This is a bit off-topic, but as of now, they're only accepting
> proposals for projects to be performed by CommandPrompt itself.  So
> that doesn't help me much (note the sig).
>
> But in theory it's a good idea.  Of course, when and if they open it
> up, then what?  If more than one developer or company is interested in
> a project, who determines who gets to do the work and get paid for it?
>   If that determination is made by CommandPrompt itself, or if it's
> just a free-for-all to see who can get their name on the patch that
> ends up being committed, it's going to be hard to get other
> people/companies to take it very seriously.
>
Couldn't you open up a dialog about it?
> Another problem is that even when they do open it up, they apparently
> intend to charge 7.5 - 15% of the contract value as a finder's fee.
> That's a lot of money.  For a $100 project it's totally reasonable,
> but for a $10,000 project it's far more expensive than the value of
> the service they're providing can justify.  (Let's not even talk about
> a $100,000 project.)
>

Hi Robert.

I can definitely see your arguments, but you failed to describe
a "better" way.

Many of us rely heavily on PostgreSQL and would like to get "this
feature", but sponsoring it all alone does not seem like a viable option
(just a guess), taking into consideration that we don't even have an
estimate of how big it is. I saw the estimate of 15K USD for the "ALTER
column position" description, and the visibility map is most likely in
the "same ballpark" (from my perspective).

So in order to get something like a visibility map (insert your favorite
big feature here), you have these options:

* Sponsor it all by yourself. (It is most likely going to be too big; or,
  if it is the center of your applications, then you definitely turn to an
  RDBMS that has supported it for a longer time, if you can.)
* Wait for someone else to sponsor it all by themselves. (That happens
  occasionally, but for particular features it is hard to see when and
  what, and the actual sponsor would still have the dilemma in the first
  point.)
* Hack it yourselves. (Many of us have neither the time nor the skills to
  do it, and my employer actually wants me to focus on the stuff that
  brings the most direct value for my time, which is a category hacking PG
  does not fall into when the business is about something totally else.)
* A kind of micro-sponsoring like the above?
* Your proposal here?

To me, the 4th bullet point looks like the most viable so far.

To be honest, whether it is EDB, Redpill, Command Prompt, 2nd Quadrant or
whoever ends up doing the job is, seen from this perspective, not
important; just that it ends up in the hands of someone "capable" of doing
it... although Heikki has done some work on this task already.

Preferably I would like to get it coordinated by the PG project itself,
but I can see that it is really hard to do that kind of stuff. And you
would still face the challenge of who should end up doing the thing.

Jesper .. dropped Joshua Drake on CC; he might have given all of this some
seconds of thought already.

--
Jesper


Re: Slow count(*) again...

From
Devrim GÜNDÜZ
Date:
On Wed, 2010-10-13 at 09:02 -0400, Greg Smith wrote:

> XFS support is available as an optional module starting in RHEL 5.5.
> In CentOS, you just grab it, so that's what I've been doing.  My
> understanding is that you may have to ask your sales rep to enable
> access to it under the official RedHat Network channels if you're
> using a subscription from them.  I'm not sure exactly what the support
> situation is with it, but it's definitely available as an RPM from
> RedHat.

Right. It is called "Red Hat Scalable File System", and once paid, it is
available via RHN.
--
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz


Re: Slow count(*) again...

From
Mladen Gogala
Date:
Neil Whelchel wrote:
>
>
> That is why I suggested an estimate(*) that works like (a faster) count(*)
> except that it may be off a bit. I think that is what he was talking about
> when he wrote this.
>
>
The main problem with "select count(*)" is that it gets seriously
mis-used.  Using "select count(*)" to establish existence is bad for
performance and for code readability.
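
To illustrate the existence case (the "device" column is just made up):

-- counts every matching row only to learn "is there at least one?":
SELECT count(*) > 0 FROM log WHERE device = 42;

-- can stop at the first matching row:
SELECT EXISTS (SELECT 1 FROM log WHERE device = 42);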

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Slow count(*) again...

From
Mladen Gogala
Date:
Jon Nelson wrote:
>
> Well, I didn't quite mean that - having no familiarity with Oracle I
> don't know what the alter system statement does, but I was talking
> specifically about the linux buffer and page cache.
>

Those are not utilized by Oracle.  This is a RAC instance, running on
top of ASM, which is an Oracle volume manager, using raw devices. There
is no file system on those disks:

SQL> select file_name from dba_data_files
  2  where tablespace_name='ADBASE_DATA';

FILE_NAME
--------------------------------------------------------------------------------
+DGDATA/stag3/datafile/adbase_data.262.727278257
+DGDATA/stag3/datafile/adbase_data.263.727278741
+DGDATA/stag3/datafile/adbase_data.264.727280145
+DGDATA/stag3/datafile/adbase_data.265.727280683

[oracle@lpo-oracle-30 ~]$ $ORA_CRS_HOME/bin/crs_stat -l
NAME=ora.STAG3.STAG31.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.STAG3.STAG32.inst
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.STAG3.db
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-30.ASM1.asm
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.LISTENER_LPO-ORACLE-30.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.gsd
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.ons
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-30.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-30

NAME=ora.lpo-oracle-31.ASM2.asm
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.LISTENER_LPO-ORACLE-31.lsnr
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.gsd
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.ons
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31

NAME=ora.lpo-oracle-31.vip
TYPE=application
TARGET=ONLINE
STATE=ONLINE on lpo-oracle-31



The only way to flush cache is the aforementioned "alter system"
command. AFAIK, Postgres doesn't have anything like that. Oracle uses
raw devices precisely to avoid double buffering.

--

Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions




Re: Slow count(*) again...

From
Greg Smith
Date:
Jesper Krogh wrote:
> To be honest, whether it is EDB, Redpill, Command Prompt, 2nd Quadrant or
> whoever ends up doing the job is, seen from this perspective, not
> important; just that it ends up in the hands of someone "capable" of doing
> it... although Heikki has done some work on this task already.

Now you're closing in on why this is a touchy subject.  Heikki has
already done work here funded by EDB.  As such, the idea of anyone else
being put in charge of fund raising and allocation for this particular
feature would be a political mess.  While it would be nice if there was
a completely fair sponsorship model for developing community PostgreSQL
features, overseen by a benevolent, free, and completely unaffiliated
overlord, we're not quite there yet.  In cases like these, where there's
evidence a company with a track record of delivering features is already
involved, you're probably better off contacting someone from there
directly--rather than trying to fit that into the public bounty model
some PostgreSQL work is getting done via lately.  The visibility map is
a particularly troublesome one, because the list of "capable" people who
could work on that, but who aren't already working at a company having
some relations with EDB, is rather slim.

I know that's kind of frustrating to hear, for people who would like to
get a feature done but can't finance the whole thing themselves.  But
look on the bright side--the base price is free, and when you give most
PostgreSQL companies money to work on something it's at least possible
to get what you want done.  You'd have to pay a whole lot more than the
$15K number you threw out there before any of the commercial database
vendors would pay any attention to your particular feature request.

--
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"    Pre-ordering at:
https://www.packtpub.com/postgresql-9-0-high-performance/book


Re: Slow count(*) again...

From
Bruce Momjian
Date:
bricklen wrote:
> On Sat, Oct 9, 2010 at 4:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
> > Maybe an
> > estimate(*) that works like count but gives an answer from the index without
> > checking visibility? I am sure that this would be good enough to make a page
> > list, it is really no big deal if it errors on the positive side, maybe the
> > list of pages has an extra page off the end. I can live with that. What I
> > can't live with is taking 13 seconds to get a page of results from 850,000
> > rows in a table.
> > -Neil-
> >
>
> FWIW, Michael Fuhr wrote a small function to parse the EXPLAIN plan a
> few years ago and it works pretty well assuming your stats are up to
> date.
>
> http://markmail.org/message/gknqthlwry2eoqey

What I recommend is to execute the query with EXPLAIN, and look at the
estimated rows and costs.  If the row number is large, just round it to
the nearest thousand and return it to the application as a count ---
this is what Google does for searches (just try it).

If the row count/cost are low, run the query and return an exact count.
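
A minimal sketch of that idea, in the same spirit as the function bricklen
linked; it scrapes the top-level rows= estimate out of the plain-text
EXPLAIN output (the log table reference in the usage line is only an
example):

CREATE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    rec record;
    n   bigint;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- the first plan line carries the planner's row estimate
        n := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN n IS NOT NULL;
    END LOOP;
    RETURN n;
END;
$$ LANGUAGE plpgsql;

-- usage: SELECT count_estimate('SELECT 1 FROM log WHERE raw_data > 10');

Whether to round the estimate and return it directly, or to fall back to a
real count(*) when the estimate is small, is then an application-side
decision.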

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 12, 2010, at 11:58 AM, Tom Lane wrote:

> Jesper Krogh <jesper@krogh.cc> writes:
>> On 2010-10-12 19:07, Tom Lane wrote:
>>> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
>
>> Just having 32 bytes of "payload" would more or less double
>> your time to count, if I read your test results correctly... and in the
>> situation where disk access would be needed, way more.
>
>> Dividing pg_relation_size by the amount of tuples in our production
>> system, I end up having no avg tuple size less than 100 bytes.
>
> Well, yeah.  I deliberately tested with a very narrow table so as to
> stress the per-row CPU costs as much as possible.  With any wider table
> you're just going to be I/O bound.


On a wimpy disk, I/O bound for sure.   But my disks go 1000MB/sec.  No query can go fast enough for them.  The best
I've gotten is 800MB/sec, on a wide row (average 800 bytes).  Most tables go 300MB/sec or so.  And with 72GB of RAM,
many scans are in-memory anyway.

A single SSD with supercapacitor will go about 500MB/sec by itself next spring.   I will easily be able to build a
system with 2GB/sec I/O for under $10k.



Re: Slow count(*) again...

From
Jesper Krogh
Date:
On 2010-10-21 06:47, Scott Carey wrote:
> On a wimpy disk, I/O bound for sure.   But my disks go 1000MB/sec.
> No query can go fast enough for them.  The best I've gotten is
> 800MB/sec, on a wide row (average 800 bytes).  Most tables go
> 300MB/sec or so.  And with 72GB of RAM, many scans are in-memory
> anyway.

Is it cpu or io bound while doing it?
Can you scan it faster using time cat relation-oid.* > /dev/null

> A single SSD with supercapacitor will go about 500MB/sec by itself
> next spring.   I will easily be able to build a system with 2GB/sec
> I/O for under $10k.

What filesystem are you using? Readahead?
Can you try to check the filesystem fragmentation of the table using filefrag?

-- 
Jesper



Re: Slow count(*) again...

From
Scott Carey
Date:
On Oct 21, 2010, at 11:13 AM, Jesper Krogh wrote:

> On 2010-10-21 06:47, Scott Carey wrote:
> > On a wimpy disk, I/O bound for sure.   But my disks go 1000MB/sec.
> > No query can go fast enough for them.  The best I've gotten is
> > 800MB/sec, on a wide row (average 800 bytes).  Most tables go
> > 300MB/sec or so.  And with 72GB of RAM, many scans are in-memory
> > anyway.
>
> Is it cpu or io bound while doing it?
I/O bound with the fio benchmark tool if 16K blocks or greater, CPU bound with 8K blocks or smaller.  CentOS 5.5.
CPU bound with postgres.


> Can you scan it faster using time cat relation-oid.* > /dev/null
>

I'm not sure what you mean.  In psql, select * piped to /dev/null is VERY CPU bound because of all the formatting.  I
haven't toyed with COPY.   Do you mean the actual files?  'dd' tests from actual files are similar to fio, but not as
consistent and hard to add concurrency.  That is faster than postgres.

> > A single SSD with supercapacitor will go about 500MB/sec by itself
> > next spring.   I will easily be able to build a system with 2GB/sec
> > I/O for under $10k.
>
> What filesystem are you using? Readahead?
> Can you try to check the filesystem fragmentation of the table using filefrag?
XFS, defragmented once a day.  Readahead 40960 (20MB, 1MB per spindle).  Two raid 10 arrays of 10 discs each (2 hot
spare), software raid-0 tying those together (md, 1MB blocks).  Two Adaptec 5805 (or 5085, the external SAS one).  A
third raid card for the OS/xlog with 4x10krpm sas drives internal.

Fragmentation quickly takes this down a lot, as do small files and concurrent activity, since it's only enough spindles
for ~2000 iops.  But it's almost all large reporting queries on partitioned tables (500,000 partitions).   A few smaller
tables are starting to cause too many seeks, so those might end up on a smaller, high-iops tablespace later.

Over time the disks have filled up and there is a significant slowdown in sequential transfer at the end of the
partition -- 600MB/sec max.  That is still CPU bound on most scans, but postgres can go that fast on some scans.

Off topic:
Another interesting feature is how this setup causes the system tables to bloat by factors of 2x to 8x each week, and
requires frequent vacuum full + reindex on several of them, else they become 1.5GB in size.  Nothing like lots of temp
table work + hour-long concurrent transactions to make the system catalog bloat.  I suppose with 8.4 many temp tables
could be replaced using WITH queries, but in other cases analyzing a temp table is the only way to get a sane query
plan.


> --
> Jesper