Thread: Slow count(*) again...
I know that there have been many discussions on the slowness of count(*), even when an index is involved, because the visibility of the rows has to be checked. In the past I have seen many suggestions about using triggers and tables to keep track of counts, and while this works fine in a situation where you know what the report is going to be ahead of time, it is simply not an option when an unknown WHERE clause is to be used (dynamically generated). I ran into a fine example of this when I was searching this mailing list: "Searching in 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL 8.3." Obviously at some point count(*) came into play here, because the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a list of pages from search results, and the biggest time killer here is the count(*) portion. Even worse, I sometimes have to hit the database with two SELECT statements: one with OFFSET and LIMIT to get the page of results I need, and another to get the total number of rows so I can estimate how many pages of results are available. The point I am driving at here is that since building a list of pages of results is such a common thing to do, there needs to be some specific high-speed way to do this in one query. Maybe an estimate(*) that works like count but gives an answer from the index without checking visibility? I am sure that this would be good enough to make a page list; it is really no big deal if it errs on the positive side, maybe the list of pages has an extra page off the end. I can live with that. What I can't live with is taking 13 seconds to get a page of results from 850,000 rows in a table. -Neil-
On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > I know that there haven been many discussions on the slowness of count(*) even > when an index is involved because the visibility of the rows has to be > checked. In the past I have seen many suggestions about using triggers and > tables to keep track of counts and while this works fine in a situation where > you know what the report is going to be ahead of time, this is simply not an > option when an unknown WHERE clause is to be used (dynamically generated). > I ran into a fine example of this when I was searching this mailing list, > "Searching in 856,646 pages took 13.48202 seconds. Site search powered by > PostgreSQL 8.3." Obviously at some point count(*) came into play here because > the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a > list of pages from search results, and the biggest time killer here is the > count(*) portion, even worse yet, I sometimes have to hit the database with > two SELECT statements, one with OFFSET and LIMIT to get the page of results I > need and another to get the amount of total rows so I can estimate how many > pages of results are available. The point I am driving at here is that since > building a list of pages of results is such a common thing to do, there need > to be some specific high speed ways to do this in one query. Maybe an > estimate(*) that works like count but gives an answer from the index without > checking visibility? I am sure that this would be good enough to make a page > list, it is really no big deal if it errors on the positive side, maybe the > list of pages has an extra page off the end. I can live with that. What I > can't live with is taking 13 seconds to get a page of results from 850,000 > rows in a table. 99% of the time in the situations you don't need an exact measure, and assuming analyze has run recently, select rel_tuples from pg_class for a given table is more than close enough. I'm sure wrapping that in a simple estimated_rows() function would be easy enough to do.
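A minimal sketch of the estimated_rows() wrapper being suggested, with a function name and signature of our own choosing; it simply reads pg_class.reltuples, the per-table row estimate that ANALYZE maintains, so it costs one catalog lookup but is only as fresh as the last ANALYZE/VACUUM and knows nothing about any WHERE clause:

CREATE OR REPLACE FUNCTION estimated_rows(tablename text) RETURNS bigint AS $$
    -- reltuples is a float4 estimate kept up to date by ANALYZE/VACUUM.
    -- Matching on relname ignores schemas; good enough for a sketch.
    SELECT reltuples::bigint
    FROM pg_class
    WHERE relname = tablename AND relkind = 'r';
$$ LANGUAGE sql STABLE;

-- usage: SELECT estimated_rows('my_table');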
Neil Whelchel wrote: > I know that there haven been many discussions on the slowness of count(*) even > when an index is involved because the visibility of the rows has to be > checked. In the past I have seen many suggestions about using triggers and > tables to keep track of counts and while this works fine in a situation where > you know what the report is going to be ahead of time, this is simply not an > option when an unknown WHERE clause is to be used (dynamically generated). > I ran into a fine example of this when I was searching this mailing list, > "Searching in 856,646 pages took 13.48202 seconds. Site search powered by > PostgreSQL 8.3." Obviously at some point count(*) came into play here because > the site made a list of pages (1 2 3 4 5 6 > next). I very commonly make a > list of pages from search results, and the biggest time killer here is the > count(*) portion, even worse yet, I sometimes have to hit the database with > two SELECT statements, one with OFFSET and LIMIT to get the page of results I > need and another to get the amount of total rows so I can estimate how many > pages of results are available. The point I am driving at here is that since > building a list of pages of results is such a common thing to do, there need > to be some specific high speed ways to do this in one query. Maybe an > estimate(*) that works like count but gives an answer from the index without > checking visibility? I am sure that this would be good enough to make a page > list, it is really no big deal if it errors on the positive side, maybe the > list of pages has an extra page off the end. I can live with that. What I > can't live with is taking 13 seconds to get a page of results from 850,000 > rows in a table. > -Neil- > > Unfortunately, the problem is in the rather primitive way PostgreSQL does I/O. It didn't change in 9.0 so there is nothing you could gain by upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the sequential scan process> and inspect the file /tmp/pg.out when the query finishes, you will notice a gazillion of read requests, all of them 8192 bytes in size. That means that PostgreSQL is reading the table block by block, without any merging of the requests. You can alleviate the pain by using the OS tricks, like specifying the deadline I/O scheduler in the grub.conf and set prefetch on the FS block devices by using blockdev, but there is nothing special that can be done, short of rewriting the way PostgreSQL does I/O. There were rumors about the version 9.0 and asynchronous I/O, but that didn't materialize. That is really strange to me, because PostgreSQL tables are files or groups of files, if the table size exceeds 1GB. It wouldn't be very hard to try reading 1MB at a time and that would speed up the full table scan significantly. Problem with single block I/O is that there is a context switch for each request, the I/O scheduler has to work hard to merge requests appropriately and there is really no need for that, tables are files navigating through files is not a problem, even with much larger blocks. In another database, whose name I will not mention, there is a parameter db_file_multiblock_read_count which specifies how many blocks will be read by a single read when doing a full table scan. PostgreSQL is in dire need of something similar and it wouldn't even be that hard to implement. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
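For anyone wanting to reproduce the observation, this is roughly what those commands look like on Linux; the PID and device name are placeholders, and raising readahead with blockdev is a system-wide change, so treat it as an experiment rather than a recommendation:

# watch the read() calls issued by the backend doing the sequential scan
strace -p 12345 -o /tmp/pg.out -e trace=read
# /tmp/pg.out will show one request per block, e.g. read(34, "...", 8192) = 8192

# check, then raise, the block-device readahead (value is in 512-byte sectors)
blockdev --getra /dev/sda
blockdev --setra 4096 /dev/sda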
On 10/09/2010 06:54 PM, Mladen Gogala wrote: > In another database, whose name I will not mention, there is a parameter > db_file_multiblock_read_count which specifies how many blocks will be > read by a single read when doing a full table scan. PostgreSQL is in > dire need of something similar and it wouldn't even be that hard to > implement. You're correct in that it isn't particularly difficult to implement for sequential scans. But I have done some testing with aggressive read ahead, and although it is clearly a big win with a single client, the benefit was less clear as concurrency was increased. Joe -- Joe Conway credativ LLC: http://www.credativ.us Linux, PostgreSQL, and general Open Source Training, Service, Consulting, & 24x7 Support
Joe Conway wrote: > On 10/09/2010 06:54 PM, Mladen Gogala wrote: > >> In another database, whose name I will not mention, there is a parameter >> db_file_multiblock_read_count which specifies how many blocks will be >> read by a single read when doing a full table scan. PostgreSQL is in >> dire need of something similar and it wouldn't even be that hard to >> implement. >> > > You're correct in that it isn't particularly difficult to implement for > sequential scans. But I have done some testing with aggressive read > ahead, and although it is clearly a big win with a single client, the > benefit was less clear as concurrency was increased. > > Joe > > Well, in my opinion that should be left to the DBA, the same as in the "other database". The mythical DBA, the creature that mighty Larry Ellison himself is on a crusade against, can usually figure out the right value for the database he or she is in charge of. I humbly confess to being an Oracle DBA for more than two decades, now branching into Postgres because my employer is less than enthusiastic about Oracle, with a special accent on their pricing. Modern databases, Postgres included, are quite complex, and companies need DBA personnel to help fine-tune the applications. I know that good DBA personnel are quite expensive, but without a competent DBA who knows the database software well enough, companies can and will suffer blunders with performance, downtime, lost data and the like. In a world where almost every application is written for the web, performance, uptime and user experience are of critical importance. The architects of the Postgres database would be well advised to operate under the assumption that every production database has a competent DBA keeping an eye on it. Every application has its own mix of sequential and index scans; you cannot possibly test all possible applications. Aggressive read-ahead or "multi-block reads" can be a performance problem and will complicate the optimizer, because the optimizer now has a new variable to account for: the block size, potentially making seq_page_cost even cheaper and random_page_cost even more expensive, depending on the blocking. However, slow sequential scan is, in my humble opinion, the single biggest performance problem of PostgreSQL databases and should be improved, the sooner the better. You should, however, count on the DBA personnel to help with the tuning. We're the Tinkerbells of the database world. I am 6'4", 240 LBS, no wings. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > > I know that there haven been many discussions on the slowness of count(*) > > even when an index is involved because the visibility of the rows has to > > be checked. In the past I have seen many suggestions about using > > triggers and tables to keep track of counts and while this works fine in > > a situation where you know what the report is going to be ahead of time, > > this is simply not an option when an unknown WHERE clause is to be used > > (dynamically generated). I ran into a fine example of this when I was > > searching this mailing list, "Searching in 856,646 pages took 13.48202 > > seconds. Site search powered by PostgreSQL 8.3." Obviously at some point > > count(*) came into play here because the site made a list of pages (1 2 > > 3 4 5 6 > next). I very commonly make a list of pages from search > > results, and the biggest time killer here is the count(*) portion, even > > worse yet, I sometimes have to hit the database with two SELECT > > statements, one with OFFSET and LIMIT to get the page of results I need > > and another to get the amount of total rows so I can estimate how many > > pages of results are available. The point I am driving at here is that > > since building a list of pages of results is such a common thing to do, > > there need to be some specific high speed ways to do this in one query. > > Maybe an estimate(*) that works like count but gives an answer from the > > index without checking visibility? I am sure that this would be good > > enough to make a page list, it is really no big deal if it errors on the > > positive side, maybe the list of pages has an extra page off the end. I > > can live with that. What I can't live with is taking 13 seconds to get a > > page of results from 850,000 rows in a table. > > 99% of the time in the situations you don't need an exact measure, and > assuming analyze has run recently, select rel_tuples from pg_class for > a given table is more than close enough. I'm sure wrapping that in a > simple estimated_rows() function would be easy enough to do. This is a very good approach and it works very well when you are counting the entire table, but when you have no control over the WHERE clause, it doesn't help. IE: someone puts in a word to look for in a web form. From my perspective, this issue is the biggest problem there is when using Postgres to create web pages, and it is so commonly used, I think that there should be a specific way to deal with it so that you don't have to run the same WHERE clause twice. IE: SELECT count(*) FROM <table> WHERE <clause>; to get the total amount of items to make page navigation links, then: SELECT <columns> FROM table WHERE <clause> LIMIT <items_per_page> OFFSET <(page_no-1)*items_per_page>; to get the actual page contents. It's bad enough that count(*) is slow, then you have to do it all over again to get the results you need! I have not dug into this much yet, but would it be possible to return the amount of rows that a WHERE clause would actually return if the LIMIT and OFFSET were not applied. IE: When a normal query is executed, the server returns the number of rows aside from the actual row data. Would it be a big deal to modify this to allow it to return the amount of rows before the LIMIT and OFFSET is applied as well? This would sure cut down on time it takes to do the same WHERE clause twice... 
I have considered using a cursor to do this, however this requires a transfer of all of the rows to the client to get a total count, then setting the cursor to get the rows that you are interested in. Or is there a way around this that I am not aware of? -Neil-
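One partial answer, sketched below with placeholder table and column names: a cursor's MOVE command consumes rows on the server and reports how many it skipped in its command tag (e.g. "MOVE 1290"), so the count can be obtained without shipping every row to the client. The server still executes the full query, so this saves transfer time rather than execution time:

BEGIN;
DECLARE page_cur SCROLL CURSOR FOR
    SELECT * FROM results WHERE body ILIKE '%slow count%';

-- The command tag "MOVE <n>" carries the total row count; no rows are transferred.
MOVE FORWARD ALL IN page_cur;

-- Rewind to the start of page 3 (25 rows per page) and fetch just that page.
MOVE ABSOLUTE 50 IN page_cur;
FETCH 25 FROM page_cur;

COMMIT;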
On Sat, Oct 9, 2010 at 7:44 PM, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote:
The architects of Postgres database would be well advised to operate under the assumption that every production database has a competent DBA keeping an eye on the database.
I'd actually go so far as to say that they have already made this assumption. The out of the box config needs modification for all but the most low-volume applications and postgres really benefits from having some attention paid to performance. Not only does tuning the db provide enormous gains, but it is often possible to dramatically improve query responsiveness by simply restructuring a query (assuming an aggregating query over a fairly large table with a few joins thrown in). My team does not have a competent DBA (though I've got 15+ years of experience developing on top of various dbs and certainly don't make overly naive assumptions about how things work) and the gains that we made, when I finally just sat down and read everything I could get my hands on about postgres and started reading this list, were really quite impressive. I intend to take some of the courses offered by some of the companies that are active on this list when my schedule allows in order to expand my knowledge even farther, as a DBA is a luxury we cannot really afford at the moment.
On 10/10/2010 11:02 AM, Neil Whelchel wrote: > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: >> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> > wrote: >>> I know that there haven been many discussions on the slowness of count(*) >>> even when an index is involved because the visibility of the rows has to >>> be checked. In the past I have seen many suggestions about using >>> triggers and tables to keep track of counts and while this works fine in >>> a situation where you know what the report is going to be ahead of time, >>> this is simply not an option when an unknown WHERE clause is to be used >>> (dynamically generated). I ran into a fine example of this when I was >>> searching this mailing list, "Searching in 856,646 pages took 13.48202 >>> seconds. Site search powered by PostgreSQL 8.3." Obviously at some point >>> count(*) came into play here because the site made a list of pages (1 2 >>> 3 4 5 6> next). I very commonly make a list of pages from search >>> results, and the biggest time killer here is the count(*) portion, even >>> worse yet, I sometimes have to hit the database with two SELECT >>> statements, one with OFFSET and LIMIT to get the page of results I need >>> and another to get the amount of total rows so I can estimate how many >>> pages of results are available. The point I am driving at here is that >>> since building a list of pages of results is such a common thing to do, >>> there need to be some specific high speed ways to do this in one query. >>> Maybe an estimate(*) that works like count but gives an answer from the >>> index without checking visibility? I am sure that this would be good >>> enough to make a page list, it is really no big deal if it errors on the >>> positive side, maybe the list of pages has an extra page off the end. I >>> can live with that. What I can't live with is taking 13 seconds to get a >>> page of results from 850,000 rows in a table. >> >> 99% of the time in the situations you don't need an exact measure, and >> assuming analyze has run recently, select rel_tuples from pg_class for >> a given table is more than close enough. I'm sure wrapping that in a >> simple estimated_rows() function would be easy enough to do. > > This is a very good approach and it works very well when you are counting the > entire table, but when you have no control over the WHERE clause, it doesn't > help. IE: someone puts in a word to look for in a web form. For that sort of thing, there isn't much that'll help you except visibility-aware indexes, covering indexes, etc if/when they're implemented. Even then, they'd only help when it was a simple index-driven query with no need to hit the table to recheck any test conditions, etc. I guess there could be *some* way to expose the query planner's cost estimates in a manner useful for result count estimation ... but given how coarse its stats are and how wildly out the estimates can be, I kind of doubt it. It's really intended for query planning decisions and more interested in orders of magnitude, "0, 1, or more than that" measures, etc, and seems to consider 30% here or there to be pretty insignificant most of the time. > It's bad enough that count(*) is slow, then you have to do it all over again > to get the results you need! I have not dug into this much yet, but would it > be possible to return the amount of rows that a WHERE clause would actually > return if the LIMIT and OFFSET were not applied. 
IE: When a normal query is > executed, the server returns the number of rows aside from the actual row > data. Would it be a big deal to modify this to allow it to return the amount > of rows before the LIMIT and OFFSET is applied as well? It'd force the server to fully execute the query. Then again, it sounds like you're doing that anyway. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
On 10/10/2010 9:54 AM, Mladen Gogala wrote: > Unfortunately, the problem is in the rather primitive way PostgreSQL > does I/O. It didn't change in 9.0 so there is nothing you could gain by > upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the > sequential scan process> and inspect the file /tmp/pg.out when the query > finishes, you will notice a gazillion of read requests, all of them 8192 > bytes in size. That means that PostgreSQL is reading the table block by > block, without any merging of the requests. I'd be really interested in any measurements you've done to determine the cost of this over doing reads in larger chunks. If they're properly detailed and thought out, the -hackers list is likely to be interested as well. The Linux kernel, at least, does request merging (and splitting, and merging, and more splitting) along the request path, and I'd personally expect that most of the cost of 8k requests would be in the increased number of system calls, buffer copies, etc required. Measurements demonstrating or contradicting this would be good to see. It's worth being aware that there are memory costs to doing larger reads, especially when you have many backends each of which want to allocate a larger buffer for reading. If you can use a chunk of shared_buffers as the direct destination for the read that's OK, but otherwise you're up for (1mb-8kb)*num_backends extra memory use on I/O buffers that could otherwise be used as shared_buffers or OS cache. Async I/O, too, has costs. > PostgreSQL is in > dire need of something similar and it wouldn't even be that hard to > implement. I'd really like to see both those assertions backed with data or patches ;-) Personally, I know just enough about how PG's I/O path works to suspect that "not that hard to implement" is probably a little ... over-optimistic. Sure, it's not that hard to implement in a new program with no wired-in architectural and design choices; that doesn't mean it's easy to retrofit onto existing code, especially a bunch of co-operating processes with their own buffer management. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
On Saturday 09 October 2010 23:56:15 Craig Ringer wrote: > On 10/10/2010 11:02 AM, Neil Whelchel wrote: > > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: > >> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> > > > > wrote: > >>> I know that there haven been many discussions on the slowness of > >>> count(*) even when an index is involved because the visibility of the > >>> rows has to be checked. In the past I have seen many suggestions about > >>> using triggers and tables to keep track of counts and while this works > >>> fine in a situation where you know what the report is going to be > >>> ahead of time, this is simply not an option when an unknown WHERE > >>> clause is to be used (dynamically generated). I ran into a fine > >>> example of this when I was searching this mailing list, "Searching in > >>> 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL > >>> 8.3." Obviously at some point count(*) came into play here because the > >>> site made a list of pages (1 2 3 4 5 6> next). I very commonly make a > >>> list of pages from search results, and the biggest time killer here is > >>> the count(*) portion, even worse yet, I sometimes have to hit the > >>> database with two SELECT statements, one with OFFSET and LIMIT to get > >>> the page of results I need and another to get the amount of total rows > >>> so I can estimate how many pages of results are available. The point I > >>> am driving at here is that since building a list of pages of results > >>> is such a common thing to do, there need to be some specific high > >>> speed ways to do this in one query. Maybe an estimate(*) that works > >>> like count but gives an answer from the index without checking > >>> visibility? I am sure that this would be good enough to make a page > >>> list, it is really no big deal if it errors on the positive side, > >>> maybe the list of pages has an extra page off the end. I can live with > >>> that. What I can't live with is taking 13 seconds to get a page of > >>> results from 850,000 rows in a table. > >> > >> 99% of the time in the situations you don't need an exact measure, and > >> assuming analyze has run recently, select rel_tuples from pg_class for > >> a given table is more than close enough. I'm sure wrapping that in a > >> simple estimated_rows() function would be easy enough to do. > > > > This is a very good approach and it works very well when you are counting > > the entire table, but when you have no control over the WHERE clause, it > > doesn't help. IE: someone puts in a word to look for in a web form. > > For that sort of thing, there isn't much that'll help you except > visibility-aware indexes, covering indexes, etc if/when they're > implemented. Even then, they'd only help when it was a simple > index-driven query with no need to hit the table to recheck any test > conditions, etc. Good point, maybe this is turning more into a discussion of how to generate a list of pages of results and one page of results with one query so we don't have to do the same painfully slow query twice to do a very common task. On the other hand, I copied a table out of one of my production servers that has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp, text). The first numeric column has numbers evenly spread between 0 and 100 and it is indexed. I put the table in a pair of database servers both running on the same physical hardware. 
One server is Postgres, the other is a popular server (I am not mentioning names here). on Postgres: SELECT count(*) FROM table where column>50; takes about 8 seconds to run. The other database server took less than one second (about 25 ms) as it is using the index (I assume) to come up with the results. It is true that this is not a fair test because both servers were tested with their default settings, and the defaults for Postgres are much more conservative, however, I don't think that any amount of settings tweaking will bring them even in the same ball park. There has been discussion about the other server returning an incorrect count because all of the indexed rows may not be live at the time. This is not a problem for the intended use, that is why I suggested another function like estimate(*). It's name suggests that the result will be close, not 100% correct, which is plenty good enough for generating a list of results pages in most cases. I am faced with a very serious problem here. If the query to make a list of pages takes say 6 seconds and it takes another 6 seconds to generate a page of results, the customer is waiting 12 seconds. This is not going to work. If count made a quick estimate, say less than a second, and it took 6 seconds to come up with the actual results, I could live with that. Or if coming up with the window of results via (OFFSET and LIMIT) and returned the total number of rows that would have matched the query, then I would still have everything I need to render the page in a reasonable time. I really think that this needs to be addressed somewhere. It's not like I am the only one that does this. You see it nearly everywhere a long list of results is (expected to be) returned in a web site. Among the people I work with, this seems to be the most mentioned reason that they claim that they don't use Postgres for their projects. It would be nice to see how the server comes up with the search results and list of links to pages of results for this mailing list. (http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r) I am guessing that it probably uses the count and query method I am talking about. > > I guess there could be *some* way to expose the query planner's cost > estimates in a manner useful for result count estimation ... but given > how coarse its stats are and how wildly out the estimates can be, I kind > of doubt it. It's really intended for query planning decisions and more > interested in orders of magnitude, "0, 1, or more than that" measures, > etc, and seems to consider 30% here or there to be pretty insignificant > most of the time. > > > It's bad enough that count(*) is slow, then you have to do it all over > > again to get the results you need! I have not dug into this much yet, > > but would it be possible to return the amount of rows that a WHERE > > clause would actually return if the LIMIT and OFFSET were not applied. > > IE: When a normal query is executed, the server returns the number of > > rows aside from the actual row data. Would it be a big deal to modify > > this to allow it to return the amount of rows before the LIMIT and > > OFFSET is applied as well? > > It'd force the server to fully execute the query. Then again, it sounds > like you're doing that anyway.
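One way to approximate the estimate(*) being asked for, with an arbitrary WHERE clause, is to let the planner do the guessing: run EXPLAIN and scrape the row estimate from its top line. A rough sketch (the function name and regular expression are our own, and the number is only as good as the planner's statistics, which is exactly the caveat raised above):

CREATE OR REPLACE FUNCTION estimated_count(query text) RETURNS bigint AS $$
DECLARE
    plan_line text;
    est       bigint;
BEGIN
    -- The first plan line contains the planner's row estimate, e.g. "rows=41243".
    FOR plan_line IN EXECUTE 'EXPLAIN ' || query LOOP
        est := substring(plan_line FROM 'rows=([0-9]+)')::bigint;
        EXIT WHEN est IS NOT NULL;
    END LOOP;
    RETURN est;
END;
$$ LANGUAGE plpgsql;

-- usage (placeholder names): SELECT estimated_count('SELECT * FROM results WHERE col1 > 50');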
2010/10/10 Neil Whelchel <neil.whelchel@gmail.com>
On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote:
> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com>
wrote:
> > I know that there haven been many discussions on the slowness of count(*)
> > even when an index is involved because the visibility of the rows has to
> > be checked. In the past I have seen many suggestions about using
> > triggers and tables to keep track of counts and while this works fine in
> > a situation where you know what the report is going to be ahead of time,
> > this is simply not an option when an unknown WHERE clause is to be used
> > (dynamically generated). I ran into a fine example of this when I was
> > searching this mailing list, "Searching in 856,646 pages took 13.48202
> > seconds. Site search powered by PostgreSQL 8.3." Obviously at some point
> > count(*) came into play here because the site made a list of pages (1 2
> > 3 4 5 6 > next). I very commonly make a list of pages from search
> > results, and the biggest time killer here is the count(*) portion, even
> > worse yet, I sometimes have to hit the database with two SELECT
> > statements, one with OFFSET and LIMIT to get the page of results I need
> > and another to get the amount of total rows so I can estimate how many
> > pages of results are available. The point I am driving at here is that
> > since building a list of pages of results is such a common thing to do,
> > there need to be some specific high speed ways to do this in one query.
> > Maybe an estimate(*) that works like count but gives an answer from the
> > index without checking visibility? I am sure that this would be good
> > enough to make a page list, it is really no big deal if it errors on the
> > positive side, maybe the list of pages has an extra page off the end. I
> > can live with that. What I can't live with is taking 13 seconds to get a
> > page of results from 850,000 rows in a table.
>
> 99% of the time in the situations you don't need an exact measure, and
> assuming analyze has run recently, select rel_tuples from pg_class for
> a given table is more than close enough. I'm sure wrapping that in a
> simple estimated_rows() function would be easy enough to do.
This is a very good approach and it works very well when you are counting the entire table, but when you have no control over the WHERE clause, it doesn't
help. IE: someone puts in a word to look for in a web form.
From my perspective, this issue is the biggest problem there is when using
Postgres to create web pages, and it is so commonly used, I think that there
should be a specific way to deal with it so that you don't have to run the
same WHERE clause twice.
IE: SELECT count(*) FROM <table> WHERE <clause>; to get the total amount of
items to make page navigation links, then:
SELECT <columns> FROM table WHERE <clause> LIMIT <items_per_page> OFFSET
<(page_no-1)*items_per_page>; to get the actual page contents.
How about
select * from (select *, count(*) over () as total_count from <table> where <clause>) a
LIMIT <items_per_page> OFFSET <(page_no-1)*items_per_page>
It will return a total_count column with the same value in every row. You may have problems if no rows are returned (e.g. the page number is too high).
-- Best regards,
Vitalii Tymchyshyn
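Filled in with placeholder table and column names, the suggestion looks like the query below; the window function (available from 8.4 onward) counts the full result set before LIMIT/OFFSET is applied, so the page contents and the total arrive in one round trip:

SELECT *
FROM (SELECT r.*,
             count(*) OVER () AS total_count   -- same value repeated on every row
      FROM results r
      WHERE r.body ILIKE '%postgres%') a
ORDER BY a.id
LIMIT 25      -- items_per_page
OFFSET 50;    -- (page_no - 1) * items_per_page, i.e. page 3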
On 10/10/2010 6:29 PM, Neil Whelchel wrote: > On the other hand, I copied a table out of one of my production servers that > has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp, > text). The first numeric column has numbers evenly spread between 0 and 100 > and it is indexed. I put the table in a pair of database servers both running > on the same physical hardware. One server is Postgres, the other is a popular > server (I am not mentioning names here). Please do. Your comment is pretty meaningless otherwise. If you're talking about MySQL: Were you using InnoDB or MyISAM table storage? Of course it's fast with MyISAM, it relies on locks to do updates and has bugger all capability for write concurrency, or to permit readers while writing is going on. If you're using InnoDB, then I'd like to know how they've managed that. If you're talking about some *other* database, please name it and provide any useful details, because the hand waving is not helpful. > I don't think that any amount of settings > tweaking will bring them even in the same ball park. If you are, in fact, comparing MySQL+MyISAM and PostgreSQL, then you're quite right. Pg will never have such a fast count() as MyISAM does or the same insanely fast read performance, and MyISAM will never be as reliable, robust or concurrency-friendly as Pg is. Take your pick, you can't have both. > There has been discussion > about the other server returning an incorrect count because all of the indexed > rows may not be live at the time. This is not a problem for the intended use, > that is why I suggested another function like estimate(*). It's name suggests > that the result will be close, not 100% correct, which is plenty good enough > for generating a list of results pages in most cases. Do you have any practical suggestions for generating such an estimate, though? I find it hard to think of any way the server can do that doesn't involve executing the query. The table stats are WAY too general and a bit hit-and-miss, and there isn't really any other way to do it. If all you want is a way to retrieve both a subset of results AND a count of how many results would've been generated, it sounds like all you really need is a way to get the total number of results returned by a cursor query, which isn't a big engineering challenge. I expect that in current Pg versions a trivial PL/PgSQL function could be used to slurp and discard unwanted results, but a better in-server option to count the results from a cursor query would certainly be nice. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
On 10/10/2010 6:29 AM, Neil Whelchel wrote: > On Saturday 09 October 2010 23:56:15 Craig Ringer wrote: >> On 10/10/2010 11:02 AM, Neil Whelchel wrote: >>> On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: >>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> >>> wrote: >>>>> I know that there haven been many discussions on the slowness of >>>>> count(*) even when an index is involved because the visibility of the >>>>> rows has to be checked. In the past I have seen many suggestions about >>>>> using triggers and tables to keep track of counts and while this works >>>>> fine in a situation where you know what the report is going to be >>>>> ahead of time, this is simply not an option when an unknown WHERE >>>>> clause is to be used (dynamically generated). I ran into a fine >>>>> example of this when I was searching this mailing list, "Searching in >>>>> 856,646 pages took 13.48202 seconds. Site search powered by PostgreSQL >>>>> 8.3." Obviously at some point count(*) came into play here because the >>>>> site made a list of pages (1 2 3 4 5 6> next). I very commonly make a >>>>> list of pages from search results, and the biggest time killer here is >>>>> the count(*) portion, even worse yet, I sometimes have to hit the >>>>> database with two SELECT statements, one with OFFSET and LIMIT to get >>>>> the page of results I need and another to get the amount of total rows >>>>> so I can estimate how many pages of results are available. The point I >>>>> am driving at here is that since building a list of pages of results >>>>> is such a common thing to do, there need to be some specific high >>>>> speed ways to do this in one query. Maybe an estimate(*) that works >>>>> like count but gives an answer from the index without checking >>>>> visibility? I am sure that this would be good enough to make a page >>>>> list, it is really no big deal if it errors on the positive side, >>>>> maybe the list of pages has an extra page off the end. I can live with >>>>> that. What I can't live with is taking 13 seconds to get a page of >>>>> results from 850,000 rows in a table. > Good point, maybe this is turning more into a discussion of how to generate a > list of pages of results and one page of results with one query so we don't > have to do the same painfully slow query twice to do a very common task. > > On the other hand, I copied a table out of one of my production servers that > has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp, > text). The first numeric column has numbers evenly spread between 0 and 100 > and it is indexed. I put the table in a pair of database servers both running > on the same physical hardware. One server is Postgres, the other is a popular > server (I am not mentioning names here). on Postgres: SELECT count(*) FROM > table where column>50; takes about 8 seconds to run. The other database server > took less than one second (about 25 ms) as it is using the index (I assume) to > come up with the results. It is true that this is not a fair test because both > servers were tested with their default settings, and the defaults for Postgres > are much more conservative, however, I don't think that any amount of settings > tweaking will bring them even in the same ball park. There has been discussion > about the other server returning an incorrect count because all of the indexed > rows may not be live at the time. This is not a problem for the intended use, > that is why I suggested another function like estimate(*). 
It's name suggests > that the result will be close, not 100% correct, which is plenty good enough > for generating a list of results pages in most cases. I am faced with a very > serious problem here. If the query to make a list of pages takes say 6 seconds > and it takes another 6 seconds to generate a page of results, the customer is > waiting 12 seconds. This is not going to work. If count made a quick estimate, > say less than a second, and it took 6 seconds to come up with the actual > results, I could live with that. Or if coming up with the window of results > via (OFFSET and LIMIT) and returned the total number of rows that would have > matched the query, then I would still have everything I need to render the > page in a reasonable time. I really think that this needs to be addressed > somewhere. It's not like I am the only one that does this. You see it nearly > everywhere a long list of results is (expected to be) returned in a web site. > Among the people I work with, this seems to be the most mentioned reason that > they claim that they don't use Postgres for their projects. > > It would be nice to see how the server comes up with the search results and > list of links to pages of results for this mailing list. > (http://search.postgresql.org/search?q=slow+count%28%29&m=1&l=&d=365&s=r) I am > guessing that it probably uses the count and query method I am talking about. > >> I guess there could be *some* way to expose the query planner's cost >> estimates in a manner useful for result count estimation ... but given >> how coarse its stats are and how wildly out the estimates can be, I kind >> of doubt it. It's really intended for query planning decisions and more >> interested in orders of magnitude, "0, 1, or more than that" measures, >> etc, and seems to consider 30% here or there to be pretty insignificant >> most of the time. >> >>> It's bad enough that count(*) is slow, then you have to do it all over >>> again to get the results you need! I have not dug into this much yet, >>> but would it be possible to return the amount of rows that a WHERE >>> clause would actually return if the LIMIT and OFFSET were not applied. >>> IE: When a normal query is executed, the server returns the number of >>> rows aside from the actual row data. Would it be a big deal to modify >>> this to allow it to return the amount of rows before the LIMIT and >>> OFFSET is applied as well? >> It'd force the server to fully execute the query. Then again, it sounds >> like you're doing that anyway. How big is your DB? How fast is your disk access? Any chance disks/RAM can be addressed? My disk access is pitiful... first run, 2.3 million rows.. 0m35.38s, subsequent runs.. real 0m2.55s rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting count --------- 2340704 (1 row) real 0m35.38s user 0m0.25s sys 0m0.03s subsequent runs.... (count changes due to inserts.) rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting count --------- 2363707 (1 row) real 0m2.70s user 0m0.27s sys 0m0.02s rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting count --------- 2363707 (1 row) real 0m2.55s user 0m0.26s sys 0m0.02s rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting count --------- 2363707 (1 row) real 0m2.50s user 0m0.26s sys 0m0.02s reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table')); pg_size_pretty ---------------- 1890 MB (1 row)
Craig Ringer wrote: > On 10/10/2010 9:54 AM, Mladen Gogala wrote: > > >> Unfortunately, the problem is in the rather primitive way PostgreSQL >> does I/O. It didn't change in 9.0 so there is nothing you could gain by >> upgrading. If you execute strace -o /tmp/pg.out -e read <PID of the >> sequential scan process> and inspect the file /tmp/pg.out when the query >> finishes, you will notice a gazillion of read requests, all of them 8192 >> bytes in size. That means that PostgreSQL is reading the table block by >> block, without any merging of the requests. >> > > I'd be really interested in any measurements you've done to determine > the cost of this over doing reads in larger chunks. If they're properly > detailed and thought out, the -hackers list is likely to be interested > as well. > I can provide measurements, but from Oracle RDBMS. Postgres doesn't allow tuning of that aspect, so no measurement can be done. Would the numbers from Oracle RDBMS be acceptable? > The Linux kernel, at least, does request merging (and splitting, and > merging, and more splitting) along the request path, and I'd personally > expect that most of the cost of 8k requests would be in the increased > number of system calls, buffer copies, etc required. Measurements > demonstrating or contradicting this would be good to see. > Even the cost of hundreds of thousands of context switches is far from negligible. What kind of measurements do you expect me to do with the database which doesn't support tweaking of that aspect of its operation? > It's worth being aware that there are memory costs to doing larger > reads, especially when you have many backends each of which want to > allocate a larger buffer for reading. Oh, it's not only larger memory, the buffer management would have to be changed too, to prevent process doing a sequential scan from inundating the shared buffers. Alternatively, the blocks would have to be written into the private memory and immediately thrown away after that. However, the experience with Oracle tells me that this is well worth it. Here are the numbers: Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options SQL> show parameter db_file_multi NAME TYPE VALUE ------------------------------------ ----------- ------------------------------ db_file_multiblock_read_count integer 16 SQL> alter session set db_file_multiblock_read_count=1; Session altered. SQL> select count(*) from ni_occurrence; COUNT(*) ---------- 402062638 Elapsed: 00:08:20.88 SQL> alter session set db_file_multiblock_read_count=128; Session altered. Elapsed: 00:00:00.50 SQL> select count(*) from ni_occurrence; COUNT(*) ---------- 402062638 Elapsed: 00:02:17.58 In other words, when I batched the sequential scan to do 128 blocks I/O, it was 4 times faster than when I did the single block I/O. Does that provide enough of an evidence and, if not, why not? > If you can use a chunk of > shared_buffers as the direct destination for the read that's OK, but > otherwise you're up for (1mb-8kb)*num_backends extra memory use on I/O > buffers that could otherwise be used as shared_buffers or OS cache. > > Async I/O, too, has costs. > There is a common platitude that says that there is no such thing as free lunch. However, both Oracle RDBMS and IBM DB2 use asynchronous I/O, probably because they're unaware of the danger. 
Let me now give you a full table scan of a much smaller table located in a Postgres database: news=> select count(*) from internet_web_sites; count --------- 1290133 (1 row) Time: 12838.958 ms Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes 12.8 seconds to count 1.2 million records? Do you see the disparity? Both databases, Oracle and Postgres, are utilizing the same 3Par SAN device, the machines housing both databases are comparable HP 64 bit Linux machines, both running 64 bit version of Red Hat 5.5. Respective table sizes are here: SQL> select bytes/1048576 as MB from user_segments 2 where segment_name='NI_OCCURRENCE'; MB ---------- 35329 news=> select pg_size_pretty(pg_table_size('moreover.internet_web_sites')); pg_size_pretty ---------------- 216 MB (1 row) So, I really pushed Oracle much harder than I pushed Postgres. > > PostgreSQL is in > >> dire need of something similar and it wouldn't even be that hard to >> implement. >> > > I'd really like to see both those assertions backed with data or patches ;-) > With the database that doesn't allow tuning of that aspect, it's the self-defeating proposition. However, I did my best to give you the numbers. > Personally, I know just enough about how PG's I/O path works to suspect > that "not that hard to implement" is probably a little ... > over-optimistic. Sure, it's not that hard to implement in a new program > with no wired-in architectural and design choices; that doesn't mean > it's easy to retrofit onto existing code, especially a bunch of > co-operating processes with their own buffer management. > > It maybe so, but slow sequential scan is still the largest single performance problem of PostgreSQL. The frequency with which that topic appears on the mailing lists should serve as a good evidence for that. I did my best to prove my case. Again, requiring "hard numbers" when using the database which doesn't allow tweaking of the I/O size is self defeating proposition. The other databases, like DB2 and Oracle both allow tweaking of that aspect of its operation, Oracle even on the per session basis. If you still claim that it wouldn't make the difference, the onus to prove it is on you. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
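Since the read size cannot be tuned on the Postgres side, one database-independent way to see what larger requests buy on a given storage stack is to read the same large file with different block sizes, dropping the OS cache between runs so both passes actually hit the disk. A rough sketch (the file path is a placeholder; the cache drop needs root):

sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/path/to/some_large_file of=/dev/null bs=8k    # 8 kB requests, like Postgres
sync; echo 3 > /proc/sys/vm/drop_caches
dd if=/path/to/some_large_file of=/dev/null bs=1M    # 1 MB requests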
On 10/10/2010 11:02 AM, Reid Thompson wrote: >>>> On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> >>>> >> On the other hand, I copied a table out of one of my production servers that >> has about 60,000 rows with 6 columns (numeric, numeric, bool, bool, timestamp, >> text). The first numeric column has numbers evenly spread between 0 and 100 >> and it is indexed. I put the table in a pair of database servers both running >> on the same physical hardware. One server is Postgres, the other is a popular >> server (I am not mentioning names here). on Postgres: SELECT count(*) FROM >> table where column>50; takes about 8 seconds to run. The other database server >> took less than one second (about 25 ms) as it is using the index (I assume) to >> come up with the results. It is true that this is not a fair test because both >> servers were tested with their default settings, and the defaults for Postgres >> are much more conservative, however, I don't think that any amount of settings >> tweaking will bring them even in the same ball park. There has been discussion >> about the other server returning an incorrect count because all of the indexed >> rows may not be live at the time. This is not a problem for the intended use, >> that is why I suggested another function like estimate(*). It's name suggests >> that the result will be close, not 100% correct, which is plenty good enough >> for generating a list of results pages in most cases. I am faced with a very >> serious problem here. If the query to make a list of pages takes say 6 seconds >> and it takes another 6 seconds to generate a page of results, the customer is >> waiting 12 seconds. This is not going to work. If count made a quick estimate, >> say less than a second, and it took 6 seconds to come up with the actual >> results, I could live with that. Or if coming up with the window of results >> via (OFFSET and LIMIT) and returned the total number of rows that would have >> matched the query, then I would still have everything I need to render the >> page in a reasonable time. I really think that this needs to be addressed >> somewhere. It's not like I am the only one that does this. You see it nearly >> everywhere a long list of results is (expected to be) returned in a web site. >> Among the people I work with, this seems to be the most mentioned reason that >> they claim that they don't use Postgres for their projects. t anyway. > > How big is your DB? > How fast is your disk access? > Any chance disks/RAM can be addressed? > > My disk access is pitiful... > first run, 2.3 million rows.. 0m35.38s, subsequent runs.. real 0m2.55s > > rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting > count > --------- > 2340704 > (1 row) > > > real 0m35.38s > user 0m0.25s > sys 0m0.03s > > subsequent runs.... (count changes due to inserts.) 
> > rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting > count > --------- > 2363707 > (1 row) > > > real 0m2.70s > user 0m0.27s > sys 0m0.02s > rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting > count > --------- > 2363707 > (1 row) > > > real 0m2.55s > user 0m0.26s > sys 0m0.02s > rthompso@hw-prod-repdb1> time psql -c "select count(*) from my_production_table" reporting > count > --------- > 2363707 > (1 row) > > > real 0m2.50s > user 0m0.26s > sys 0m0.02s > > reporting=# SELECT pg_size_pretty(pg_total_relation_size('my_production_table')); > pg_size_pretty > ---------------- > 1890 MB > (1 row) > > forgot to note, my table schema is significantly larger. rthompso@hw-prod-repdb1> time psql -c "\d my_production_table_201010" reporting Table "public.my_production_table_201010" Column | Type | Modifiers -----------------------------+-----------------------------+---------------------------------------------------------------- | integer | not null default nextval('my_production_table_parent_id_seq'::regclass) | character varying(20) | | character(1) | | character varying(32) | | character varying(32) | | character varying(20) | | character varying(5) | | character varying(5) | | date | | character(1) | | character varying(32) | | character varying(32) | | character varying(32) | | character varying(2) | | character varying(10) | | character varying(10) | | character varying(32) | | character varying(7) | | character varying(10) | | character varying(2) | | character varying(9) | | character varying(9) | | character varying(9) | | character varying(10) | | character varying(32) | | character varying(32) | | character varying(20) | | character varying(5) | | character varying(5) | | character varying(32) | | character varying(32) | | character varying(32) | | character varying(2) | | character varying(10) | | character varying(10) | | character varying(10) | | character varying(10) | | integer | | character varying(2) | | character varying(32) | | character varying(32) | | integer | | integer | | text | | character varying(3) | | date | | date | | date | | integer | | integer | | integer | | integer | | character varying(6) | | character varying(10) | | character varying(32) | | character varying(32) | | character varying(32) | | character varying(10) | | character varying(6) | | character varying(8) | | boolean | | character(1) | | date | | integer | | date | | character varying(11) | | character varying(4) | | character(1) | | date | | character varying(5) | | character varying(20) | | date | | character(1) | | character(1) | | character varying(2) | | text | | integer | | integer | | timestamp without time zone | default now() | timestamp without time zone | | character varying(64) | | character varying(64) | | character varying(64) | Indexes: "my_production_table_201010_pkey" PRIMARY KEY, btree (id) "my_production_table_201010_date_idx" btree (xxxxdate), tablespace "indexspace" "my_production_table_201010_epatient_idx" btree (storeid, xxxxxxxxxxxxx), tablespace "indexspace" "my_production_table_201010_medicationname_idx" btree (xxxxxxxxxxxxxx), tablespace "indexspace" "my_production_table_201010_ndc_idx" btree (xxx), tablespace "indexspace" Check constraints: "my_production_table_201010_filldate_check" CHECK (xxxxdate >= '2010-10-01'::date AND xxxxdate < '2010-11-01'::date) Foreign-key constraints: "my_production_table_201010_pkgfileid_fkey" FOREIGN KEY (pkgfileid) REFERENCES my_production_tablefiles(id) 
Inherits: my_production_table_parent
On Sunday 10 October 2010 05:02:03 Віталій Тимчишин wrote: > 2010/10/10 Neil Whelchel <neil.whelchel@gmail.com> > > > On Saturday 09 October 2010 18:47:34 Scott Marlowe wrote: > > > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel <neil.whelchel@gmail.com> > > > > wrote: > > > > I know that there haven been many discussions on the slowness of > > > > count(*) > > > > > > even when an index is involved because the visibility of the rows has > > > > to > > > > > > be checked. In the past I have seen many suggestions about using > > > > triggers and tables to keep track of counts and while this works fine > > > > in > > > > > > a situation where you know what the report is going to be ahead of > > > > time, > > > > > > this is simply not an option when an unknown WHERE clause is to be > > > > used (dynamically generated). I ran into a fine example of this when > > > > I was searching this mailing list, "Searching in 856,646 pages took > > > > 13.48202 seconds. Site search powered by PostgreSQL 8.3." Obviously > > > > at some > > > > point > > > > > > count(*) came into play here because the site made a list of pages (1 > > > > 2 3 4 5 6 > next). I very commonly make a list of pages from search > > > > results, and the biggest time killer here is the count(*) portion, > > > > even worse yet, I sometimes have to hit the database with two SELECT > > > > statements, one with OFFSET and LIMIT to get the page of results I > > > > need and another to get the amount of total rows so I can estimate > > > > how many pages of results are available. The point I am driving at > > > > here is that since building a list of pages of results is such a > > > > common thing to do, there need to be some specific high speed ways > > > > to do this in one query. Maybe an estimate(*) that works like count > > > > but gives an answer from the index without checking visibility? I am > > > > sure that this would be good enough to make a page list, it is > > > > really no big deal if it errors on > > > > the > > > > > > positive side, maybe the list of pages has an extra page off the end. > > > > I can live with that. What I can't live with is taking 13 seconds to > > > > get > > > > a > > > > > > page of results from 850,000 rows in a table. > > > > > > 99% of the time in the situations you don't need an exact measure, and > > > assuming analyze has run recently, select rel_tuples from pg_class for > > > a given table is more than close enough. I'm sure wrapping that in a > > > simple estimated_rows() function would be easy enough to do. > > > > This is a very good approach and it works very well when you are counting > > the > > entire table, but when you have no control over the WHERE clause, it > > doesn't > > help. IE: someone puts in a word to look for in a web form. > > > > From my perspective, this issue is the biggest problem there is when > > using Postgres to create web pages, and it is so commonly used, I think > > that there > > should be a specific way to deal with it so that you don't have to run > > the same WHERE clause twice. > > IE: SELECT count(*) FROM <table> WHERE <clause>; to get the total amount > > of items to make page navigation links, then: > > SELECT <columns> FROM table WHERE <clause> LIMIT <items_per_page> OFFSET > > <(page_no-1)*items_per_page>; to get the actual page contents. 
> > > > How about > > select * from (select *, count(*) over () as total_count from <table> where > <clause>) a LIMIT <items_per_page> OFFSET > <(page_no-1)*items_per_page> > It will return you a total_count column with an equal value in each row. You may > have problems if no rows are returned (e.g. page num is too high). I have done this before, but the speedup over the two hits to the database that I mentioned above is tiny, just a few ms. It seems to end up doing about the same thing on the database end. The reason that I don't commonly do this is what you said about not getting a count result if you run off the end. -Neil-
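For reference, a minimal sketch of the windowed-count query being discussed, assuming a hypothetical items table with an id primary key and a title column, 25 items per page, showing page 3:

    SELECT id, title,
           count(*) OVER () AS total_count   -- full match count, repeated on every returned row
    FROM items
    WHERE title ILIKE '%postgres%'
    ORDER BY id
    LIMIT 25 OFFSET 50;

The window count is computed before LIMIT/OFFSET are applied, so total_count reflects all matching rows; but as noted above, if OFFSET runs past the last match no rows come back at all, and the total is lost with them.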
On 10/11/2010 01:14 AM, Mladen Gogala wrote: > I can provide measurements, but from Oracle RDBMS. Postgres doesn't > allow tuning of that aspect, so no measurement can be done. Would the > numbers from Oracle RDBMS be acceptable? Well, they'd tell me a lot about Oracle's performance as I/O chunk size scales, but almost nothing about the cost of small I/O operations vs larger ones in general. Typically dedicated test programs that simulate the database read patterns would be used for this sort of thing. I'd be surprised if nobody on -hackers has already done suitable testing; I was mostly asking because I was interested in how you were backing your assertions. PostgreSQL isn't Oracle; their design is in many ways very different. Most importantly, Oracle uses a redo log, where PostgreSQL stores old rows with visibility information directly in the tables. It is possible that a larger proportion of Oracle's I/O costs are fixed per-block overheads rather than per-byte costs, so it seeks to batch requests into larger chunks. Of course, it's also possible that 8k chunk I/O is just universally expensive and is something Pg should avoid, too, but we can't know that without dedicated testing, which I at least haven't done. I don't follow -hackers closely, and wouldn't have seen discussion about testing done there. The archives are likely to contain useful discussions. Then again, IIRC Pg's page size is also it's I/O size, so you could actually get larger I/O chunking by rebuilding Pg with larger pages. Having never had the need, I haven't examined the performance of page size changes on I/O performance. >> The Linux kernel, at least, does request merging (and splitting, and >> merging, and more splitting) along the request path, and I'd >> personally expect that most of the cost of 8k requests would be in the >> increased number of system calls, buffer copies, etc required. >> Measurements demonstrating or contradicting this would be good to see. > > Even the cost of hundreds of thousands of context switches is far from > negligible. What kind of measurements do you expect me to do with the > database which doesn't support tweaking of that aspect of its operation? Test programs, or references to testing done by others that demonstrates these costs in isolation. Of course, they still wouldn't show what gain Pg might obtain (nothing except hacking on Pg's sources really will) but they'd help measure the costs of doing I/O that way. I suspect you're right that large I/O chunks would be desirable and a potential performance improvement. What I'd like to know is *how* *much*, or at least how much the current approach costs in pure overheads like context switches and scheduler delays. > Does that provide enough of an evidence and, if not, why not? It shows that it helps Oracle a lot ;-) Without isolating how much of that is raw costs of the block I/O and how much is costs internal to Oracle, it's still hard to have much idea how much it'd benefit Pg to take a similar approach. I'm sure the folks on -hackers have been over this and know a whole lot more about it than I do, though. > Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes > 12.8 seconds to count 1.2 million records? Do you see the disparity? Sure. What I don't know is how much of that is due to block sizes. There are all sorts of areas where Oracle could be gaining. > It maybe so, but slow sequential scan is still the largest single > performance problem of PostgreSQL. 
The frequency with which that topic > appears on the mailing lists should serve as a good evidence for that. I'm certainly not arguing that it could use improvement; it's clearly hurting some users. I just don't know if I/O chunking is the answer - I suspect that if it were, then it would've become a priority for one or more people working on Pg much sooner. It's quite likely that it's one of those things where it makes a huge difference for Oracle because Oracle has managed to optimize out most of the other bigger costs. If Pg still has other areas that make I/O more expensive per-byte (say, visibility checks) and low fixed per-block costs, then there'd be little point in chunking I/O. My understanding is that that's pretty much how things stand at the moment, but I'd love verification from someone who's done the testing. >If you still claim that it wouldn't make the difference, > the onus to prove it is on you. I didn't mean to claim that it would make no difference. If I sounded like it, sorry. I just want to know how _much_ , or more accurately how great the overheads of the current approach in Pg are vs doing much larger reads. -- Craig Ringer
On Sun, Oct 10, 2010 at 12:14 PM, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote: > > > > In other words, when I batched the sequential scan to do 128 blocks I/O, it > was 4 times faster than when I did the single block I/O. > Does that provide enough of an evidence and, if not, why not? These numbers tell us nothing because, unless you dropped the caches between runs, at least part of some runs was very probably cached. -- Jon
On Mon, Oct 11, 2010 at 06:41:16AM +0800, Craig Ringer wrote: > On 10/11/2010 01:14 AM, Mladen Gogala wrote: > >> I can provide measurements, but from Oracle RDBMS. Postgres doesn't >> allow tuning of that aspect, so no measurement can be done. Would the >> numbers from Oracle RDBMS be acceptable? > > Well, they'd tell me a lot about Oracle's performance as I/O chunk size > scales, but almost nothing about the cost of small I/O operations vs > larger ones in general. > > Typically dedicated test programs that simulate the database read > patterns would be used for this sort of thing. I'd be surprised if > nobody on -hackers has already done suitable testing; I was mostly > asking because I was interested in how you were backing your assertions. One thing a test program would have to take into account is multiple concurrent users. What speeds up the single user case may well hurt the multi user case, and the behaviors that hurt single user cases may have been put in place on purpose to allow decent multi-user performance. Of course, all of that is "might" and "maybe", and I can't prove any assertions about block size either. But the fact of multiple users needs to be kept in mind. It was asserted that reading bigger chunks would help performance; a response suggested that, at least in Linux, setting readahead on a device would essentially do the same thing. Or that's what I got from the thread, anyway. I'm interested to know how similar performance might be between the large block size case and the large readahead case. Comments, anyone? -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On 10/11/2010 08:27 AM, Joshua Tolley wrote: > One thing a test program would have to take into account is multiple > concurrent users. What speeds up the single user case may well hurt the > multi user case, and the behaviors that hurt single user cases may have been > put in place on purpose to allow decent multi-user performance. Of course, all > of that is "might" and "maybe", and I can't prove any assertions about block > size either. But the fact of multiple users needs to be kept in mind. Agreed. I've put together a simple test program to test I/O chunk sizes. It only tests single-user performance, but it'd be pretty trivial to adapt it to spawn a couple of worker children or run several threads, each with a suitable delay as it's rather uncommon to have a bunch of seqscans all fire off at once. From this test it's pretty clear that with buffered I/O of an uncached 700mb file under Linux, the I/O chunk size makes very little difference, with all chunk sizes taking 9.8s to read the test file, with near-identical CPU utilization. Caches were dropped between each test run. For direct I/O (by ORing the O_DIRECT flag to the open() flags), chunk size is *hugely* significant, with 4k chunk reads of the test file taking 38s, 8k 22s, 16k 14s, 32k 10.8s, 64k - 1024k 9.8s, then rising a little again over 1024k. Apparently Oracle is almost always configured to use direct I/O, so it would benefit massively from large chunk sizes. PostgreSQL is almost never used with direct I/O, and at least in terms of the low-level costs of syscalls and file system activity, shouldn't care at all about read chunk sizes. Bumping readahead from 256 to 8192 made no significant difference for either case. Of course, I'm on a crappy laptop disk... I'm guessing this is the origin of the OP's focus on I/O chunk sizes. Anyway, for the single-seqscan case, I see little evidence here that using a bigger read chunk size would help PostgreSQL reduce overheads or improve performance. OP: Is your Oracle instance using direct I/O? -- Craig Ringer
On 10/10/2010 8:27 PM, Joshua Tolley wrote: > It was asserted that reading bigger chunks would help performance; a response > suggested that, at least in Linux, setting readahead on a device would > essentially do the same thing. Or that's what I got from the thread, anyway. > I'm interested to know how similar performance might be between the large > block size case and the large readahead case. Comments, anyone? > Craig may be right; the fact that Oracle is doing direct I/O probably does account for the difference. The fact is, however, that the question about slow sequential scans appears with some regularity on PostgreSQL forums. My guess that a larger chunk would be helpful may not be correct, but I do believe that there is a problem with sequential scans being too slow. Bigger chunks are a very traditional solution which may not work here, but the problem is still there. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
On Sun, Oct 10, 2010 at 11:14:43PM -0400, Mladen Gogala wrote: > The fact is, however, that the question > about slow sequential scan appears with some regularity on PostgreSQL > forums. Definitely. Whether that's because there's something pathologically wrong with sequential scans, or just because they're the slowest of the common operations, remains to be seen. After all, if sequential scans were suddenly fast, something else would be the slowest thing postgres commonly did. All that said, if there's gain to be had by increasing block size, or something else, esp. if it's low hanging fruit, w00t. -- Joshua Tolley / eggyknap End Point Corporation http://www.endpoint.com
On 11/10/10 11:14, Mladen Gogala wrote: > On 10/10/2010 8:27 PM, Joshua Tolley wrote: >> It was asserted that reading bigger chunks would help performance; a >> response >> suggested that, at least in Linux, setting readahead on a device would >> essentially do the same thing. Or that's what I got from the thread, >> anyway. >> I'm interested to know how similar performance might be between the large >> block size case and the large readahead case. Comments, anyone? >> > > Craig maybe right, the fact that Oracle is doing direct I/O probably > does account for the difference. The fact is, however, that the question > about slow sequential scan appears with some regularity on PostgreSQL > forums. My guess that a larger chunk would be helpful may not be > correct, but I do believe that there is a problem with a too slow > sequential scan. Bigger chunks are a very traditional solution which > may not work but the problem is still there. Now that, I agree with. BTW, I casually looked into async I/O a little, and it seems the general situation for async I/O on Linux is "pretty bloody awful". POSIX async I/O uses signal-driven completion handlers - but signal queue depth limits mean they aren't necessarily reliable, so you can get lost completions and have to scan the event buffer periodically to catch them. The alternative is running completion handlers in threads, but apparently there are queue depth limit issues there too, as well as the joy that is getting POSIX threading right. I think there was some talk about this on -HACKERS a while ago. Here's the main discussion on async I/O I've found: http://archives.postgresql.org/pgsql-hackers/2006-10/msg00820.php ... from which it seems that async buffered I/O is poorly supported, if at all, on current Linux kernels. Don't know about the BSDs. As Pg is *really* poorly suited to direct I/O, relying on the OS buffer cache as it does, unbuffered direct I/O isn't really an option. Linux async I/O seems to be designed for network I/O and for monitoring lots of files for changes, rather than for highly concurrent I/O on one or a few files. It shows. Re slow seqscans, there is still plenty of room to move: - Sequential scans cannot (AFAIK) use the visibility map introduced in 8.4 to skip sections of tables that are known to contain only dead tuples not visible to any transaction or free space. This potential optimization could make a big difference in tables with FILLFACTOR or with holes created by certain update patterns. - Covering indexes ("index oriented" table columns) and/or indexes with embedded visibility information could dramatically improve the performance of certain queries by eliminating the need to hit the heap at all, albeit at the cost of trade-offs elsewhere. This would be particularly useful for those classic count() queries. There have been discussions about these on -hackers, but I'm not up with the current status or lack thereof. - There's been recent talk of using pread() rather than lseek() and read() to save on syscall overhead. The difference is probably minimal, but it'd be nice. It is worth being aware of a few other factors: - Sometimes seqscans are actually the fastest option, and people don't realize this, so they try to force index use where it doesn't make sense. This is the cause of a significant number of list complaints. - Slow sequential scans are often a consequence of table bloat. It's worth checking for this. 
Pg's autovacuum and manual vacuum have improved in performance and usability dramatically over time, but still have room to move. Sometimes people disable autovacuum in the name of a short-lived performance boost, not realizing it'll have horrible effects on performance in the mid- to long- term. - Seqscans can be chosen when index scans are more appropriate if the random_page_cost and seq_page_cost aren't set sensibly, which they usually aren't. This doesn't make seqscans any faster, but it's even worse when you have a good index you're not using. I can't help but wonder if a bundled "quick and dirty benchmark" tool for Pg would be beneficial in helping to determine appropriate values for these settings and for effective io concurrency. -- Craig Ringer Tech-related writing: http://soapyfrogs.blogspot.com/
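A quick, low-risk way to sanity-check those cost settings is to flip them for a single session and see whether the plan for a representative query changes; the table and column names below are placeholders:

    SHOW random_page_cost;   -- 4.0 by default
    SHOW seq_page_cost;      -- 1.0 by default

    EXPLAIN SELECT * FROM my_table WHERE some_indexed_col = 42;

    -- Tell the planner random I/O is relatively cheaper (plenty of cache, fast array)
    SET random_page_cost = 2.0;
    EXPLAIN SELECT * FROM my_table WHERE some_indexed_col = 42;

    RESET random_page_cost;

If a query only switches from a seqscan to an index scan after the change, and the index plan really is faster on the hardware in question, the defaults are probably too pessimistic for that setup.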
On Sunday 10 October 2010 15:41:16 you wrote: > On 10/11/2010 01:14 AM, Mladen Gogala wrote: > > I can provide measurements, but from Oracle RDBMS. Postgres doesn't > > allow tuning of that aspect, so no measurement can be done. Would the > > numbers from Oracle RDBMS be acceptable? > > Well, they'd tell me a lot about Oracle's performance as I/O chunk size > scales, but almost nothing about the cost of small I/O operations vs > larger ones in general. > > Typically dedicated test programs that simulate the database read > patterns would be used for this sort of thing. I'd be surprised if > nobody on -hackers has already done suitable testing; I was mostly > asking because I was interested in how you were backing your assertions. > > PostgreSQL isn't Oracle; their design is in many ways very different. > Most importantly, Oracle uses a redo log, where PostgreSQL stores old > rows with visibility information directly in the tables. It is possible > that a larger proportion of Oracle's I/O costs are fixed per-block > overheads rather than per-byte costs, so it seeks to batch requests into > larger chunks. Of course, it's also possible that 8k chunk I/O is just > universally expensive and is something Pg should avoid, too, but we > can't know that without > dedicated testing, which I at least haven't done. I don't follow > -hackers closely, and wouldn't have seen discussion about testing done > there. The archives are likely to contain useful discussions. > > Then again, IIRC Pg's page size is also it's I/O size, so you could > actually get larger I/O chunking by rebuilding Pg with larger pages. > Having never had the need, I haven't examined the performance of page > size changes on I/O performance. This is a really good idea! I will look into doing this and I will post the results as soon as I can get it done. Right now, I am building a test machine with two dual core Intel processors and two 15KRPM mirrored hard drives, 1 GB ram. I am using a small amount of ram because I will be using small test tables. I may do testing in the future with more ram and bigger tables, but I think I can accomplish what we are all after with what I have. The machine will be limited to running the database server in test, init, bash, and ssh, no other processes will be running except for what is directly involved with testing. I will post exact specs when I post test results. I will create some test tables, and the same tables will be used in all tests. Suggestions for optimal Postgres and system configuration are welcome. I will try any suggested settings that I have time to test. -Neil- > > >> The Linux kernel, at least, does request merging (and splitting, and > >> merging, and more splitting) along the request path, and I'd > >> personally expect that most of the cost of 8k requests would be in the > >> increased number of system calls, buffer copies, etc required. > >> Measurements demonstrating or contradicting this would be good to see. > > > > Even the cost of hundreds of thousands of context switches is far from > > negligible. What kind of measurements do you expect me to do with the > > database which doesn't support tweaking of that aspect of its operation? > > Test programs, or references to testing done by others that demonstrates > these costs in isolation. Of course, they still wouldn't show what gain > Pg might obtain (nothing except hacking on Pg's sources really will) but > they'd help measure the costs of doing I/O that way. 
> > I suspect you're right that large I/O chunks would be desirable and a > potential performance improvement. What I'd like to know is *how* > *much*, or at least how much the current approach costs in pure > overheads like context switches and scheduler delays. > > > Does that provide enough of an evidence and, if not, why not? > > It shows that it helps Oracle a lot ;-) > > Without isolating how much of that is raw costs of the block I/O and how > much is costs internal to Oracle, it's still hard to have much idea how > much it'd benefit Pg to take a similar approach. > > I'm sure the folks on -hackers have been over this and know a whole lot > more about it than I do, though. > > > Oracle counts 400 million records in 2 minutes and Postgres 9.01 takes > > 12.8 seconds to count 1.2 million records? Do you see the disparity? > > Sure. What I don't know is how much of that is due to block sizes. There > are all sorts of areas where Oracle could be gaining. > > > It maybe so, but slow sequential scan is still the largest single > > performance problem of PostgreSQL. The frequency with which that topic > > appears on the mailing lists should serve as a good evidence for that. > > I'm certainly not arguing that it could use improvement; it's clearly > hurting some users. I just don't know if I/O chunking is the answer - I > suspect that if it were, then it would've become a priority for one or > more people working on Pg much sooner. > > It's quite likely that it's one of those things where it makes a huge > difference for Oracle because Oracle has managed to optimize out most of > the other bigger costs. If Pg still has other areas that make I/O more > expensive per-byte (say, visibility checks) and low fixed per-block > costs, then there'd be little point in chunking I/O. My understanding is > that that's pretty much how things stand at the moment, but I'd love > verification from someone who's done the testing. > > >If you still claim that it wouldn't make the difference, > > > > the onus to prove it is on you. > > I didn't mean to claim that it would make no difference. If I sounded > like it, sorry. > > I just want to know how _much_ , or more accurately how great the > overheads of the current approach in Pg are vs doing much larger reads. > > -- > Craig Ringer
> I ran into a fine example of this when I was searching this mailing list, > "Searching in 856,646 pages took 13.48202 seconds. Site search powered by > PostgreSQL 8.3." Obviously at some point count(*) came into play here Well, tsearch full text search is excellent, but it has to work inside the limits of the postgres database itself, which means row visibility checks, and therefore, yes, extremely slow count(*) on large result sets when the tables are not cached in RAM. Also, if you want to use custom sorting (like by date, thread, etc.) possibly all the matching rows will have to be read and sorted. Consider, for example, the Xapian full text search engine. It is not MVCC (it is single writer, multiple reader, so only one process can update the index at a time, but readers are not locked out during writes). Of course, you would never want something like that for your main database! However, in its particular application, which is multi-criteria full text search (and flexible sorting of results), it just nukes tsearch2 on datasets not cached in RAM, simply because everything in it, including disk layout etc., has been optimized for the application. Lucene is similar (but I have not benchmarked it versus tsearch2, so I can't tell). So, if your full text search is a problem, just use Xapian. You can update the Xapian index from a postgres trigger (using an independent process, or simply a plpython trigger using the python Xapian bindings). You can query it using an extra process acting as a server, or you can write a set-returning plpython function which performs Xapian searches, and you can join the results to your tables. > Pg will never have such a fast count() as MyISAM does or the same > insanely fast read performance, Benchmark it and you'll see: MyISAM is faster than postgres for "small simple selects" only if: - pg doesn't use prepared queries (planning time takes longer than a really simple select) - myisam can use index-only access - no one is writing to the myisam table at the moment, obviously On equal grounds (i.e., SELECT * FROM table WHERE pk = value) there is no difference. The TCP/IP overhead is larger than the query anyway; you have to use unix sockets on both to get valid timings. Since by default on localhost MySQL seems to use unix sockets and PG uses tcp/ip, PG seems 2x slower, which is in fact not true.
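To illustrate the prepared-query point above (the table and key names are placeholders): preparing once moves the planning cost out of the per-call path, which is most of the cost of a trivial primary-key lookup:

    PREPARE by_pk (integer) AS
        SELECT * FROM some_table WHERE pk = $1;

    EXECUTE by_pk(42);   -- planned once at PREPARE time, only executed here
    EXECUTE by_pk(43);

    DEALLOCATE by_pk;    -- prepared statements are per-connection; clean up when done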
On 10/9/10 6:47 PM, Scott Marlowe wrote: > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> wrote: >> I know that there haven been many discussions on the slowness of count(*) even >> when an index is involved because the visibility of the rows has to be >> checked. In the past I have seen many suggestions about using triggers and >> tables to keep track of counts and while this works fine in a situation where >> you know what the report is going to be ahead of time, this is simply not an >> option when an unknown WHERE clause is to be used (dynamically generated). >> I ran into a fine example of this when I was searching this mailing list, >> "Searching in 856,646 pages took 13.48202 seconds. Site search powered by >> PostgreSQL 8.3." Obviously at some point count(*) came into play here because >> the site made a list of pages (1 2 3 4 5 6> next). I very commonly make a >> list of pages from search results, and the biggest time killer here is the >> count(*) portion, even worse yet, I sometimes have to hit the database with >> two SELECT statements, one with OFFSET and LIMIT to get the page of results I >> need and another to get the amount of total rows so I can estimate how many >> pages of results are available. The point I am driving at here is that since >> building a list of pages of results is such a common thing to do, there need >> to be some specific high speed ways to do this in one query. Maybe an >> estimate(*) that works like count but gives an answer from the index without >> checking visibility? I am sure that this would be good enough to make a page >> list, it is really no big deal if it errors on the positive side, maybe the >> list of pages has an extra page off the end. I can live with that. What I >> can't live with is taking 13 seconds to get a page of results from 850,000 >> rows in a table. > > 99% of the time in the situations you don't need an exact measure, and > assuming analyze has run recently, select rel_tuples from pg_class for > a given table is more than close enough. I'm sure wrapping that in a > simple estimated_rows() function would be easy enough to do. First of all, it's not true. There are plenty of applications that need an exact answer. Second, even if it is only 1%,that means it's 1% of the queries, not 1% of people. Sooner or later a large fraction of developers will run into this. It's probably been the most-asked question I've seen on this forum in the four years I've been here. It's a real problem,and it needs a real solution. I know it's a hard problem to solve, but can we stop hinting that those of us who have this problem are somehow being dense? Thanks, Craig
On Monday 11 October 2010 10:46:17 Craig James wrote: > On 10/9/10 6:47 PM, Scott Marlowe wrote: > > On Sat, Oct 9, 2010 at 5:26 PM, Neil Whelchel<neil.whelchel@gmail.com> wrote: > >> I know that there haven been many discussions on the slowness of > >> count(*) even when an index is involved because the visibility of the > >> rows has to be checked. In the past I have seen many suggestions about > >> using triggers and tables to keep track of counts and while this works > >> fine in a situation where you know what the report is going to be ahead > >> of time, this is simply not an option when an unknown WHERE clause is > >> to be used (dynamically generated). I ran into a fine example of this > >> when I was searching this mailing list, "Searching in 856,646 pages > >> took 13.48202 seconds. Site search powered by PostgreSQL 8.3." > >> Obviously at some point count(*) came into play here because the site > >> made a list of pages (1 2 3 4 5 6> next). I very commonly make a list > >> of pages from search results, and the biggest time killer here is the > >> count(*) portion, even worse yet, I sometimes have to hit the database > >> with two SELECT statements, one with OFFSET and LIMIT to get the page > >> of results I need and another to get the amount of total rows so I can > >> estimate how many pages of results are available. The point I am > >> driving at here is that since building a list of pages of results is > >> such a common thing to do, there need to be some specific high speed > >> ways to do this in one query. Maybe an estimate(*) that works like > >> count but gives an answer from the index without checking visibility? I > >> am sure that this would be good enough to make a page list, it is > >> really no big deal if it errors on the positive side, maybe the list of > >> pages has an extra page off the end. I can live with that. What I can't > >> live with is taking 13 seconds to get a page of results from 850,000 > >> rows in a table. > > > > 99% of the time in the situations you don't need an exact measure, and > > assuming analyze has run recently, select rel_tuples from pg_class for > > a given table is more than close enough. I'm sure wrapping that in a > > simple estimated_rows() function would be easy enough to do. > > First of all, it's not true. There are plenty of applications that need an > exact answer. Second, even if it is only 1%, that means it's 1% of the > queries, not 1% of people. Sooner or later a large fraction of developers > will run into this. It's probably been the most-asked question I've seen > on this forum in the four years I've been here. It's a real problem, and > it needs a real solution. > > I know it's a hard problem to solve, but can we stop hinting that those of > us who have this problem are somehow being dense? > > Thanks, > Craig That is why I suggested an estimate(*) that works like (a faster) count(*) except that it may be off a bit. I think that is what he was talking about when he wrote this. I don't think that anyone here is trying to cast any blame, we are just pointing out that there is a real problem here that involves what seems to be a very common task, and it is placing a huge disadvantage on the use of Postgres to other systems that can do it in less time. There doesn't seem to be any disagreement that count(*) is slower than it could be due to MVCC and other reasons, which is fine. 
However at the chopping end of the line, if a slow count(*) makes a Postgres driven website take say a minute to render a web page, it is completely useless if it can be replaced with a database engine that can do the same thing in (much) less time. On my servers, this is the major sticking point. There are so many people waiting on count(*), that the server runs out of memory and it is forced to stop accepting more connections until some of the threads finish. This makes many unhappy customers. When it comes to serving up web pages that contain a slice of a table with links to other slices, knowing about how many slices is very important. But I think that we can all agree that the exact amount is not a make or break (even better if the estimate is a bit high), so an estimate(*) function that takes some shortcuts here to get a much faster response (maybe off a bit) would solve a huge problem. What it all boils down to is webserver response time, and there are really two things that are slowing things down more than what should be needed. So there are really two possible solutions either of which would be a big help: 1. A faster count(*), or something like my proposed estimate(*). 2. A way to get the total rows matched when using LIMIT and OFFSET before LIMIT and OFFSET are applied. If you are making a web page that contains a few results of many possible results, you need two things for sure which means that there are really two problems with Postgres for doing this task. 1. You need to know (about) how many total rows. This requires a hit to the database which requires a scan of the table to get, there is no way to do this faster than count(*) as far as I know. 2. You need a slice of the data which requires another scan to the table to get, and using the same WHERE clause as above. This seems like a total waste, because we just did that with the exception of actually fetching the data. Why do it twice when if there was a way to get a slice using OFFSET and LIMIT and get the amount of rows that matched before the OFFSET and LIMIT was applied you could do the scan once? I think that this would improve things and give Postgres an edge over other systems. I hope this makes sense to at least one person in the right place. ;) -Neil-
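One way to get something like the estimate(*) Neil is describing, for an arbitrary WHERE clause, is to ask the planner for its row estimate instead of scanning anything. A commonly used sketch follows; its accuracy depends entirely on how fresh the ANALYZE statistics are, and it can be badly off for complex clauses:

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
    DECLARE
        rec   record;
        rows  integer;
    BEGIN
        -- Read the EXPLAIN output and pull out the planner's row estimate
        FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
            rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
            EXIT WHEN rows IS NOT NULL;
        END LOOP;
        RETURN rows;
    END;
    $$ LANGUAGE plpgsql VOLATILE STRICT;

    -- Hypothetical usage: the planner's guess, not an exact count
    SELECT count_estimate('SELECT 1 FROM items WHERE title ILIKE ''%postgres%''');

That answers problem 1 approximately and almost instantly, but does nothing for problem 2; the page of rows still needs its own query.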
On Mon, Oct 11, 2010 at 12:54 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote:
2. You need a slice of the data which requires another scan to the table to
get, and using the same WHERE clause as above. This seems like a total waste,
because we just did that with the exception of actually fetching the data.
Why do it twice when if there was a way to get a slice using OFFSET and LIMIT
and get the amount of rows that matched before the OFFSET and LIMIT was
applied you could do the scan once? I think that this would improve things and
give Postgres an edge over other systems.
I'd go even farther with number 2 and suggest a form of offset/limit that can either return the total count or accept a previously computed total and return it the same way, as if it were being computed. That would make the API even easier to use, since you could keep re-using the number returned the first time without changing the API that gets used depending upon context. Of course, you could contrive to set that up via a stored proc relatively easily by simply doing the count(*) once, then appending it to each row of the offset/limit query by including it in the select statement. Let it optionally receive the total to be used as an input parameter, which if not null will result in the count(*) block being skipped in the proc. You'd incur the full cost of the table scan plus offset/limit query once, but then not for each and every page. Since the modified API you suggest for offset/limit would surely have to perform the table scan once, that solution really isn't giving much more value than implementing it as a stored proc, other than the flexibility of executing an arbitrary query. Modified offset/limit combined with the count_estimate functionality would be very useful in this circumstance, though - especially if the estimate would just do a full count if the estimate is under a certain threshold. A 25% discrepancy when counting millions of rows is a lot less of an issue than a 25% discrepancy when counting 10 rows.
One issue with an estimation is that you must be certain that the estimate >= actual count or else the app must always attempt to load the page BEYOND the last page of the estimate in order to determine if the estimate must be revised upward. Otherwise, you risk leaving rows out entirely. Probably ok when returning search results. Not so much when displaying a list of assets.
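A sketch of that stored-proc idea, under the assumption of a hypothetical items table with an integer id and text title, a hard-coded filter, and 8.4-style default parameter values; a real version would take the filter as a parameter or build the query dynamically:

    CREATE OR REPLACE FUNCTION page_with_total(p_limit integer, p_offset integer,
                                               p_total bigint DEFAULT NULL)
    RETURNS TABLE (id integer, title text, total_rows bigint) AS $$
    DECLARE
        v_total bigint := p_total;
    BEGIN
        -- Only pay for the count when no precomputed total was passed in
        IF v_total IS NULL THEN
            SELECT count(*) INTO v_total
            FROM items i
            WHERE i.title ILIKE '%postgres%';
        END IF;

        RETURN QUERY
            SELECT i.id, i.title, v_total
            FROM items i
            WHERE i.title ILIKE '%postgres%'
            ORDER BY i.id
            LIMIT p_limit OFFSET p_offset;
    END;
    $$ LANGUAGE plpgsql;

    -- First page: count(*) runs once and comes back with every row
    SELECT * FROM page_with_total(25, 0);
    -- Later pages: hand the total back in and skip the count
    SELECT * FROM page_with_total(25, 25, 103456);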
On 10/11/2010 3:54 PM, Neil Whelchel wrote: > 1. A faster count(*), or something like my proposed estimate(*). > 2. A way to get the total rows matched when using LIMIT and OFFSET before > LIMIT and OFFSET are applied. The biggest single problem with "select count(*)" is that it is seriously overused. People use that idiom to establish existence, which usually leads to a performance disaster in the application using it, unless the table has no more than a few hundred records. The SQL language, of which PostgreSQL offers an excellent implementation, has offered the [NOT] EXISTS clause since its inception in the Jurassic era. The problem is with the sequential scan, not with counting. I'd even go as far as to suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
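For the existence-check case, the difference looks roughly like this (table and column names are placeholders); EXISTS can stop at the first matching row, while count(*) has to visit them all:

    -- Existence test: stops as soon as one row qualifies
    SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = 42);

    -- The overused idiom it replaces, which counts every matching row first
    SELECT count(*) > 0 FROM orders WHERE customer_id = 42;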
On Oct 10, 2010, at 10:14 AM, Mladen Gogala wrote: > > SQL> show parameter db_file_multi > > NAME TYPE VALUE > ------------------------------------ ----------- > ------------------------------ > db_file_multiblock_read_count integer 16 > SQL> alter session set db_file_multiblock_read_count=1; > > Session altered. > SQL> select count(*) from ni_occurrence; > > COUNT(*) > ---------- > 402062638 > > Elapsed: 00:08:20.88 > SQL> alter session set db_file_multiblock_read_count=128; > > Session altered. > > Elapsed: 00:00:00.50 > SQL> select count(*) from ni_occurrence; > > COUNT(*) > ---------- > 402062638 > > Elapsed: 00:02:17.58 > > > In other words, when I batched the sequential scan to do 128 blocks I/O, > it was 4 times faster than when I did the single block I/O. > Does that provide enough of an evidence and, if not, why not? > Did you tune the linux FS read-ahead first? You can get large gains by doing that if you are on ext3. blockdev --setra 2048 <device> would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously. > It maybe so, but slow sequential scan is still the largest single > performance problem of PostgreSQL. The frequency with which that topic > appears on the mailing lists should serve as a good evidence for that. I > did my best to prove my case. I'm not sure it's all the I/O, however. It seems that Postgres uses a lot more CPU than other DBs to crack open a tuple and inspect it. Testing on unindexed tables with count(*) I can get between 200MB and 800MB per second off disk max with full cpu utilization (depending on the average tuple size and contents). This is on a disk array that can do 1200MB/sec. It always feels disappointing to not be able to max out the disk throughput on the simplest possible query. > Again, requiring "hard numbers" when > using the database which doesn't allow tweaking of the I/O size is self > defeating proposition. The other databases, like DB2 and Oracle both > allow tweaking of that aspect of its operation, Oracle even on the per > session basis. If you still claim that it wouldn't make the difference, > the onus to prove it is on you. > > -- > Mladen Gogala > Sr. Oracle DBA > 1500 Broadway > New York, NY 10036 > (212) 329-5251 > www.vmsinfo.com
Joshua Tolley wrote: > It was asserted that reading bigger chunks would help performance; a response > suggested that, at least in Linux, setting readahead on a device would > essentially do the same thing. Or that's what I got from the thread, anyway. > I'm interested to know how similar performance might be between the large > block size case and the large readahead case. Large read-ahead addresses the complaint here (bulk reads are slow) just fine, which is one reason why this whole topic isn't nearly as interesting as claimed. Larger chunk sizes in theory will do the same thing, but then you're guaranteed to be reading larger blocks than necessary in all cases. The nice thing about a good adaptive read-ahead is that it can target small blocks normally, and only kick into heavy read-ahead mode when the I/O pattern justifies it. This is a problem for the operating system to solve, and such solutions out there are already good enough that PostgreSQL has little reason to try and innovate in this area. I routinely see seq scan throughput double on Linux just by tweaking read-ahead from the tiny defaults to a sane value. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On Oct 11, 2010, at 7:02 PM, Scott Carey wrote: > > On Oct 10, 2010, at 10:14 AM, Mladen Gogala wrote: > >> >> SQL> show parameter db_file_multi >> >> NAME TYPE VALUE >> ------------------------------------ ----------- >> ------------------------------ >> db_file_multiblock_read_count integer 16 >> SQL> alter session set db_file_multiblock_read_count=1; >> >> Session altered. >> SQL> select count(*) from ni_occurrence; >> >> COUNT(*) >> ---------- >> 402062638 >> >> Elapsed: 00:08:20.88 >> SQL> alter session set db_file_multiblock_read_count=128; >> >> Session altered. >> >> Elapsed: 00:00:00.50 >> SQL> select count(*) from ni_occurrence; >> >> COUNT(*) >> ---------- >> 402062638 >> >> Elapsed: 00:02:17.58 >> >> >> In other words, when I batched the sequential scan to do 128 blocks I/O, >> it was 4 times faster than when I did the single block I/O. >> Does that provide enough of an evidence and, if not, why not? >> > > Did you tune the linux FS read-ahead first? You can get large gains by doing that if you are on ext3. > blockdev --setra 2048 <device> > Scratch that, if you are using DirectIO, block read-ahead does nothing. The default is 128K for buffered I/O read-ahead. > would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer livedpostgres DB will get extreme > file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously. > >> It maybe so, but slow sequential scan is still the largest single >> performance problem of PostgreSQL. The frequency with which that topic >> appears on the mailing lists should serve as a good evidence for that. I >> did my best to prove my case. > > I'm not sure its all the I/O however. It seems that Postgres uses a lot more CPU than other DB's to crack open a tupleand inspect it. Testing on unindexed tables with count(*) I can get between 200MB and 800MB per second off disk maxwith full cpu utilization (depending on the average tuple size and contents). This is on a disk array that can do 1200MB/sec. It always feels dissapointing to not be able to max out the disk throughput on the simplest possible query. > >> Again, requiring "hard numbers" when >> using the database which doesn't allow tweaking of the I/O size is self >> defeating proposition. The other databases, like DB2 and Oracle both >> allow tweaking of that aspect of its operation, Oracle even on the per >> session basis. If you still claim that it wouldn't make the difference, >> the onus to prove it is on you. >> >> -- >> Mladen Gogala >> Sr. Oracle DBA >> 1500 Broadway >> New York, NY 10036 >> (212) 329-5251 >> www.vmsinfo.com >> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance
On 10/11/2010 10:02 PM, Scott Carey wrote: > Did you tune the linux FS read-ahead first? You can get large gains by doing that if you are on ext3. > blockdev --setra 2048<device> Actually, I have blockdev --setra 32768 > would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer livedpostgres DB will get extreme > file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously. I agree, but I am afraid that after the demise of SGI, XFS isn't being developed. The company adopted the policy of using only the plain vanilla Ext3, which is unfortunate, but I can't do much about it. There is a lesson to be learned from the story of ReiserFS. I am aware of the fact that Ext3 is rather basic, block oriented file system which doesn't perform well when compared to HPFS, VxFS or JFS2 and has no notion of extents, but I believe that I am stuck with it, until the advent of Ext4. BTW, there is no defragmenter for Ext4, not even on Ubuntu, which is rather bleeding edge distribution. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
On Monday 11 October 2010 16:50:36 you wrote: > On 10/11/2010 3:54 PM, Neil Whelchel wrote: > > 1. A faster count(*), or something like my proposed estimate(*). > > 2. A way to get the total rows matched when using LIMIT and OFFSET before > > LIMIT and OFFSET are applied. > > The biggest single problem with "select count(*)" is that it is > seriously overused. People use that idiom to establish existence, which > usually leads to a performance disaster in the application using it, > unless the table has no more than few hundred records. SQL language, of > which PostgreSQL offers an excellent implementation, offers [NOT] > EXISTS clause since its inception in the Jurassic era. The problem is > with the sequential scan, not with counting. I'd even go as far as to > suggest that 99% instances of the "select count(*)" idiom are probably > bad use of the SQL language. I agree, I have seen many very bad examples of using count(*). I will go so far as to question the use of count(*) in my examples here. Is there a better way to come up with a page list than using count(*)? What is the best method to make a page of results and a list of links to other pages of results? Am I barking up the wrong tree here? -Neil-
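One common way to sidestep both the count and the increasingly expensive large OFFSET is keyset ("seek") pagination: remember the last key shown and ask for the next slice, offering next/previous links instead of a numbered page list. A sketch with hypothetical names:

    -- First page: ask for one row more than the page size; 26 rows back means there is a next page
    SELECT id, title
    FROM items
    WHERE title ILIKE '%postgres%'
    ORDER BY id
    LIMIT 26;

    -- Next page: continue after the last id shown; no OFFSET, no count(*)
    SELECT id, title
    FROM items
    WHERE title ILIKE '%postgres%' AND id > 1234
    ORDER BY id
    LIMIT 26;

It only gives next/previous navigation rather than a full "1 2 3 ... N" page list, so it trades away exactly the thing count(*) was being used for; whether that trade is acceptable depends on the UI.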
On Monday 11 October 2010 19:23:46 Mladen Gogala wrote: > On 10/11/2010 10:02 PM, Scott Carey wrote: > > Did you tune the linux FS read-ahead first? You can get large gains by > > doing that if you are on ext3. blockdev --setra 2048<device> > > Actually, I have blockdev --setra 32768 > > > would give you a 1MB read-ahead. Also, consider XFS and its built-in > > defragmentation. I have found that a longer lived postgres DB will get > > extreme file fragmentation over time and sequential scans end up mostly > > random. On-line file defrag helps tremendously. > > I agree, but I am afraid that after the demise of SGI, XFS isn't being > developed. The company adopted the policy of using only the plain > vanilla Ext3, which is unfortunate, but I can't do much about it. There > is a lesson to be learned from the story of ReiserFS. I am aware of the > fact that Ext3 is rather basic, block oriented file system which doesn't > perform well when compared to HPFS, VxFS or JFS2 and has no notion of > extents, but I believe that I am stuck with it, until the advent of > Ext4. BTW, there is no defragmenter for Ext4, not even on Ubuntu, which > is rather bleeding edge distribution. When it comes to a database that has many modifications to its tables, it seems that XFS pulls way ahead of other filesystems (likely) because of its on-line defragmentation among other reasons. I am not sure that XFS is not (properly) maintained. The people at xfs.org seem to be making steady progress, and high quality updates. I have been using it for some time now (9+ years), and as it is, it does everything I need it to do, and it is very reliable. I really can't see anything changing in the next few years that would effect its usability as a filesystem for Postgres, so until something proves to be better, I can't understand why you wouldn't want to use it, maintained or not. -Neil-
On Mon, Oct 11, 2010 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
This is a problem for the operating system to solve, and such solutions out there are already good enough that PostgreSQL has little reason to try and innovate in this area. I routinely see seq scan throughput double on Linux just by tweaking read-ahead from the tiny defaults to a sane value.
I spent some time going through the various tuning docs on the wiki while bringing some new hardware up and I can't remember seeing any discussion of tweaking read-ahead at all in the normal performance-tuning references. Do you have any documentation of the kinds of tweaking you have done and its effects on different types of workloads?
I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O. For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).
Linux read-ahead has no effect on random access performance. A workload consisting of mixed sequential scans and random reads can be tuned to favor one over the other based on a combination of the I/O scheduler used and the amount of read-ahead. Larger read-ahead helps sequential scans, and the Deadline scheduler tends to favor throughput (sequential scans) over latency (random access) compared to the cfq scheduler.
On Oct 11, 2010, at 8:58 PM, Samuel Gendler wrote:
On Mon, Oct 11, 2010 at 7:19 PM, Greg Smith <greg@2ndquadrant.com> wrote:
This is a problem for the operating system to solve, and such solutions out there are already good enough that PostgreSQL has little reason to try and innovate in this area. I routinely see seq scan throughput double on Linux just by tweaking read-ahead from the tiny defaults to a sane value. I spent some time going through the various tuning docs on the wiki while bringing some new hardware up and I can't remember seeing any discussion of tweaking read-ahead at all in the normal performance-tuning references. Do you have any documentation of the kinds of tweaking you have done and its effects on different types of workloads?
On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com> wrote:
I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O. For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).
Geez. I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs. I couldn't find anything that made a strong filesystem recommendation. How does xfs compare to ext4? I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs. If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.
As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows). We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed the queries that touch less frequently used partitions, that would be good. I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists. I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.
On Mon, 11 Oct 2010, Samuel Gendler wrote: > On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com> wrote: > >> I can't speak to documentation, but it is something that helps as your I/O >> subsystem gets more powerful, and how much it helps depends more on your >> hardware, which may have adaptive read ahead on its own, and your file >> system which may be more or less efficient at sequential I/O. For example >> ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS >> on a DELL PERC6 RAID card (but still ends up slower). >> >> > Geez. I wish someone would have written something quite so bold as 'xfs is > always faster than ext3' in the standard tuning docs. I couldn't find > anything that made a strong filesystem recommendation. How does xfs compare > to ext4? I wound up on ext4 on a dell perc6 raid card when an unexpected > hardware failure on a production system caused my test system to get thrown > into production before I could do any serious testing of xfs. If there is a > strong consensus that xfs is simply better, I could afford the downtime to > switch. Unfortunately you are not going to get a clear opinion here. ext3 has a long track record, and since it is the default, it gets a lot of testing. It does have known issues. XFS had problems on Linux immediately after it was ported. It continues to be improved and many people have been using it for years and trust it. XFS does have a weakness in creating/deleting large numbers of small files. ext4 is the new kid on the block. It claims good things, but it's so new that many people don't trust it yet. btrfs is the 'future of filesystems' that is supposed to be better than anything else, but it's definitely not stable yet, and time will tell if it really lives up to its promises. And this is just on Linux; on BSD or Solaris (or with out-of-kernel patches) you also have ZFS, which some people swear by, and other people swear at. David Lang > As it happens, this is a system where all of the heavy workload is in the > form of sequential scan type load. The OLTP workload is very minimal (tens > of queries per minute on a small number of small tables), but there are a > lot of reporting queries that wind up doing sequential scans of large > partitions (millions to tens of millions of rows). We've sized the new > hardware so that the most commonly used partitions fit into memory, but if > we could speed the queries that touch less frequently used partitions, that > would be good. I'm the closest thing our team has to a DBA, which really > only means that I'm the one person on the dev team or the ops team to have > read all of the postgres docs and wiki and the mailing lists. I claim no > actual DBA experience or expertise and have limited cycles to devote to > tuning and testing, so if there is an established wisdom for filesystem > choice and read ahead tuning, I'd be very interested in hearing it. >
Mladen Gogala wrote: > I agree, but I am afraid that after the demise of SGI, XFS isn't being > developed. It's back to being well maintained again; see http://blog.2ndquadrant.com/en/2010/04/the-return-of-xfs-on-linux.html for some history here and why it's become relevant to RedHat in particular recently. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
Samuel Gendler wrote: > I spent some time going through the various tuning docs on the wiki > whie bringing some new hardware up and I can't remember seeing any > discussion of tweaking read-ahead at all in the normal > performance-tuning references. Do you have any documentation of the > kinds of tweaking you have done and its effects on different types of > workloads? Much of my recent research has gone into the book you'll see plugged below rather than the wiki. The basics of read-ahead tuning is that you can see it increase bonnie++ sequential read results when you increase it, to a point. Get to that point and stop and you should be in good shape. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
11.10.10 20:46, Craig James wrote: > > First of all, it's not true. There are plenty of applications that > need an exact answer. Second, even if it is only 1%, that means it's > 1% of the queries, not 1% of people. Sooner or later a large fraction > of developers will run into this. It's probably been the most-asked > question I've seen on this forum in the four years I've been here. > It's a real problem, and it needs a real solution. > > I know it's a hard problem to solve, but can we stop hinting that > those of us who have this problem are somehow being dense? > BTW: There is a lot of talk about MVCC, but is the following solution possible: 1) Create a page information map that for each page in the table will tell you how many rows are within and whether any write (either successful or not) was done to this page. This could even be two maps, to make the second one really small (a bit per page) - so that it could be in memory most of the time. 2) When you need to do count(*) or an index check - first check if there were no writes to the page. If not - you can use count information from page info/index data without going to the page itself. 3) Let vacuum clear the bit after freezing all the tuples in the page (am I using terminology correctly?). In this case all read-only (archive) data will have this bit off and index/count(*) will be really fast. Am I missing something? Best regards, Vitalii Tymchyshyn.
On 10/12/2010 03:56 PM, Vitalii Tymchyshyn wrote: > BTW: There is a lot of talk about MVCC, but is next solution possible: > 1) Create a page information map that for each page in the table will > tell you how may rows are within and if any write (either successful or > not) were done to this page. This even can be two maps to make second > one really small (a bit per page) - so that it could be most time > in-memory. > 2) When you need to to count(*) or index check - first check if there > were no writes to the page. If not - you can use count information from > page info/index data without going to the page itself > 3) Let vacuum clear the bit after frozing all the tuples in the page (am > I using terminology correctly?). Part of this already exists. It's called the visibility map, and is present in 8.4 and above. It's not currently used for queries, but can potentially be used to aid some kinds of query. http://www.postgresql.org/docs/8.4/static/storage-vm.html > In this case all read-only (archive) data will be this bit off and > index/count(*) will be really fast. A count with any joins or filter criteria would still have to scan all pages with visible tuples in them. So the visibility map helps speed up scanning of bloated tables, but doesn't provide a magical "fast count" except in the utterly trivial "select count(*) from tablename;" case, and can probably only be used for accurate results when there are no read/write transactions currently open. Even if you kept a count of tuples in each page along with the mvcc transaction ID information required to determine for which transactions that count is valid, it'd only be useful if you didn't have to do any condition checks, and it'd be yet another thing to update with every insert/delete/update. Perhaps for some users that'd be worth having, but it seems to me like it'd have pretty narrow utility. I'm not sure that's the answer. -- Craig Ringer
On Tue, 12 Oct 2010, Craig Ringer wrote: > >> BTW: There is a lot of talk about MVCC, but is next solution possible: >> 1) Create a page information map that for each page in the table will >> tell you how may rows are within and if any write (either successful or >> not) were done to this page. This even can be two maps to make second >> one really small (a bit per page) - so that it could be most time >> in-memory. >> 2) When you need to to count(*) or index check - first check if there >> were no writes to the page. If not - you can use count information from >> page info/index data without going to the page itself >> 3) Let vacuum clear the bit after frozing all the tuples in the page (am >> I using terminology correctly?). > > Part of this already exists. It's called the visibility map, and is present > in 8.4 and above. It's not currently used for queries, but can potentially be > used to aid some kinds of query. > > http://www.postgresql.org/docs/8.4/static/storage-vm.html > >> In this case all read-only (archive) data will be this bit off and >> index/count(*) will be really fast. > > A count with any joins or filter criteria would still have to scan all pages > with visible tuples in them. So the visibility map helps speed up scanning of > bloated tables, but doesn't provide a magical "fast count" except in the > utterly trivial "select count(*) from tablename;" case, and can probably only > be used for accurate results when there are no read/write transactions > currently open. Even if you kept a count of tuples in each page along with > the mvcc transaction ID information required to determine for which > transactions that count is valid, it'd only be useful if you didn't have to > do any condition checks, and it'd be yet another thing to update with every > insert/delete/update. > > Perhaps for some users that'd be worth having, but it seems to me like it'd > have pretty narrow utility. I'm not sure that's the answer. from a PR point of view, speeding up the trivil count(*) case could be worth it, just to avoid people complaining about it not being fast. in the case where you are doing a count(*) where query and the where is on an indexed column, could the search just look at the index + the visibility mapping rather than doing an sequential search through the table? as for your worries about the accuracy of a visibility based count in the face of other transactions, wouldn't you run into the same issues if you are doing a sequential scan with the same transactions in process? David Lang
12.10.10 11:14, Craig Ringer wrote:
> On 10/12/2010 03:56 PM, Vitalii Tymchyshyn wrote:
>> BTW: There is a lot of talk about MVCC, but is next solution possible:
>> 1) Create a page information map that for each page in the table will tell you how may rows are within and if any write (either successful or not) were done to this page. This even can be two maps to make second one really small (a bit per page) - so that it could be most time in-memory.
>> 2) When you need to to count(*) or index check - first check if there were no writes to the page. If not - you can use count information from page info/index data without going to the page itself
>> 3) Let vacuum clear the bit after frozing all the tuples in the page (am I using terminology correctly?).
>
> Part of this already exists. It's called the visibility map, and is present in 8.4 and above. It's not currently used for queries, but can potentially be used to aid some kinds of query.
>
> http://www.postgresql.org/docs/8.4/static/storage-vm.html
>
>> In this case all read-only (archive) data will be this bit off and index/count(*) will be really fast.
>
> A count with any joins or filter criteria would still have to scan all pages with visible tuples in them.

Only if one doesn't use partitioning. With proper partitioning, the filter can simply select partitions. Also, filtering can be mapped onto the index lookup. And if one could join the index and the visibility map, much like two indexes can be bit-joined now, count could be really fast for all but non-frozen tuples.

> So the visibility map helps speed up scanning of bloated tables, but doesn't provide a magical "fast count" except in the utterly trivial "select count(*) from tablename;" case, and can probably only be used for accurate results when there are no read/write transactions currently open.

Why so? You simply have to recount the pages that are marked dirty in the usual way. But the count problem usually occurs when there is a lot of archive data (you need to count over 100K records) that is not modified.

Best regards, Vitalii Tymchyshyn
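To illustrate the partitioning point: with 8.x-style inheritance partitioning and constraint_exclusion enabled, a count restricted to one partition's date range only scans that child table. A minimal sketch, with hypothetical table and column names:

    CREATE TABLE events (logdate date NOT NULL, payload text);
    CREATE TABLE events_y2010m09 (
        CHECK (logdate >= DATE '2010-09-01' AND logdate < DATE '2010-10-01')
    ) INHERITS (events);
    CREATE TABLE events_y2010m10 (
        CHECK (logdate >= DATE '2010-10-01' AND logdate < DATE '2010-11-01')
    ) INHERITS (events);

    SET constraint_exclusion = on;

    -- Only events_y2010m09 (plus the empty parent) is scanned here;
    -- the other partitions are excluded by their CHECK constraints.
    SELECT count(*) FROM events
    WHERE logdate >= DATE '2010-09-01' AND logdate < DATE '2010-10-01';

This doesn't make the count O(1), but it does bound the scan to the partitions that can actually match the filter.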
On 10/12/2010 04:22 PM, david@lang.hm wrote: > from a PR point of view, speeding up the trivil count(*) case could be > worth it, just to avoid people complaining about it not being fast. At the cost of a fair bit more complexity, though, and slowing everything else down. The proper solution here remains, IMO, support for visibility information in indexes, whether by storing it once in the index and once in the heap (ouch!), storing it out-of-line, or using a covering index where one or more columns are stored wholly in the index not in the table heap at all. Here are a few of the many past discussions about this that have already covered some of the same ground: http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html (and the rest of the thread) A decent look with Google will find many, many more. > in the case where you are doing a count(*) where query and the where is > on an indexed column, could the search just look at the index + the > visibility mapping rather than doing an sequential search through the > table? Nope, because the visibility map, which is IIRC only one bit per page, doesn't record how many tuples there are on the page, or enough information about them to determine how many of them are visible to the current transaction*. > as for your worries about the accuracy of a visibility based count in > the face of other transactions, wouldn't you run into the same issues if > you are doing a sequential scan with the same transactions in process? No. Every tuple in a table heap in postgresql has hidden fields, some of which are used to determine whether the current transaction* can "see" the tuple - it may have been inserted after this transaction started, or deleted before this transaction started, so it's not visible to this transaction but may still be to others. http://www.postgresql.org/docs/current/static/ddl-system-columns.html This information isn't available in the visibility map, or in indexes. That's why PostgreSQL has to hit the heap to find it. * current transaction should really be "current snapshot". The snapshot is taken at the start of the whole transaction for SERIALIZABLE isolation, and at the start of each statement for READ COMMITTED isolation. -- Craig Ringer
david@lang.hm wrote:
> from a PR point of view, speeding up the trivil count(*) case could be worth it, just to avoid people complaining about it not being fast.

Fixing PR stuff is not the approach that I would take. People are complaining about select count(*) because they're using it in all the wrong places. My assessment that there is a problem with sequential scan was wrong. Now, let's again take Oracle as the measure. Someone asked me about caching the data. Here it is:

SQL> connect system/*********
Connected.
SQL> alter system flush buffer_cache;

System altered.

Elapsed: 00:00:12.68
SQL> connect adbase/*********
Connected.
SQL> alter session set db_file_multiblock_read_Count=128;

Session altered.

Elapsed: 00:00:00.41
SQL> select count(*) from ni_occurrence;

  COUNT(*)
----------
 402062638

Elapsed: 00:02:37.77

SQL> select bytes/1048576 MB from user_segments
  2  where segment_name='NI_OCCURRENCE';

        MB
----------
     35329

Elapsed: 00:00:00.20
SQL>

So, the results weren't cached the first time around. The explanation is the fact that Oracle, as of the version 10.2.0, reads the table in the private process memory, not in the shared buffers. This table alone is 35GB in size; Oracle took 2 minutes 47 seconds to read it using the full table scan. If I do the same thing with PostgreSQL and a comparable table, Postgres is, in fact, faster:

psql (9.0.1)
Type "help" for help.

news=> \timing
Timing is on.
news=> select count(*) from moreover_documents_y2010m09;
  count
----------
 17242655
(1 row)

Time: 113135.114 ms
news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09'));
 pg_size_pretty
----------------
 27 GB
(1 row)

Time: 100.849 ms
news=>

The number of rows is significantly smaller, but the table contains a rather significant "text" field which consumes quite a bit of TOAST storage, and the sizes are comparable. Postgres read through 27GB in 113 seconds, less than 2 minutes, and Oracle took 2 minutes 37 seconds to read through 35GB. I stand corrected: there is nothing wrong with the speed of the Postgres sequential scan.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
www.vmsinfo.com
On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala <mladen.gogala@vmsinfo.com> wrote: > > So, the results weren't cached the first time around. The explanation is the > fact that Oracle, as of the version 10.2.0, reads the table in the private > process memory, not in the shared buffers. This table alone is 35GB in > size, Oracle took 2 minutes 47 seconds to read it using the full table > scan. If I do the same thing with PostgreSQL and a comparable table, > Postgres is, in fact, faster: Well, I didn't quite mean that - having no familiarity with Oracle I don't know what the alter system statement does, but I was talking specifically about the linux buffer and page cache. The easiest way to drop the linux caches in one fell swoop is: echo 3 > /proc/sys/vm/drop_caches Is there a command to tell postgresql to drop/clear/reset it's buffer_cache? Clearing/dropping both the system (Linux) and the DB caches is important when doing benchmarks that involve I/O. -- Jon
Jon Nelson wrote: > Is there a command to tell postgresql to drop/clear/reset it's buffer_cache? > No. Usually the sequence used to remove all cached data from RAM before a benchmark is: pg_ctl stop sync echo 3 > /proc/sys/vm/drop_caches pg_ctl start -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On Tue, Oct 12, 2010 at 3:07 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > On Tue, Oct 12, 2010 at 7:27 AM, Mladen Gogala > <mladen.gogala@vmsinfo.com> wrote: >> >> So, the results weren't cached the first time around. The explanation is the >> fact that Oracle, as of the version 10.2.0, reads the table in the private >> process memory, not in the shared buffers. This table alone is 35GB in >> size, Oracle took 2 minutes 47 seconds to read it using the full table >> scan. If I do the same thing with PostgreSQL and a comparable table, >> Postgres is, in fact, faster: > > Well, I didn't quite mean that - having no familiarity with Oracle I > don't know what the alter system statement does, but I was talking > specifically about the linux buffer and page cache. The easiest way to > drop the linux caches in one fell swoop is: > > echo 3 > /proc/sys/vm/drop_caches AFAIK this won't affect Oracle when using direct IO (which bypasses the page cache). Luca
On Tue, Oct 12, 2010 at 8:18 AM, Greg Smith <greg@2ndquadrant.com> wrote: > No. Usually the sequence used to remove all cached data from RAM before a > benchmark is: All cached data (as cached in postgresql - *not* the Linux system caches)..., right? -- Jon
Mladen Gogala <mladen.gogala@vmsinfo.com> writes: > The number of rows is significantly smaller, but the table contains > rather significant "text" field which consumes quite a bit of TOAST > storage and the sizes are comparable. Postgres read through 27GB in 113 > seconds, less than 2 minutes and oracle took 2 minutes 37 seconds to > read through 35GB. I stand corrected: there is nothing wrong with the > speed of the Postgres sequential scan. Um ... the whole point of TOAST is that the data isn't in-line. So what Postgres was actually reading through was probably quite a lot less than 27Gb. It's probably hard to make a completely apples-to-apples comparison because the two databases are so different, but I don't think this one proves that PG is faster than Oracle. regards, tom lane
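Tom's point about TOAST can be checked directly. In 9.0, something along these lines (reusing the table name from the example above) shows how much of the total is main heap versus TOAST and other forks, i.e. roughly how much a seqscan of the base table actually has to read:

    SELECT pg_size_pretty(pg_relation_size('moreover_documents_y2010m09')) AS main_heap,
           pg_size_pretty(pg_table_size('moreover_documents_y2010m09')
                          - pg_relation_size('moreover_documents_y2010m09')) AS toast_and_other;

pg_relation_size() counts only the main fork, while pg_table_size() also includes the TOAST table and the free space and visibility maps, so the difference is a reasonable estimate of the out-of-line data.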
Neil Whelchel <neil.whelchel@gmail.com> wrote:
> What is the best method to make a page of results and a list of links to other pages of results?

For our most heavily used web app we decided to have the renderer just read the list of cases and render the pages to disk, and then present the first one. We set a limit of 500 entries on the list; if we get past 500 we put up a page telling them to refine their search criteria. That won't work for all circumstances, but it works well for our web app.

-Kevin
Tom Lane wrote: > Mladen Gogala <mladen.gogala@vmsinfo.com> writes: > >> The number of rows is significantly smaller, but the table contains >> rather significant "text" field which consumes quite a bit of TOAST >> storage and the sizes are comparable. Postgres read through 27GB in 113 >> seconds, less than 2 minutes and oracle took 2 minutes 37 seconds to >> read through 35GB. I stand corrected: there is nothing wrong with the >> speed of the Postgres sequential scan. >> > > Um ... the whole point of TOAST is that the data isn't in-line. > So what Postgres was actually reading through was probably quite a > lot less than 27Gb. It's probably hard to make a completely > apples-to-apples comparison because the two databases are so different, > but I don't think this one proves that PG is faster than Oracle. > > regards, tom lane > As is usually the case, you're right. I will try copying the table to Postgres over the weekend, my management would not look kindly upon my copying 35GB of the production data during the working hours, for the scientific reasons. I have the storage and I can test, I will post the result. I developed quite an efficient Perl script which does copying without the intervening CSV file, so that the copy should not take more than 2 hours. I will be able to impose a shared lock on the table over the weekend, so that I don't blow away the UNDO segments. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 http://www.vmsinfo.com The Leader in Integrated Media Intelligence Solutions
>> The biggest single problem with "select count(*)" is that it is seriously overused. People use that idiom to establish existence, which usually leads to a performance disaster in the application using it, unless the table has no more than few hundred records. SQL language, of which PostgreSQL offers an excellent implementation, offers [NOT] EXISTS clause since its inception in the Jurassic era. The problem is with the sequential scan, not with counting. I'd even go as far as to suggest that 99% instances of the "select count(*)" idiom are probably bad use of the SQL language.
>
> I agree, I have seen many very bad examples of using count(*). I will go so far as to question the use of count(*) in my examples here. It there a better way to come up with a page list than using count(*)? What is the best method to make a page of results and a list of links to other pages of results? Am I barking up the wrong tree here?

One way I have dealt with this on very large tables is to cache the count(*) at the application level (using memcached, terracotta, or something along those lines) and then increment that cache whenever you add a row to the relevant table. On application restart that cache is re-initialized with a regular old count(*). This approach works really well and all large systems in my experience need caching in front of the DB eventually. If you have a simpler system with say a single application/web server you can simply store the value in a variable, the specifics would depend on the language and framework you are using.
Another more all-DB approach is to create a statistics tables into which you place aggregated statistics rows (num deleted, num inserted, totals, etc) at an appropriate time interval in your code. So you have rows containing aggregated statistics information for the past and some tiny portion of the new data happening right now that hasn't yet been aggregated. Queries then look like a summation of the aggregated values in the statistics table plus a count(*) over just the newest portion of the data table and are generally very fast.
Overall I have found that once things get big the layers of your app stack start to blend together and have to be combined in clever ways to keep speed up. Postgres is a beast but when you run into things it can't do well just find a way to cache it or make it work together with some other persistence tech to handle those cases.
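To make the aggregation approach described above concrete, here is a rough sketch; all names and the exact schema are assumptions for illustration, not anything posted in this thread. A summary table is rolled forward periodically, and the query combines it with a count over only the not-yet-aggregated tail of the data table:

    CREATE TABLE page_view_stats (
        bucket_start timestamptz PRIMARY KEY,
        views        bigint NOT NULL
    );

    -- periodic roll-up job: aggregate completed hours that aren't in the summary yet
    INSERT INTO page_view_stats (bucket_start, views)
    SELECT date_trunc('hour', viewed_at), count(*)
    FROM page_views
    WHERE viewed_at < date_trunc('hour', now())
      AND viewed_at >= (SELECT coalesce(max(bucket_start) + interval '1 hour',
                                        '-infinity') FROM page_view_stats)
    GROUP BY 1;

    -- "total views": cheap sum over the summary plus a count over only
    -- the newest, not-yet-aggregated rows
    SELECT (SELECT coalesce(sum(views), 0) FROM page_view_stats)
         + (SELECT count(*) FROM page_views
            WHERE viewed_at >= date_trunc('hour', now())) AS total_views;

The count(*) at the end only ever touches the current hour's rows, so it stays fast no matter how large the history grows.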
On Sat, Oct 9, 2010 at 4:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > Maybe an > estimate(*) that works like count but gives an answer from the index without > checking visibility? I am sure that this would be good enough to make a page > list, it is really no big deal if it errors on the positive side, maybe the > list of pages has an extra page off the end. I can live with that. What I > can't live with is taking 13 seconds to get a page of results from 850,000 > rows in a table. > -Neil- > FWIW, Michael Fuhr wrote a small function to parse the EXPLAIN plan a few years ago and it works pretty well assuming your stats are up to date. http://markmail.org/message/gknqthlwry2eoqey
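I don't have Michael Fuhr's exact function handy, but the general shape is a function that runs EXPLAIN on the supplied query and pulls the planner's row estimate out of the top plan line. A sketch along those lines (the details here are assumed, not his actual code):

    CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
    DECLARE
        rec  record;
        rows integer;
    BEGIN
        FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
            -- the top node's line looks like:  Seq Scan on t  (cost=... rows=NNN width=...)
            rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
            EXIT WHEN rows IS NOT NULL;
        END LOOP;
        RETURN rows;
    END;
    $$ LANGUAGE plpgsql STRICT;

    -- e.g.: SELECT count_estimate('SELECT * FROM foo WHERE bar = 42');

It is only as good as the planner's statistics for the supplied WHERE clause, but for building a page list that is usually good enough.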
Jon Nelson <jnelson+pgsql@jamponi.net> wrote: > Greg Smith <greg@2ndquadrant.com> wrote: >> Usually the sequence used to remove all cached data from RAM >> before a benchmark is: > > All cached data (as cached in postgresql - *not* the Linux system > caches)..., right? No. The stop and start of PostgreSQL causes empty PostgreSQL caches. These lines, in between the stop and start, force the Linux cache to be empty (on recent kernel versions): sync echo 3 > /proc/sys/vm/drop_caches -Kevin
On 10/11/10 8:02 PM, Scott Carey wrote:
> would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously.

We just had a corrupt table caused by an XFS online defrag. I'm scared to use this again while the db is live. Has anyone else found this to be safe? But, I can vouch for the fragmentation issue, it happens very quickly in our system.

-Dan
On Tue, 12 Oct 2010, Joe Uhl wrote: >>> The biggest single problem with "select count(*)" is that it is >>> seriously overused. People use that idiom to establish existence, which >>> usually leads to a performance disaster in the application using it, >>> unless the table has no more than few hundred records. SQL language, of >>> which PostgreSQL offers an excellent implementation, offers [NOT] >>> EXISTS clause since its inception in the Jurassic era. The problem is >>> with the sequential scan, not with counting. I'd even go as far as to >>> suggest that 99% instances of the "select count(*)" idiom are probably >>> bad use of the SQL language. >> >> I agree, I have seen many very bad examples of using count(*). I will go so >> far as to question the use of count(*) in my examples here. It there a >> better >> way to come up with a page list than using count(*)? What is the best >> method >> to make a page of results and a list of links to other pages of results? Am >> I >> barking up the wrong tree here? > One way I have dealt with this on very large tables is to cache the count(*) > at the application level (using memcached, terracotta, or something along > those lines) and then increment that cache whenever you add a row to the > relevant table. On application restart that cache is re-initialized with a > regular old count(*). This approach works really well and all large systems > in my experience need caching in front of the DB eventually. If you have a > simpler system with say a single application/web server you can simply store > the value in a variable, the specifics would depend on the language and > framework you are using. this works if you know ahead of time what the criteria of the search is going to be. so it will work for select count(*) from table; what this won't work for is cases wher the criteria of the search is unpredictable, i.e. ask the user for input select count(*) from table where field=$input; David Lang > Another more all-DB approach is to create a statistics tables into which you > place aggregated statistics rows (num deleted, num inserted, totals, etc) at > an appropriate time interval in your code. So you have rows containing > aggregated statistics information for the past and some tiny portion of the > new data happening right now that hasn't yet been aggregated. Queries then > look like a summation of the aggregated values in the statistics table plus a > count(*) over just the newest portion of the data table and are generally > very fast. > > Overall I have found that once things get big the layers of your app stack > start to blend together and have to be combined in clever ways to keep speed > up. Postgres is a beast but when you run into things it can't do well just > find a way to cache it or make it work together with some other persistence > tech to handle those cases. > > >
On Tue, 12 Oct 2010, Mladen Gogala wrote: > david@lang.hm wrote: >> from a PR point of view, speeding up the trivil count(*) case could be >> worth it, just to avoid people complaining about it not being fast. >> >> > Fixing PR stuff is not the approach that I would take. People are complaining > about select count(*) because they're using it in all the wrong places. that may be the case, but if it's possible to make it less painful it will mean more people use postgres, both because it works better for them when they are using the suboptimal programs, but also because when people do their trivial testing of databases to decide which one they will use, they won't rule out postgres because "it's so slow" the fact of the matter is that people do use count(*), and even though there are usually ways to avoid doing so, having the programmer have to do something different for postgres than they do for other databases is raising a barrier against postgres untilization in anything. David Lang > My > assessment that there is a problem with sequential scan was wrong. Now, let's > again take Oracle as the measure. > Someone asked me about caching the data. Here it is: > > SQL> connect system/********* > Connected. > SQL> alter system flush buffer_cache; > > System altered. > > Elapsed: 00:00:12.68 > SQL> connect adbase/********* > Connected. > SQL> alter session set db_file_multiblock_read_Count=128; > > Session altered. > > Elapsed: 00:00:00.41 > SQL> select count(*) from ni_occurrence; > > COUNT(*) > ---------- > 402062638 > > Elapsed: 00:02:37.77 > > SQL> select bytes/1048576 MB from user_segments > 2 where segment_name='NI_OCCURRENCE'; > > MB > ---------- > 35329 > > Elapsed: 00:00:00.20 > SQL> > > > So, the results weren't cached the first time around. The explanation is the > fact that Oracle, as of the version 10.2.0, reads the table in the private > process memory, not in the shared buffers. This table alone is 35GB in > size, Oracle took 2 minutes 47 seconds to read it using the full table scan. > If I do the same thing with PostgreSQL and a comparable table, Postgres is, > in fact, faster: > > psql (9.0.1) > Type "help" for help. > > news=> \timing > Timing is on. > news=> select count(*) from moreover_documents_y2010m09; > count ---------- > 17242655 > (1 row) > > Time: 113135.114 ms > news=> select pg_size_pretty(pg_table_size('moreover_documents_y2010m09')); > pg_size_pretty > ---------------- > 27 GB > (1 row) > > Time: 100.849 ms > news=> > > The number of rows is significantly smaller, but the table contains rather > significant "text" field which consumes quite a bit of TOAST storage and the > sizes are comparable. Postgres read through 27GB in 113 seconds, less than 2 > minutes and oracle took 2 minutes 37 seconds to read through 35GB. I stand > corrected: there is nothing wrong with the speed of the Postgres sequential > scan. > > >
On Tue, 12 Oct 2010, Craig Ringer wrote: > On 10/12/2010 04:22 PM, david@lang.hm wrote: > >> from a PR point of view, speeding up the trivil count(*) case could be >> worth it, just to avoid people complaining about it not being fast. > > At the cost of a fair bit more complexity, though, and slowing everything > else down. complexity probably, although given how complex the planner is already is this significant? as far as slowing everything else down, why would it do that? (beyond the simple fact that any new thing the planner can do makes the planner take a little longer) David Lang > The proper solution here remains, IMO, support for visibility information in > indexes, whether by storing it once in the index and once in the heap > (ouch!), storing it out-of-line, or using a covering index where one or more > columns are stored wholly in the index not in the table heap at all. > > Here are a few of the many past discussions about this that have already > covered some of the same ground: > > http://stackoverflow.com/questions/839015/postgres-could-an-index-organized-tables-paved-way-for-faster-select-count-fr > > http://osdir.com/ml/db.postgresql.performance/2003-10/msg00075.html > (and the rest of the thread) > > A decent look with Google will find many, many more. > >> in the case where you are doing a count(*) where query and the where is >> on an indexed column, could the search just look at the index + the >> visibility mapping rather than doing an sequential search through the >> table? > > Nope, because the visibility map, which is IIRC only one bit per page, > doesn't record how many tuples there are on the page, or enough information > about them to determine how many of them are visible to the current > transaction*. > >> as for your worries about the accuracy of a visibility based count in >> the face of other transactions, wouldn't you run into the same issues if >> you are doing a sequential scan with the same transactions in process? > > No. Every tuple in a table heap in postgresql has hidden fields, some of > which are used to determine whether the current transaction* can "see" the > tuple - it may have been inserted after this transaction started, or deleted > before this transaction started, so it's not visible to this transaction but > may still be to others. > > http://www.postgresql.org/docs/current/static/ddl-system-columns.html > > This information isn't available in the visibility map, or in indexes. That's > why PostgreSQL has to hit the heap to find it. > > * current transaction should really be "current snapshot". The snapshot is > taken at the start of the whole transaction for SERIALIZABLE isolation, and > at the start of each statement for READ COMMITTED isolation. > > -- > Craig Ringer >
On Oct 11, 2010, at 9:21 PM, Samuel Gendler wrote:
> On Mon, Oct 11, 2010 at 9:06 PM, Scott Carey <scott@richrelevance.com> wrote:
>> I can't speak to documentation, but it is something that helps as your I/O subsystem gets more powerful, and how much it helps depends more on your hardware, which may have adaptive read ahead on its own, and your file system which may be more or less efficient at sequential I/O. For example ext3 out of the box gets a much bigger gain from tuning read-ahead than XFS on a DELL PERC6 RAID card (but still ends up slower).
>
> Geez. I wish someone would have written something quite so bold as 'xfs is always faster than ext3' in the standard tuning docs. I couldn't find anything that made a strong filesystem recommendation. How does xfs compare to ext4? I wound up on ext4 on a dell perc6 raid card when an unexpected hardware failure on a production system caused my test system to get thrown into production before I could do any serious testing of xfs. If there is a strong consensus that xfs is simply better, I could afford the downtime to switch.
>
> As it happens, this is a system where all of the heavy workload is in the form of sequential scan type load. The OLTP workload is very minimal (tens of queries per minute on a small number of small tables), but there are a lot of reporting queries that wind up doing sequential scans of large partitions (millions to tens of millions of rows). We've sized the new hardware so that the most commonly used partitions fit into memory, but if we could speed the queries that touch less frequently used partitions, that would be good. I'm the closest thing our team has to a DBA, which really only means that I'm the one person on the dev team or the ops team to have read all of the postgres docs and wiki and the mailing lists. I claim no actual DBA experience or expertise and have limited cycles to devote to tuning and testing, so if there is an established wisdom for filesystem choice and read ahead tuning, I'd be very interested in hearing it.
ext4 is a very fast file system. It's faster than ext2, but has many more features and has the all-important journaling.
However, for large reporting queries and sequential scans, XFS will win in the long run if you use the online defragmenter. Otherwise, your sequential scans won't be all that sequential on any file system over time if your tables aren't written once, forever, serially. Parallel restore will result in a system that is fragmented -- ext4 will do best at limiting this on the restore, but only xfs has online defragmentation. We schedule ours daily and it noticeably improves sequential scan I/O.
Supposedly, an online defragmenter is in the works for ext4 but it may be years before it's available.
On Tue, Oct 12, 2010 at 9:02 AM, Scott Carey <scott@richrelevance.com> wrote:
> However, for large reporting queries and sequential scans, XFS will win in the long run if you use the online defragmenter. Otherwise, your sequential scans won't be all that sequential on any file system over time if your tables aren't written once, forever, serially. Parallel restore will result in a system that is fragmented -- ext4 will do best at limiting this on the restore, but only xfs has online defragmentation. We schedule ours daily and it noticeably improves sequential scan I/O.
Our reporting tables are written sequentially and left unmodified until entire partitions are dropped. However, equivalent partitions tend to get a little bit larger over time, so newer partitions won't necessarily fit into the gaps left by prior partition drops, so it is possible that partitions will be split into two sections, but should still be very sequential, if not perfectly so. It would seem that we stumbled into an ideal architecture for doing this kind of work - mostly by virtue of starting with 8.2.x and having huge problems with autovacuum and vacuum taking forever and dragging the db to halt, which caused us to move to an architecture which aggregates and then drops older data in entire partitions instead of updating aggregates individually and then deleting rows. Partitions are sized such that most reporting queries run over entire partitions, too (which was completely accidental since I had not yet delved into individual query optimization at the time), so even though we are doing sequential scans, we at least run as few of them as possible and are able to keep hot data in memory.
--sam
sgendler@ideasculptor.com (Samuel Gendler) writes: > Geez. I wish someone would have written something quite so bold as > 'xfs is always faster than ext3' in the standard tuning docs. I > couldn't find anything that made a strong filesystem > recommendation. How does xfs compare to ext4? I wound up on ext4 on > a dell perc6 raid card when an unexpected hardware failure on a > production system caused my test system to get thrown into production > before I could do any serious testing of xfs. If there is a strong > consensus that xfs is simply better, I could afford the downtime to > switch. It's news to me (in this thread!) that XFS is actually "getting some developer love," which is a pretty crucial factor to considering it relevant. XFS was an SGI creation, and, with: a) the not-scintillating performance of the company, b) the lack of a lot of visible work going into the filesystem, c) the paucity of support by Linux vendors (for a long time, if you told RHAT you were having problems, and were using XFS, the next step would be to park the ticket awaiting your installing a "supported filesystem") it didn't look like XFS was a terribly good bet. Those issues were certainly causing concern a couple of years ago. Faster "raw performance" isn't much good if it comes with a risk of: - Losing data - Losing support from vendors If XFS now *is* getting support from both the development and support perspectives, then the above concerns may have been invalidated. It would be very encouraging, if so. -- output = ("cbbrowne" "@" "gmail.com") Rules of the Evil Overlord #228. "If the hero claims he wishes to confess in public or to me personally, I will remind him that a notarized deposition will serve just as well."
> A count with any joins or filter criteria would still have to scan all pages with visible tuples in them. So the visibility map helps speed up scanning of bloated tables, but doesn't provide a magical "fast count" except in the utterly trivial "select count(*) from tablename;" case, and can probably only be used for accurate results when there are no read/write transactions currently open.

select count(*) from tablename where [condition or filter that can use an index] [group by on columns in the index] will also work, I think. Additionally, I think it can work if other open transactions exist, provided they haven't written to the table being scanned. If they have, then only those pages that have been altered and marked in the visibility map need to be cracked open the normal way.

> Even if you kept a count of tuples in each page along with the mvcc transaction ID information required to determine for which transactions that count is valid, it'd only be useful if you didn't have to do any condition checks, and it'd be yet another thing to update with every insert/delete/update.

Yes, lots of drawbacks and added complexity.

> Perhaps for some users that'd be worth having, but it seems to me like it'd have pretty narrow utility. I'm not sure that's the answer.
>
> --
> Craig Ringer
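For what it's worth, the query shape being described above would look something like the following (table, columns and index are hypothetical), where both the filter and the grouping columns are carried by one index:

    CREATE INDEX orders_date_cust_idx ON orders (order_date, customer_id);

    SELECT customer_id, count(*)
    FROM orders
    WHERE order_date >= DATE '2010-01-01'
    GROUP BY customer_id;

Today this still has to visit the heap for visibility; the discussion here is about whether the index plus the visibility map could, in principle, answer it without doing so.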
On 2010-10-12 18:02, Scott Carey wrote:
> However, for large reporting queries and sequential scans, XFS will
> win in the long run if you use the online defragmenter. Otherwise,
> your sequential scans won't be all that sequential on any file system
> over time if your tables aren't written once, forever, serially.
> Parallel restore will result in a system that is fragmented -- ext4
> will do best at limiting this on the restore, but only xfs has online
> defragmentation. We schedule ours daily and it noticeably improves
> sequential scan I/O.
>
> Supposedly, an online defragmenter is in the works for ext4 but it
> may be years before its available.
If some clever postgres hacker could teach postgres to allocate blocks
using posix_fallocate in quite large batches, say .. something like:
fallocate(min(current_relation_size *0.1,1073741824));
So if you have a relation already filling 10GB, then the next "file" for the
relation is just fully allocated on the first byte by the filesystem. That
would ensure that a large table is sitting efficiently on the filesystem level with
a minimum of fragmentation on ext4 (and other FS's supporting posix_fallocate),
and for small systems it would only fill 10% more disk space...
.. last night I spent an hour looking for where it's done but couldn't find the
source file where extension of an existing relation takes place.. can
someone give directions?
--
Jesper
On Oct 12, 2010, at 8:39 AM, Dan Harris wrote:
> On 10/11/10 8:02 PM, Scott Carey wrote:
>> would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer lived postgres DB will get extreme file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously.
>
> We just had a corrupt table caused by an XFS online defrag. I'm scared to use this again while the db is live. Has anyone else found this to be safe? But, I can vouch for the fragmentation issue, it happens very quickly in our system.

What version? I'm using the latest CentOS extras build.

We've been doing online defrag for a while now on a very busy database with > 8TB of data. Not that that means there are no bugs...

It is a relatively simple thing in xfs -- it writes a new file to temp in a way that allocates contiguous space if available, then if the file has not been modified since it was re-written it is essentially moved on top of the other one. This should be safe provided the journaling and storage are safe, etc.

> -Dan
On Oct 12, 2010, at 8:54 AM, <david@lang.hm> wrote:
> On Tue, 12 Oct 2010, Craig Ringer wrote:
>> On 10/12/2010 04:22 PM, david@lang.hm wrote:
>>> from a PR point of view, speeding up the trivil count(*) case could be worth it, just to avoid people complaining about it not being fast.
>>
>> At the cost of a fair bit more complexity, though, and slowing everything else down.
>
> complexity probably, although given how complex the planner is already is this significant?
>
> as far as slowing everything else down, why would it do that? (beyond the simple fact that any new thing the planner can do makes the planner take a little longer)
>
> David Lang

I wouldn't even expect the planner to do more work. An Index Scan can simply avoid going to the tuples for visibility under some circumstances.
On Oct 12, 2010, at 9:46 AM, Scott Carey wrote:
> On Oct 12, 2010, at 8:54 AM, <david@lang.hm> wrote:
>> On Tue, 12 Oct 2010, Craig Ringer wrote:
>>> On 10/12/2010 04:22 PM, david@lang.hm wrote:
>>>> from a PR point of view, speeding up the trivil count(*) case could be worth it, just to avoid people complaining about it not being fast.
>>>
>>> At the cost of a fair bit more complexity, though, and slowing everything else down.
>>
>> complexity probably, although given how complex the planner is already is this significant?
>>
>> as far as slowing everything else down, why would it do that? (beyond the simple fact that any new thing the planner can do makes the planner take a little longer)
>>
>> David Lang
>
> I wouldn't even expect the planner to do more work. An Index Scan can simply avoid going to the tuples for visibility under some circumstances.

Of course, the planner has to .... Otherwise it won't choose the Index Scan over the sequential scan. So the cost of index scans when all the info other than visibility is in the index would need to be lowered.
On 10/12/10 10:44 AM, Scott Carey wrote: > On Oct 12, 2010, at 8:39 AM, Dan Harris wrote: > >> On 10/11/10 8:02 PM, Scott Carey wrote: >>> would give you a 1MB read-ahead. Also, consider XFS and its built-in defragmentation. I have found that a longer livedpostgres DB will get extreme >>> file fragmentation over time and sequential scans end up mostly random. On-line file defrag helps tremendously. >>> >> We just had a corrupt table caused by an XFS online defrag. I'm scared >> to use this again while the db is live. Has anyone else found this to >> be safe? But, I can vouch for the fragmentation issue, it happens very >> quickly in our system. >> > What version? I'm using the latest CentoOS extras build. > > We've been doing online defrag for a while now on a very busy database with> 8TB of data. Not that that means there areno bugs... > > It is a relatively simple thing in xfs -- it writes a new file to temp in a way that allocates contiguous space if available,then if the file has not been modified since it was re-written it is essentially moved on top of the other one. This should be safe provided the journaling and storage is safe, etc. > I'm not sure how to figure out what version of XFS we're on.. but it's Linux kernel 2.6.24-24 x86_64 on Ubuntu Server 8.04.3. Postgres version 8.3 We're due for an upgrade on that server soon so we'll do some more testing once we upgrade. Right now we are just living with the fragmentation. I'm glad to hear the regular on-line defrag is working successfully, at least that gives me hope we can rely on it in the future. -Dan
So I spent a bit of quality time with oprofile this morning, and found once again that there's no substitute for having actual data before theorizing.

Test case software: current Git HEAD (plus one code change explained below), compiled with --enable-debug to support oprofile, cassert off; no other special configure options. Running on current Fedora 13 (gcc 4.4.4 in particular). All postgresql.conf options are out-of-the-box.

Test case hardware: recently purchased mid-grade desktop, dual Xeon E5503 processors (Nehalem cores, 2GHZ), 4GB DDR3-800 RAM, no-name SATA disk.

Test query: "select count(*) from t" where t has 4 nonnull integer columns and 81920000 rows, occupying 3459MB. I chose that size specifically to fit into available RAM, so that on repeated executions no physical I/O will occur.

On this setup I find that "select count(*)" runs in about 7.5sec when the data is fully cached in RAM, for a scanning speed of 460MB/sec. This is well in excess of what the machine's disk hardware can do: bonnie++ rates the machine's disk read speed at 152MB/sec. So in theory PG should be able to completely saturate the disk when processing a table bigger than RAM. In reality the test case run time if I've just flushed cache is about 28sec, working out to a scan rate of 123MB/sec. I expect if I'd bothered to tune the kernel readahead parameters as outlined earlier in this thread, I could get to 150MB/sec.

Now of course this disk setup is far from industrial strength, but the processor isn't what you'd put in a serious database server either (in particular, its available memory bandwidth is well behind the curve). Also, the table is pretty narrow (only 16 payload bytes per row), and any wider test table would show a pretty much linear scaling of achievable scan rate versus table width. So I don't see much support here at all for the notion that we scan slower than available disk bandwidth.

Further details from poking at it with oprofile: in the fully-cached case the CPU time is about 80% Postgres and 20% kernel. That kernel time is of course all to do with moving pages from kernel disk buffers into Postgres shared memory. Although I've not bothered to increase shared_buffers from the default 32MB, it wouldn't matter on this benchmark unless I were able to make shared_buffers hold the entire table ... and even then I'd only save 20%.

oprofile further shows that (with stock Postgres sources) the userspace runtime breaks down like this:

samples  %        symbol name
141267   13.0810  heapgettup_pagemode
 85947    7.9585  advance_aggregates
 83031    7.6885  ExecProject
 78975    7.3129  advance_transition_function
 75060    6.9504  heapgetpage
 73540    6.8096  ExecClearTuple
 69355    6.4221  ExecProcNode
 59288    5.4899  heap_getnext
 57745    5.3470  ExecScan
 55618    5.1501  HeapTupleSatisfiesMVCC
 47057    4.3574  MemoryContextReset
 41904    3.8802  ExecStoreTuple
 37146    3.4396  SeqNext
 32206    2.9822  ExecAgg
 22135    2.0496  XidInMVCCSnapshot
 21142    1.9577  int8inc
 19280    1.7853  AllocSetReset
 18211    1.6863  hash_search_with_hash_value
 16285    1.5079  TransactionIdPrecedes

I also looked at the source-line-level breakdown, though that's too bulky to post here. The most interesting fact here is that tuple visibility testing (MVCC) overhead is simply nonexistent: it'd be in heapgetpage() if it were being done, which it isn't because all the pages of the table have the PageIsAllVisible bit set. In a previous run where those bits weren't set but the per-tuple hint bits were, visibility testing still only ate a percent or two of the runtime.
So the theory some people have espoused in this thread that visibility testing is the bottleneck doesn't hold water either. If you go back and look at previous pgsql-hackers discussions about that, what people have been worried about is not the CPU cost of visibility testing but the need for indexscan queries to visit the heap for no other purpose than to check the visibility flags. In a seqscan it's not going to matter.

I looked a bit more closely at the heapgettup_pagemode timing. The lines shown by opannotate as more than 0.1 percent of the runtime are

 22545  2.2074 :{                         /* heapgettup_pagemode total: 153737 15.0528 */
  5685  0.5566 :    bool        backward = ScanDirectionIsBackward(dir);
  5789  0.5668 :    if (!scan->rs_inited)
  5693  0.5574 :        lineindex = scan->rs_cindex + 1;
 11429  1.1190 :        dp = (Page) BufferGetPage(scan->rs_cbuf);
  5693  0.5574 :        linesleft = lines - lineindex;
  5766  0.5646 :    while (linesleft > 0)
  5129  0.5022 :        lineoff = scan->rs_vistuples[lineindex];
 44461  4.3533 :        tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 11135  1.0903 :        tuple->t_len = ItemIdGetLength(lpp);
  5692  0.5573 :        if (key != NULL)
  5773  0.5653 :            HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
  5674  0.5556 :    scan->rs_cindex = lineindex;
 11406  1.1168 :}

There doesn't seem to be a whole lot of room for improvement there. Maybe we could shave a couple percent with some tenser coding (I'm wondering why HeapKeyTest is being reached, in particular, when there's no WHERE clause). But any local changes here will be marginal at best.

One thing I did find is that the time spent in ExecProject/ExecClearTuple, amounting to nearly 15% of the runtime, is just for evaluating the arguments of the aggregate ... and count(*) hasn't got any arguments. So a patch like this improves the run speed by about 15%:

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a7dafeb..051e70c 100644
*** a/src/backend/executor/nodeAgg.c
--- b/src/backend/executor/nodeAgg.c
*************** advance_aggregates(AggState *aggstate, A
*** 480,486 ****
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		slot = ExecProject(peraggstate->evalproj, NULL);
  
  		if (peraggstate->numSortCols > 0)
  		{
--- 480,489 ----
  		TupleTableSlot *slot;
  
  		/* Evaluate the current input expressions for this aggregate */
! 		if (peraggstate->evalproj)
! 			slot = ExecProject(peraggstate->evalproj, NULL);
! 		else
! 			slot = peraggstate->evalslot;
  
  		if (peraggstate->numSortCols > 0)
  		{
*************** ExecInitAgg(Agg *node, EState *estate, i
*** 1728,1738 ****
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation */
! 		peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 														aggstate->tmpcontext,
! 														peraggstate->evalslot,
! 														NULL);
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of
--- 1731,1744 ----
  		peraggstate->evalslot = ExecInitExtraTupleSlot(estate);
  		ExecSetSlotDescriptor(peraggstate->evalslot, peraggstate->evaldesc);
  
! 		/* Set up projection info for evaluation, if agg has any args */
! 		if (aggrefstate->args)
! 			peraggstate->evalproj = ExecBuildProjectionInfo(aggrefstate->args,
! 															aggstate->tmpcontext,
! 															peraggstate->evalslot,
! 															NULL);
! 		else
! 			peraggstate->evalproj = NULL;
  
  		/*
  		 * If we're doing either DISTINCT or ORDER BY, then we have a list of

bringing the oprofile results to

samples  %        symbol name
181660   17.9017  heapgettup_pagemode
138049   13.6040  advance_transition_function
102865   10.1368  advance_aggregates
 80948    7.9770  ExecProcNode
 79943    7.8780  heap_getnext
 73384    7.2316  ExecScan
 60607    5.9725  MemoryContextReset
 53889    5.3105  ExecStoreTuple
 46666    4.5987  SeqNext
 40535    3.9945  ExecAgg
 33481    3.2994  int8inc
 32202    3.1733  heapgetpage
 26068    2.5689  AllocSetReset
 18493    1.8224  hash_search_with_hash_value
  8679    0.8553  LWLockAcquire
  6615    0.6519  ExecSeqScan
  6583    0.6487  LWLockRelease
  3928    0.3871  hash_any
  3715    0.3661  ReadBuffer_common

(note that this, not the stock code, is what corresponds to the 7.5sec runtime I quoted above --- it's about 8.5sec without that change).

At this point what we've got is 25% of the runtime in nodeAgg.c overhead, and it's difficult to see how to get any real improvement without tackling that. Rather than apply the patch shown above, I'm tempted to think about hard-wiring COUNT(*) as a special case in nodeAgg.c such that we don't go through advance_aggregates/advance_transition_function at all, but just increment a counter directly. However, that would very clearly be optimizing COUNT(*) and nothing else. Given the opinions expressed elsewhere in this thread that heavy reliance on COUNT(*) represents bad application design, I'm not sure that such a patch would meet with general approval. Actually the patch shown above is optimizing COUNT(*) and nothing else, too, since it's hard to conceive of any other zero-argument aggregate.

Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. I don't think any of the previous discussion in this thread is on-point at all, except for the parts where people suggested avoiding it.

regards, tom lane
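For anyone who wants to reproduce a comparable test case, a table like the one described above (four non-null integer columns, 81920000 rows) can be generated with something like the following; the column names and values are of course arbitrary:

    CREATE TABLE t AS
        SELECT i AS c1, i AS c2, i AS c3, i AS c4
        FROM generate_series(1, 81920000) AS s(i);

    -- set hint bits and the all-visible page bits before timing anything
    VACUUM ANALYZE t;

    \timing
    SELECT count(*) FROM t;

Running VACUUM first matters, since (as noted above) the fully-cached timing depends on the PageIsAllVisible bits already being set.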
david@lang.hm wrote:
> On Tue, 12 Oct 2010, Mladen Gogala wrote:
>> david@lang.hm wrote:
>>> from a PR point of view, speeding up the trivil count(*) case could be worth it, just to avoid people complaining about it not being fast.
>>
>> Fixing PR stuff is not the approach that I would take. People are complaining about select count(*) because they're using it in all the wrong places.
>
> that may be the case, but if it's possible to make it less painful it will mean more people use postgres, both because it works better for them when they are using the suboptimal programs, but also because when people do their trivial testing of databases to decide which one they will use, they won't rule out postgres because "it's so slow"

There is no free lunch. If the count field is maintained somewhere, the concurrency will suffer. I find the idea of fixing the "count delusion" ridiculous, may Richard Dawkins forgive me for this pun. Saying that something is slow without testing and a proper consideration is ridiculous. As a DBA, I usually get complaints like "the database is slow today" 3 times before lunch, every day. The database is never slow, the database is a warehouse where you keep your data. What is slow is the access to the data, and that is done by, guess what, the application program. Almost always, it's the application that's slow, not the database.

As for the "select count(*)" idiom, what are you trying to do? Where are you using it? If you are using it for pagination, consider the possibility of not specifying the number of pages on the website, just the links "next -->" and "prev <--". Alternatively, you can fetch a small amount into the web page and direct the users who would like to see the complete information to a background reporting tool. Mixing batch reports and online reports is a very easy thing to do. If you are using it to establish existence, you're doing it wrong.

I've had a problem like that this morning. A developer came to me with the usual phrase that the "database is slow". It was a PHP form which should write an output file and let the user know where the file is. The function looks like this:

function put_xls($sth) {
    global $FNAME;
    $FNAME=$FNAME.".xls";
    $lineno=0;
    $ncols=$sth->FieldCount();
    for ($i = 0;$i <= $ncols;$i++) {
        $cols[$i] = $sth->FetchField($i);
        $colnames[$i]=$cols[$i]->name;
    }
    $workbook = new Spreadsheet_Excel_Writer("/software$FNAME");
    $format_bold =& $workbook->addFormat();
    $format_bold->setBold();
    $format_bold->setAlign('left');
    $format_left =& $workbook->addFormat();
    $format_left->setAlign('left');
    $worksheet = & $workbook->addWorksheet('Moreover Search');
    $worksheet->writeRow($lineno++,0,$colnames,$format_bold);
    while($row=$sth->FetchRow()) {
        $worksheet->writeRow($lineno++,0,$row,$format_left);
    }
    $workbook->close();
    $cnt=$sth->Recordcount();
    return($cnt);
}

The relevant includes are here:

require ('Date.php');
require ('adodb5/tohtml.inc.php');
require_once ('adodb5/adodb.inc.php');
require_once ('adodb5/adodb-exceptions.inc.php');
require_once 'Spreadsheet/Excel/Writer.php';
$ADODB_FETCH_MODE = ADODB_FETCH_NUM;

So, what is the problem here? Why was the "database slow"? As it turns out, the PEAR module for writing Excel spreadsheets, which is the tool used here, creates the entire spreadsheet in memory and writes it out on the "close" command. What was spinning was the "httpd" process; the database was completely and utterly idle, rolling thumbs and awaiting orders.
Using the "fputcsv" instead, made the function fly. The only thing that was lost were the bold column titles. Changing little things can result in the big performance gains. Making "select count(*)" unnaturally fast would be tending to bad programming practices. I am not sure that this is a desirable development. You can't expect people to adjust the database software to your application. Applications are always database specific. Writing an application that will access a PostgreSQL database is not the same as writing an application that will access an Oracle database. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 http://www.vmsinfo.com The Leader in Integrated Media Intelligence Solutions
On 2010-10-12 19:07, Tom Lane wrote:
> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. I don't think any of the previous discussion in this thread is on-point at all, except for the parts where people suggested avoiding it.

I would have to say that although it is nice to get count(*) faster, I think your testing is way too simple. It pretty much proves that in terms of the code involved in the count(*) process there is not much to be achieved. But your table has way too little payload. As PG currently is, it will start by pushing data off to TOAST when the tuple size reaches 1KB, and the speed of count(*) is very much dominated by the amount of "dead weight" it has to draw in together with the heap access for the row when accessing the table. Creating a case where the table is this slim is (in my viewpoint) very much at the extreme on the small side. Just having 32 bytes of "payload" would more or less double your time to count, if I read your test results correctly? .. and in the situation where disk access would be needed .. way more.

Dividing pg_relation_size by the amount of tuples in our production system, I end up having no avg tuple size less than 100 bytes.

.. without having complete insight.. a visibility map that could be used in conjunction with indices would solve that. What the cost would be of maintaining it is also a factor.

Jesper

--
Jesper
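The per-row weight being discussed here is easy to eyeball, assuming reasonably fresh statistics (reltuples is only an estimate maintained by VACUUM/ANALYZE, and the table name below is just a placeholder):

    SELECT relname,
           pg_relation_size(oid::regclass) / greatest(reltuples, 1) AS approx_bytes_per_row
    FROM pg_class
    WHERE relname = 'moreover_documents_y2010m09';

Note this only measures the main heap; out-of-line TOAST data doesn't slow a count(*) scan down, since it isn't read unless the toasted columns are actually referenced.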
Jesper Krogh <jesper@krogh.cc> writes: > On 2010-10-12 19:07, Tom Lane wrote: >> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. > Just having 32 bytes of "payload" would more or less double your time to count, if I read your test results correctly, and in a situation where disk access is needed, far more. > Dividing pg_relation_size by the number of tuples in our production system, I end up with no average tuple size below 100 bytes.

Well, yeah. I deliberately tested with a very narrow table so as to stress the per-row CPU costs as much as possible. With any wider table you're just going to be I/O bound.

> Without having complete insight, a visibility map that could be used in conjunction with indexes would solve that. What the cost of maintaining it would be is also a factor.

I'm less than convinced that that approach will result in a significant win. It's certainly not going to do anything to convert COUNT(*) into an O(1) operation, which frankly is what the complainants are expecting. There's basically no hope of solving the "PR problem" without somehow turning COUNT(*) into a materialized-view reference. We've discussed that in the past, and know how to do it in principle, but the complexity and distributed overhead are daunting. regards, tom lane
> suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language.

Well, suppose you paginate results. If the user sees that the search query returns 500 pages, there are two options:
- you're Google, and your sorting algorithms are so good that the answer the user wants is on the first page
- or the user will refine his search by entering more keywords to get a manageable result set

So, in both cases, the count(*) was useless anyway. And the slowest ones are the most useless, since the user will immediately discard the result and refine his query.

If your full text search is slow, try Xapian or Lucene.
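One way to build "next" and "prev" links without any count at all is to ask for one row more than the page size and only display the page. This is a minimal sketch under assumed names (a hypothetical messages table with a tsvector column body_tsv), not anyone's actual schema from this thread:

-- page size 50, third page: fetch 51 rows
SELECT id, subject, posted_at
FROM messages
WHERE body_tsv @@ plainto_tsquery('slow count')
ORDER BY posted_at DESC
LIMIT 51 OFFSET 100;
-- 51 rows back  => show the first 50 and render a "next" link
-- 50 or fewer   => this is the last page

The extra row costs almost nothing compared to a count(*) over the whole result set, and the query only reads one page of results.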
On Tuesday 12 October 2010 07:19:57 you wrote:
> >> The biggest single problem with "select count(*)" is that it is seriously overused. People use that idiom to establish existence, which usually leads to a performance disaster in the application using it, unless the table has no more than a few hundred records. The SQL language, of which PostgreSQL offers an excellent implementation, has offered the [NOT] EXISTS clause since its inception in the Jurassic era. The problem is with the sequential scan, not with counting. I'd even go as far as to suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language.
> > I agree, I have seen many very bad examples of using count(*). I will go so far as to question the use of count(*) in my examples here. Is there a better way to come up with a page list than using count(*)? What is the best method to make a page of results and a list of links to other pages of results? Am I barking up the wrong tree here?
> One way I have dealt with this on very large tables is to cache the count(*) at the application level (using memcached, terracotta, or something along those lines) and then increment that cache whenever you add a row to the relevant table. On application restart that cache is re-initialized with a regular old count(*). This approach works really well and all large systems in my experience need caching in front of the DB eventually. If you have a simpler system with, say, a single application/web server you can simply store the value in a variable; the specifics would depend on the language and framework you are using.

I use this method whenever possible. I talked about it in my first post. I generally keep a table around I call counts. It has many rows that store count numbers from frequently used views. The one case that I can't do anything about is where you have no control over the WHERE clause (or where there may be simply too many options to count everything ahead of time without making things even slower). That is the point of this entire thread, or was... ;) -Neil-

> Another, more all-DB approach is to create a statistics table into which you place aggregated statistics rows (num deleted, num inserted, totals, etc) at an appropriate time interval in your code. So you have rows containing aggregated statistics information for the past and some tiny portion of the new data happening right now that hasn't yet been aggregated. Queries then look like a summation of the aggregated values in the statistics table plus a count(*) over just the newest portion of the data table and are generally very fast.
> Overall I have found that once things get big the layers of your app stack start to blend together and have to be combined in clever ways to keep speed up. Postgres is a beast but when you run into things it can't do well just find a way to cache it or make it work together with some other persistence tech to handle those cases.
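To make the all-DB variant of that advice concrete, here is a minimal sketch of a trigger-maintained count, using the log table from Neil's tests as the example (the row_counts table and function names are only illustrative). Note that the single counter row becomes a serialization point for all writers, which is exactly the concurrency cost mentioned elsewhere in this thread, and it only helps for reports that are known ahead of time, not for an arbitrary WHERE clause:

CREATE TABLE row_counts (
    table_name text PRIMARY KEY,
    n          bigint NOT NULL
);

CREATE OR REPLACE FUNCTION log_count_trig() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE row_counts SET n = n + 1 WHERE table_name = 'log';
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE row_counts SET n = n - 1 WHERE table_name = 'log';
    END IF;
    RETURN NULL;   -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_count
    AFTER INSERT OR DELETE ON log
    FOR EACH ROW EXECUTE PROCEDURE log_count_trig();

-- seed the counter once:
INSERT INTO row_counts SELECT 'log', count(*) FROM log;

-- then read it instead of running count(*):
SELECT n FROM row_counts WHERE table_name = 'log';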
Pierre C wrote:
> Well, suppose you paginate results. If the user sees that the search query returns 500 pages, there are two options :

With Google, I usually lose patience on page 3. All that I, as an end user, need to know is whether there are more than 10 pages. The fact that there are 1776 pages in the result set is not particularly useful to me. I couldn't care less whether the number of returned pages is 1492, 1776 or 1861; I'm going to look at, at most, the first 5 of them.

> - you're Google, and your sorting algorithms are so good that the answer the user wants is on the first page
> - or the user will refine his search by entering more keywords to get a manageable result set
> So, in both cases, the count(*) was useless anyway. And the slowest ones are the most useless, since the user will immediately discard the result and refine his query.
> If your full text search is slow, try Xapian or Lucene.

May I also recommend Sphinx? It's a very nice text search engine, with the price equal to that of Lucene.

-- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 http://www.vmsinfo.com The Leader in Integrated Media Intelligence Solutions
On Tuesday 12 October 2010 08:39:19 Dan Harris wrote: > On 10/11/10 8:02 PM, Scott Carey wrote: > > would give you a 1MB read-ahead. Also, consider XFS and its built-in > > defragmentation. I have found that a longer lived postgres DB will get > > extreme file fragmentation over time and sequential scans end up mostly > > random. On-line file defrag helps tremendously. > > We just had a corrupt table caused by an XFS online defrag. I'm scared > to use this again while the db is live. Has anyone else found this to > be safe? But, I can vouch for the fragmentation issue, it happens very > quickly in our system. > > -Dan

I would like to know the details of what was going on that caused your problem. I have been using XFS for over 9 years, and it has never caused any trouble at all in a production environment. Sure, I had many problems with it on the test bench, but in most cases the issues were very clear and easy to avoid in production. There were some (older) XFS tools that caused some problems, but that is in the past, and as time goes on, it seems to take less and less planning to make it work properly. -Neil-
On Tuesday 12 October 2010 14:35:01 you wrote:
> >> suggest that 99% of instances of the "select count(*)" idiom are probably bad use of the SQL language.
> Well, suppose you paginate results. If the user sees that the search query returns 500 pages, there are two options :
> - you're Google, and your sorting algorithms are so good that the answer the user wants is on the first page
> - or the user will refine his search by entering more keywords to get a manageable result set
> So, in both cases, the count(*) was useless anyway. And the slowest ones are the most useless, since the user will immediately discard the result and refine his query.
> If your full text search is slow, try Xapian or Lucene.

I guess I have to comment here again and point out that while I am having this issue with text searches, I avoid using count(*) in such cases; I just use next and previous links. The real problem (for me) is when someone searches a date or time range. My application keeps track of huge amounts of realtime transactional data. So an administrator might want a report on what some data point did yesterday between 3 and 4 PM. Under normal conditions the range of records that match can be anywhere from 0 to over 5,000. This is really killing me, especially when the reporting people want a list of how many transactions each point in a given zipcode had this morning between 8 and 9 AM; it takes about 5 minutes to run on a server that has enough RAM to hold the entire table!

Pseudo query: Show how many transactions per node in zipcode 92252 between 8:00 and 9:00 today:

 point_number | number_of_transactions
--------------+------------------------
        65889 |                     31
        34814 |                   4865
        28349 |                      0
         3358 |                    364
 ...

24 total rows, > 5 minutes.

Then they want every node to be a link to a list of the actual data within the specified timeframe. This is where I have to do the same query twice: first to find out how many rows there are for the page links, then again to get a page of results.

Sure, I could keep tables around that have numbers by the hour, minute, day or whatever to cache results for speed; the problem then is that when the data is put into the server, there are so many statistics tables to update that the front end becomes a huge problem. Also, it makes for a huge mess of tables to think about when I need to make a report. -Neil-
On 10/12/10 4:33 PM, Neil Whelchel wrote: > On Tuesday 12 October 2010 08:39:19 Dan Harris wrote: >> On 10/11/10 8:02 PM, Scott Carey wrote: >>> would give you a 1MB read-ahead. Also, consider XFS and its built-in >>> defragmentation. I have found that a longer lived postgres DB will get >>> extreme file fragmentation over time and sequential scans end up mostly >>> random. On-line file defrag helps tremendously. >> We just had a corrupt table caused by an XFS online defrag. I'm scared >> to use this again while the db is live. Has anyone else found this to >> be safe? But, I can vouch for the fragmentation issue, it happens very >> quickly in our system. >> >> -Dan > I would like to know the details of what was going on that caused your > problem. I have been using XFS for over 9 years, and it has never caused any > trouble at all in a production environment. Sure, I had many problems with it > on the test bench, but in most cases the issues were very clear and easy to > avoid in production. There were some (older) XFS tools that caused some > problems, but that is in the past, and as time goes on, it seems take less and > less planning to make it work properly. > -Neil- >

There were roughly 50 transactions/sec going on at the time I ran it. xfs_db reported 99% fragmentation before it ran (we haven't been running it via cron). The operation completed in about 15 minutes (360GB of used data on the file system) with no errors. Everything seemed fine until the next morning, when a user went to query a table and we got a message about a "missing" file inside the pg cluster. We were unable to query the table at all via psql. It was a bit of a panic situation, so we restored that table from backup immediately and the problem was solved without doing more research. This database has been running for years with no problem (and none since). That was the first time I tried to do an on-line defrag, and it was the only unusual variable introduced into the system at that time, so the correlation was strong enough for me to believe it caused the corruption. Hopefully this was just a corner case. -Dan
On Tue, Oct 12, 2010 at 1:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Anyway, if anyone is hot to make COUNT(*) faster, that's where to look. > I don't think any of the previous discussion in this thread is on-point > at all, except for the parts where people suggested avoiding it. I kind of hope that index-only scans help with this, too. If you have a wide table and a narrow (but not partial) index, and if the visibility map bits are mostly set, it ought to be cheaper to read the index than the table - certainly in the case where any disk I/O is involved, and maybe even if it isn't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sunday 10 October 2010 21:15:56 Neil Whelchel wrote: > Right now, I am building a test machine with two dual core Intel processors > and two 15KRPM mirrored hard drives, 1 GB ram. I am using a small amount of > ram because I will be using small test tables. I may do testing in the > future with more ram and bigger tables, but I think I can accomplish what > we are all after with what I have. The machine will be limited to running > the database server in test, init, bash, and ssh, no other processes will > be running except for what is directly involved with testing. I will post > exact specs when I post test results. I will create some test tables, and > the same tables will be used in all tests. Suggestions for optimal > Postgres and system configuration are welcome. I will try any suggested > settings that I have time to test. -Neil-

OK, the test machine is up and running. A few more details: the hard drives are SCSI Ultra-320, the CPUs are 2.8 GHz, 533 MHz FSB. I wanted to make a more memory-cramped machine to keep the table-to-RAM ratio closer to the production machines, but for now all I have are 1GB DDRs, and the machine requires pairs, so total memory is 2GB. Swap is turned off.

The data I will be using is a couple of days of raw data from a production system. The columns of interest are numeric and timestamp. I will use the exact same data for all tests.

                          Table "public.log"
      Column      |            Type             |       Modifiers
------------------+-----------------------------+------------------------
 batch_id         | integer                     |
 t_stamp          | timestamp without time zone | not null default now()
 raw_data         | numeric                     |
 data_value       | numeric                     |
 data_value_delta | numeric                     |
 journal_value    | numeric                     |
 journal_data     | numeric                     |
 machine_id       | integer                     | not null
 group_number     | integer                     |
Indexes:
    "log_idx" btree (group_number, batch_id)
    "log_oid_idx" btree (oid)
    "log_t_stamp" btree (t_stamp)

The initial test is with XFS with write barriers turned on; this makes for very slow writes. The point of the first test is to get a baseline of everything out-of-the-box. So, here are the numbers:

Insert the data into one table:
crash:~# time psql -U test test -q < log.sql
real    679m43.678s
user    1m4.948s
sys     13m1.893s

crash:~# echo 3 > /proc/sys/vm/drop_caches
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    0m11.812s
user    0m0.000s
sys     0m0.004s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    0m3.737s
user    0m0.000s
sys     0m0.000s

As can be seen here, the cache helps. And the numbers are not all that bad, so let's throw a sabot into the gears:

crash:~# time psql -U test test -c "UPDATE log SET raw_data=raw_data+1"
UPDATE 10050886
real    14m13.802s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    3m32.757s
user    0m0.000s
sys     0m0.000s

Just to be sure:
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    2m38.631s
user    0m0.000s
sys     0m0.000s

It looks like the cache knocked about a minute off; still quite sad.
So, I shut down Postgres, ran xfs_fsr, and started Postgres:

crash:~# echo 3 > /proc/sys/vm/drop_caches
crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    1m36.304s
user    0m0.000s
sys     0m0.000s

So it seems that defragmentation knocked another minute off. Let's see how much the cache helps now:

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    1m34.873s
user    0m0.000s
sys     0m0.000s

Not much... And we are a long way from the 3.7 seconds with a freshly inserted table. Maybe the maid can help here.

crash:~# time psql -U test test -c "VACUUM log;"
VACUUM
real    22m31.931s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    1m30.927s
user    0m0.000s
sys     0m0.000s

Nope...

So, possible conclusions are:
1. Even with VACUUM, database table speed degrades as tables are updated.
2. Time testing on a freshly INSERTed table gives results that are not real-world.
3. Filesystem defragmentation helps (some).
4. Cache only makes a small difference once a table has been UPDATEd.

I am going to leave this configuration running for the next day or so. This way I can try any suggestions and play with any more ideas that I have. I will try these same tests on ext4 later, along with any good suggested tests. I will try MySQL with the same data with both XFS and ext4. -Neil-
On 13/10/10 19:47, Neil Whelchel wrote: > > Nope... > So, possible conclusions are: > 1. Even with VACUUM database table speed degrades as tables are updated. > 2. Time testing on a freshly INSERTed table gives results that are not real- > world. > 3. Filesystem defragmentation helps (some). > 4. Cache only makes a small difference once a table has been UPDATEd. > > I am going to leave this configuration running for the next day or so. This > way I can try any suggestions and play with any more ideas that I have. > I will try these same tests on ext4 later, along with any good suggested > tests. > I will try MySQL with the same data with both XFS and ext4. > -Neil-

I think that the major effect you are seeing here is that the UPDATE has made the table twice as big on disk (even after VACUUM etc.), and it has gone from fitting in RAM to not fitting in RAM, so it cannot be effectively cached anymore.

This would not normally happen in real life (assuming UPDATEs only modify a small part of a table per transaction). However, administration updates (e.g. 'oh! - ref 1 should now be ref 2, please update everything') *will* cause the table size to double.

This is an artifact of Postgres's non-overwriting storage manager - MySQL will update in place and you will not see this.

Try VACUUM FULL on the table and retest.

regards Mark
> I guess I have to comment here again and point out that while I am having this issue with text searches, I avoid using count(*) in such cases, I just use next and previous links.

Unfortunately, sometimes you have to do an ORDER BY on search results, and then all the rows have to be read...

> Where the real problem (for me) is that when someone searches a date or time range. My application keeps track of huge

Have you tried CLUSTER?

Also, it is sad to say, but if you need an engine able to use index-only scans that fit this type of query, replicate the table to MyISAM. Unfortunately, the MySQL optimizer is really not so smart about complex reporting queries (no hash joins, no hash aggregates), so if you don't have a multicolumn index covering the query that you can use for an index-only scan, you'll get either a really huge sort or a really nasty nested loop index scan...
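For the time-range reports described above, CLUSTER physically reorders the heap by an index, so a range scan touches far fewer pages. A minimal sketch against the log table and its t_stamp index from the earlier posts (the query only mirrors the "transactions per node per hour" report in spirit; CLUSTER takes an exclusive lock and rewrites the table, so it has to be scheduled around the realtime inserts):

CLUSTER log USING log_t_stamp;
ANALYZE log;

-- subsequent range queries now read mostly contiguous pages:
SELECT machine_id, count(*)
FROM log
WHERE t_stamp >= '2010-10-13 08:00' AND t_stamp < '2010-10-13 09:00'
GROUP BY machine_id;

The ordering is not maintained for rows inserted after the CLUSTER, so on an append-mostly table like this one the benefit persists fairly well for historical ranges.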
On Wednesday 13 October 2010 00:19:26 Mark Kirkwood wrote: > On 13/10/10 19:47, Neil Whelchel wrote: > > Nope... > > So, possible conclusions are: > > 1. Even with VACUUM database table speed degrades as tables are updated. > > 2. Time testing on a freshly INSERTed table gives results that are not real-world. > > 3. Filesystem defragmentation helps (some). > > 4. Cache only makes a small difference once a table has been UPDATEd. > > I am going to leave this configuration running for the next day or so. This way I can try any suggestions and play with any more ideas that I have. I will try these same tests on ext4 later, along with any good suggested tests. I will try MySQL with the same data with both XFS and ext4. > > -Neil- > I think that the major effect you are seeing here is that the UPDATE has made the table twice as big on disk (even after VACUUM etc), and it has gone from fitting in ram to not fitting in ram - so cannot be effectively cached anymore. > This would not normally happen in real life (assuming UPDATEs only modify a small part of a table per transaction). However administration updates (e.g 'oh! - ref 1 should now be ref 2 please update everything') *will* cause the table size to double. > This is an artifact of Postgres's non overwriting storage manager - Mysql will update in place and you will not see this. > Try VACUUM FULL on the table and retest. > regards > Mark

There seems to be a lot of discussion about VACUUM FULL and its problems. The overall buzz seems to be that VACUUM FULL is a bad idea (I could be wrong here). It has been some time since I have read the changelogs, but I seem to remember that there have been some major changes to VACUUM FULL recently. Maybe this needs to be re-visited in the documentation.

crash:~# time psql -U test test -c "VACUUM FULL log;"
VACUUM
real    4m49.055s
user    0m0.000s
sys     0m0.000s

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    0m9.665s
user    0m0.000s
sys     0m0.004s

A huge improvement from the minute and a half before the VACUUM FULL.

crash:~# time psql -U test test -c "SELECT count(*) FROM log;"
  count
----------
 10050886
(1 row)
real    0m3.786s
user    0m0.000s
sys     0m0.000s

And the cache helps...

So, we are right back to about where we started after INSERTing the data, but it took a VACUUM FULL to accomplish this (by making the table fit in RAM). This is a big problem on a production machine, as the VACUUM FULL is likely to get in the way of INSERTing realtime data into the table.

So to add to the conclusion pile:
5. When you have no control over the WHERE clause, count(*) may be sent through more rows of a table than will fit in RAM, and performance will be too slow; count is missing a LIMIT feature to avoid this (a bounded count can be faked with a subquery LIMIT - see the sketch after this post).
6. Keep tables that are to be updated frequently as narrow as possible: link them to wider tables that store the columns that are less frequently updated.

So with our conclusion pile so far we can deduce that if we were to keep all of our data in two-column tables (one to link them together, and the other to store one column of data), we stand a much better chance of making the entire table to be counted fit in RAM, so we simply apply the WHERE clause to a specific table as opposed to a column within a wider table... This seems to defeat the entire goal of the relational database... -Neil-
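The "LIMIT on count" mentioned in conclusion 5 can be approximated today with a subquery LIMIT, which stops the scan as soon as the cap is reached. A minimal sketch against the log table, with a hypothetical filter value and a cap of 1000 matching rows:

SELECT count(*) AS n_capped
FROM (SELECT 1
      FROM log
      WHERE group_number = 42   -- whatever the user's filter happens to be
      LIMIT 1000) AS capped;

-- n_capped = 1000 means "1000 or more"; anything smaller is exact

For page lists this is usually enough: if the cap is hit you can display "more than N pages" instead of an exact total, and the query never chews through terabytes of matches.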
On 10/13/2010 2:47 AM, Neil Whelchel wrote: > Even with VACUUM database table speed degrades What the heck is the "database table speed"? Tables don't do anything. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
On 10/13/2010 3:19 AM, Mark Kirkwood wrote: > I think that major effect you are seeing here is that the UPDATE has > made the table twice as big on disk (even after VACUUM etc), and it has > gone from fitting in ram to not fitting in ram - so cannot be > effectively cached anymore. >

In the real world, tables are larger than the available memory. I have tables of several hundred gigabytes in size. Tables shouldn't be "effectively cached"; the next step would be to measure the "buffer cache hit ratio". Tables should be effectively used. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
On 13/10/10 21:38, Neil Whelchel wrote: > > So with our conclusion pile so far we can deduce that if we were to keep all > of our data in two column tables (one to link them together, and the other to > store one column of data), we stand a much better chance of making the entire > table to be counted fit in RAM, so we simply apply the WHERE clause to a > specific table as opposed to a column within a wider table... This seems to > defeat the entire goal of the relational database... >

That is a bit excessive, I think - a more reasonable conclusion to draw is that tables bigger than RAM will drop to the maximum I/O speed to scan, rather than DIMM max speed...

There are things you can do to radically improve I/O throughput - e.g. a pair of AMC or ARECA 12-slot RAID cards set up as RAID 10 and tuned properly should give you a max sequential throughput of something like 12*100 MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be able to be counted by Postgres in about 3-4 seconds...

This assumes a more capable machine than you are testing on, I suspect.

Cheers Mark
On Wednesday 13 October 2010 01:50:23 Mark Kirkwood wrote: > On 13/10/10 21:38, Neil Whelchel wrote: > > So with our conclusion pile so far we can deduce that if we were to keep all of our data in two column tables (one to link them together, and the other to store one column of data), we stand a much better chance of making the entire table to be counted fit in RAM, so we simply apply the WHERE clause to a specific table as opposed to a column within a wider table... This seems to defeat the entire goal of the relational database... > That is a bit excessive I think - a more reasonable conclusion to draw is that tables bigger than ram will drop to IO max speed to scan, rather than DIMM max speed... > There are things you can do to radically improve IO throughput - e.g a pair of AMC or ARECA 12 slot RAID cards setup RAID 10 and tuned properly should give you a max sequential throughput of something like 12*100 MB/s = 1.2 GB/s. So your example table (estimated at 2GB) should be able to be counted by Postgres in about 3-4 seconds... > This assumes a more capable machine than you are testing on I suspect. > Cheers > Mark

The good ol' brute force approach! I knew I'd see this one sooner or later. Though I was not sure if I was going to see the 16TB of RAM suggestion first. Seriously though, as the title of this thread suggests, everything is relative. Sure, count(*) and everything else will work faster with more system power. It just seems to me that count(*) is slower than it could be given a set of conditions. I started this thread because I think that there must be a better way to count matches from an INDEXed column than shoving the entire table through RAM (including columns that you are not interested in at the minute). And even worse, when you have no (reasonable) control of the WHERE clause, there is nothing preventing your system from thrashing for the next week because somebody put in criteria that matched a few TB of records, and there is no way to LIMIT count(*) other than externally timing the query and aborting it if it takes too long. What is needed is a way to determine how many rows are likely to match a given WHERE clause so we can cut off useless queries, but we need a fast count(*) for that, or a limit on the existing one... I seem to remember saying something about an index driven estimate(*) at one point...

I might go as far as to rattle the cage of the developers to see if it makes any sense to add some column oriented storage capability to Postgres. That would be the hot ticket: to be able to specify an attribute on a column so that the back end could shadow or store a column in a column oriented table so aggregate functions could work on them with good efficiency, or is that an INDEX?

Since the thread started, I have had people ask about different system configurations, especially the filesystem (XFS, ext4...). I have never tested ext4, and since we are all involved here, I thought that I could do so and share my results for others; that is why I got into time testing stuff. Time testing count(*) in my later postings is really not the point, as count is simply dragging the entire table off of the RAID through RAM. I could use any other function, like max()... No, that can narrow down its scan with an INDEX... OK, sum(), there we go! -Neil-
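An approximation of the estimate(*) Neil is asking for can be had today by asking the planner. This is a sketch (not an official function) that reads the row estimate off the first line of EXPLAIN output; it works for an arbitrary WHERE clause, but it is only as good as the statistics that ANALYZE has gathered, and it can be badly off for complex predicates:

CREATE OR REPLACE FUNCTION estimate_count(query text) RETURNS bigint AS $$
DECLARE
    rec  record;
    rows bigint;
BEGIN
    -- top plan line looks like:
    -- "Seq Scan on log  (cost=0.00..270425.00 rows=10050886 width=4)"
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql STRICT;

-- example (hypothetical filter value):
-- SELECT estimate_count('SELECT * FROM log WHERE group_number = 42');

For building a page list this is usually good enough, since it errs in the way Neil says he can live with: the last page link may be slightly off, but the answer comes back in milliseconds.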
12.10.10 14:44, Craig Ringer написав(ла): > >> in the case where you are doing a count(*) where query and the where is >> on an indexed column, could the search just look at the index + the >> visibility mapping rather than doing an sequential search through the >> table? > > Nope, because the visibility map, which is IIRC only one bit per page, > doesn't record how many tuples there are on the page, or enough > information about them to determine how many of them are visible to > the current transaction*.

I'd say it can tell you that you don't need to recheck a given tuple, can't it? You would still have to count all the index tuples and recheck only the ones that are uncertain. Does it work this way? This could help a lot for wide tuples in the table, but with a narrow index and mostly read-only data.

Best regards, Vitalii Tymchyshyn
12.10.10 21:58, Tom Lane написав(ла): > > I'm less than convinced that that approach will result in a significant > win. It's certainly not going to do anything to convert COUNT(*) into > an O(1) operation, which frankly is what the complainants are expecting. > There's basically no hope of solving the "PR problem" without somehow > turning COUNT(*) into a materialized-view reference. We've discussed > that in the past, and know how to do it in principle, but the complexity > and distributed overhead are daunting. > >

I've thought about "aggregate" indexes, something like:

create index index_name on table_name(count(*) group by column1, column2);
OR
create index index_name on table_name(count(*));

for a table-wide count.

To make it usable one would need:
1) a third aggregate support function, SMERGE, that can merge one aggregate state into another
2) the index should be a regular index (e.g. btree) on column1, column2 that, for each pair, has a list of the pages its data may belong to (in past/current running transactions), plus an aggregate state for each page that was frozen previously

When the index is used, it can use precalculated values for "pages with all tuples vacuumed" (I suspect this is information from the visibility map) and should do the regular calculation, with visibility checks and everything that's needed, for all non-frozen pages. When vacuum processes a page, it should (in a sync or async way) calculate the aggregate values for the page.

IMHO such indexes would make materialized views/triggers/high-level caches unneeded in most cases.

Best regards, Vitalii Tymchyshyn
On 13/10/2010 12:38 AM, Jesper Krogh wrote: > If some clever postgres hacker could teach postgres to allocate blocks > using posix_fallocate in quite large batches, say .. something like: > fallocate(min(current_relation_size *0.1,1073741824))

There doesn't seem to be any use of posix_fallocate in the sources, at least according to git grep. The patch that introduced posix_fadvise use apparently had posix_fallocate in it, but that use appears to have been removed down the track.

It's worth noting that posix_fallocate sucks if your file system doesn't have intelligent support for it. IIRC it's horrible on ext3, where it can take a while to return while it allocates (and IIRC zeroes!) all those blocks. This may be part of why it's not used. In past testing with posix_fallocate for other tools I've also found rather mixed performance results - it can slow things down rather than speed them up, depending on the file system in use and all sorts of other factors.

If Pg was to use posix_fallocate, it'd probably need control over it on a per-tablespace basis. -- Craig Ringer Tech-related writing at http://soapyfrogs.blogspot.com/
Dan Harris wrote: > I'm not sure how to figure out what version of XFS we're on.. but it's > Linux kernel 2.6.24-24 x86_64 on Ubuntu Server 8.04.3. Postgres > version 8.3 There's the kernel side support that matches your kernel, as well as the xfsprogs package. The latter is where a lot of distributions haven't kept up with upstream changes, and where I suspect the defragmenter bug you ran into is located at. Hardy ships with 2.9.4-2: http://packages.ubuntu.com/hardy/xfsprogs The work incorporating a more stable XFS into RHEL started with xfsprogs 3.0.1-6 going into Fedora 11, and 3.1.X would represent a current release. So your Ubuntu kernel is two major improvement releases behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really got going again making that code modern and solid. Ubuntu Lucid switched to 3.1.0, RHEL6 will probably ship 3.1.0 too. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On 10/13/2010 8:12 AM, Greg Smith wrote: > The work incorporating a more stable XFS into RHEL started with xfsprogs > 3.0.1-6 going into Fedora 11, and 3.1.X would represent a current > release. So your Ubuntu kernel is two major improvement releases > behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really > got going again making that code modern and solid. Ubuntu Lucid > switched to 3.1.0, RHEL6 will probably ship 3.1.0 too. > I am afraid that my management will not let me use anything that doesn't exist as a RPM package in the current Red Hat distribution. No Ubuntu, no Fedora, no manual linking. There will always be that ominous question: how many other companies are using XFS? From the business perspective, questions like that make perfect sense. -- Mladen Gogala Sr. Oracle DBA 1500 Broadway New York, NY 10036 (212) 329-5251 www.vmsinfo.com
Mladen Gogala wrote: > I am afraid that my management will not let me use anything that > doesn't exist as a RPM package in the current Red Hat distribution. No > Ubuntu, no Fedora, no manual linking. There will always be that > ominous question: how many other companies are using XFS? From the > business perspective, questions like that make perfect sense. XFS support is available as an optional module starting in RHEL 5.5. In CentOS, you just grab it, so that's what I've been doing. My understanding is that you may have to ask your sales rep to enable access to it under the official RedHat Network channels if you're using a subscription from them. I'm not sure exactly what the support situation is with it, but it's definitely available as an RPM from RedHat. -- Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, "PostgreSQL 9.0 High Performance" Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
On Wed, Oct 13, 2010 at 4:38 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > There seems to be allot of discussion about VACUUM FULL, and its problems. The > overall buzz seems to be that VACUUM FULL is a bad idea (I could be wrong > here). It has been some time since I have read the changelogs, but I seem to > remember that there have been some major changes to VACUUM FULL recently. > Maybe this needs to be re-visited in the documentation. In 9.0, VACUUM FULL does something similar to what CLUSTER does. This is a much better idea than what it did in 8.4 and prior. > crash:~# time psql -U test test -c "VACUUM FULL log;" > VACUUM > > real 4m49.055s > user 0m0.000s > sys 0m0.000s > > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > count > ---------- > 10050886 > (1 row) > > real 0m9.665s > user 0m0.000s > sys 0m0.004s > > A huge improvement from the minute and a half before the VACUUM FULL. This is a very surprising result that I would like to understand better. Let's assume that your UPDATE statement bloated the table by 2x (you could use pg_relation_size to find out exactly; the details probably depend on fillfactor which you might want to lower if you're going to do lots of updates). That ought to mean that count(*) has to grovel through twice as much data, so instead of taking 9 seconds it ought to take 18 seconds. Where the heck is the other 1:12 going? This might sort of make sense if the original table was laid out sequentially on disk and the updated table was not, but how and why would that happen? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
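A quick way to capture the bloat numbers Robert is asking about is to run the size functions before and after the UPDATE; the fillfactor change is the knob he mentions for tables that get whole-table UPDATEs (the value 70 below is only an illustration, not a recommendation):

SELECT pg_size_pretty(pg_relation_size('log'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('log')) AS heap_plus_indexes_and_toast;

-- leave some free space in each page so updated row versions can stay
-- on the same page instead of extending the table:
ALTER TABLE log SET (fillfactor = 70);

The lower fillfactor only pays off for tables that are rewritten in bulk; for append-mostly logging tables it just wastes space.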
On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > I might go as far as to rattle the cage of the developers to see if it makes > any sense to add some column oriented storage capability to Postgres. That > would be the hot ticket to be able to specify an attribute on a column so that > the back end could shadow or store a column in a column oriented table so > aggregate functions could work on them with good efficiency, or is that an > INDEX? I'd love to work on that, but without funding it's tough to find the time. It's a big project. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Neil Whelchel <neil.whelchel@gmail.com> wrote: > crash:~# time psql -U test test -c "UPDATE log SET > raw_data=raw_data+1" > UPDATE 10050886 > > real 14m13.802s > user 0m0.000s > sys 0m0.000s > > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > count > ---------- > 10050886 > (1 row) > > real 3m32.757s > user 0m0.000s > sys 0m0.000s > > Just to be sure: > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > count > ---------- > 10050886 > (1 row) > > real 2m38.631s > user 0m0.000s > sys 0m0.000s > > It looks like cache knocked about a minute off That's unlikely to be caching, since you just updated the rows. It's much more likely to be one or both of rewriting the rows as you read them to set hint bits or competing with autovacuum. The large increase after the update probably means you went from a table which was fully cached to something larger than the total cache. -Kevin
Neil Whelchel <neil.whelchel@gmail.com> writes: > Insert the data into one table: > crash:~# time psql -U test test -q < log.sql > real 679m43.678s > user 1m4.948s > sys 13m1.893s > crash:~# echo 3 > /proc/sys/vm/drop_caches > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > count > ---------- > 10050886 > (1 row) > real 0m11.812s > user 0m0.000s > sys 0m0.004s > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > count > ---------- > 10050886 > (1 row) > real 0m3.737s > user 0m0.000s > sys 0m0.000s > As can be seen here, the cache helps.. That's probably got little to do with caching and everything to do with setting hint bits on the first SELECT pass. I concur with Mark's question about whether your UPDATE pushed the table size across the limit of what would fit in RAM. regards, tom lane
On 2010-10-13 15:28, Robert Haas wrote: > On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel<neil.whelchel@gmail.com> wrote: > >> I might go as far as to rattle the cage of the developers to see if it makes >> any sense to add some column oriented storage capability to Postgres. That >> would be the hot ticket to be able to specify an attribute on a column so that >> the back end could shadow or store a column in a column oriented table so >> aggregate functions could work on them with good efficiency, or is that an >> INDEX? >> > I'd love to work on that, but without funding it's tough to find the > time. It's a big project. >

Is it hugely different from just getting the visibility map suitable for doing index-only scans and extracting values from the index directly, as Heikki has explained? That would essentially give a column-oriented table (the index itself) for a specific column (or column set). ... still a huge task though. -- Jesper
On Wed, Oct 13, 2010 at 07:49, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Neil Whelchel <neil.whelchel@gmail.com> writes: > I concur with Mark's question about whether your UPDATE pushed the table > size across the limit of what would fit in RAM.

Yeah, you said you have ~2GB of ram; just counting the bytes and the number of rows (not including padding or overhead) puts you around ~670MB.

Some quick testing here on a 64 bit box:

=> create table log (batch_id int, t_stamp timestamp without time zone not null default now(), raw_data numeric, data_value numeric, data_value_delta numeric, journal_value numeric, journal_data numeric, machine_id integer not null, group_number integer) with oids;
CREATE TABLE
Time: 34.310 ms

=> insert into log (batch_id, data_value, data_value_delta, journal_value, journal_data, group_number, machine_id, raw_data) select 1, 1, 1, 1, 1, 1, 1, 1 from generate_series(1, 10050886);
INSERT 0 10050886
Time: 32818.529 ms

=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 969 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 65805.741 ms

=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1939 MB

=> SELECT count(*) from log;
  count
----------
 10050886
(1 row)
Time: 11181.005 ms

=> SELECT count(*) from log;
  count
----------
 10050886
(1 row)
Time: 2825.569 ms

This box has ~6GB ram.

BTW did anyone else hear the below in a Valeris voice?
> And the numbers are not all that bad, so let's throw a sabot into the gears:
> crash:~# time psql -U test test -c "UPDATE log SET raw_data=raw_data+1"
On Wed, Oct 13, 2010 at 02:38, Neil Whelchel <neil.whelchel@gmail.com> wrote: > And the cache helps... > So, we are right back to within 10ms of where we started after INSERTing the > data, but it took a VACUUM FULL to accomplish this (by making the table fit in > RAM). > This is a big problem on a production machine as the VACUUM FULL is likely to > get in the way of INSERTing realtime data into the table.

Right, but the real point is: how often do you plan on mass updating the table? That's (hopefully) the only time a vacuum full should be needed. Otherwise (auto)vacuum will probably work most of the time.

> 6. Keep tables that are to be updated frequently as narrow as possible: Link > them to wider tables to store the columns that are less frequently updated.

Again, I don't think it's updated frequently so much as mass updated. I run some databases here that have tens to hundreds of updates every second. The difference is I don't update *all* 26 million rows at the same time that often. But if I did, I'd probably want to lower the fillfactor.

For example:

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 59387.021 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1939 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 70549.425 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 2909 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 78551.544 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 3879 MB

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 74443.945 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

Here you see basically linear growth; after some vacuuming:

=> VACUUM log;
VACUUM
Time: 193055.857 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

=> VACUUM log;
VACUUM
Time: 38281.541 ms
whopper=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

=> VACUUM log;
VACUUM
Time: 28.531 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 4848 MB

Hey... it's not shrinking it at all...:

=> VACUUM verbose log;
INFO:  vacuuming "public.log"
INFO:  "log": found 0 removable, 0 nonremovable row versions in 31 out of 620425 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 2511 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.01u sec elapsed 0.01 sec.
INFO:  vacuuming "pg_toast.pg_toast_10544753"
INFO:  index "pg_toast_10544753_index" now contains 0 row versions in 1 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_10544753": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM
Time: 29.070 ms

-- ok, let's start over and this time set fillfactor to 50
=> alter table log set (fillfactor = 50);
=> vacuum full log;
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB

-- 2x the default size; let's see what an update does now
=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 70424.752 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB
-- hey! same size

=> update log set raw_data = raw_data+1;
UPDATE 10050886
Time: 58112.895 ms
=> SELECT pg_size_pretty(pg_total_relation_size('log'));
 pg_size_pretty
----------------
 1963 MB
(1 row)
-- still the same

So in short... vacuum seems to fall flat on its face with mass updates; set a lower fillfactor :).

> So with our conclusion pile so far we can deduce that if we were to keep all > of our data in two column tables (one to link them together, and the other to > store one column of data), we stand a much better chance of making the entire > table to be counted fit in RAM,

I dunno about that... Seems like if you only had 2 tables, both would fail to fit in RAM fairly quickly :)

> so we simply apply the WHERE clause to a > specific table as opposed to a column within a wider table... This seems to > defeat the entire goal of the relational database...

Sure... that's one answer. See http://wiki.postgresql.org/wiki/Slow_Counting for more. But the basic ideas are:
1) estimate the count
2) use triggers and keep the count somewhere else
3) keep it in ram
On Wed, 13 Oct 2010, Tom Lane wrote: > Neil Whelchel <neil.whelchel@gmail.com> writes: > > That's probably got little to do with caching and everything to do with > setting hint bits on the first SELECT pass. > > I concur with Mark's question about whether your UPDATE pushed the table > size across the limit of what would fit in RAM. Neil, can you just double the size of your initial test to make sure that it's too large to fit in ram to start with? David Lang
On Wednesday 13 October 2010 05:33:28 Mladen Gogala wrote: > On 10/13/2010 8:12 AM, Greg Smith wrote: > > The work incorporating a more stable XFS into RHEL started with xfsprogs > > 3.0.1-6 going into Fedora 11, and 3.1.X would represent a current > > release. So your Ubuntu kernel is two major improvement releases > > behind, 3.0 and 3.1 were the upgrades to xfsprogs where things really > > got going again making that code modern and solid. Ubuntu Lucid > > switched to 3.1.0, RHEL6 will probably ship 3.1.0 too. > > I am afraid that my management will not let me use anything that doesn't > exist as a RPM package in the current Red Hat distribution. No Ubuntu, > no Fedora, no manual linking. There will always be that ominous > question: how many other companies are using XFS? From the business > perspective, questions like that make perfect sense. XFS sees extensive use in the billing departments of many phone and utility companies. Maybe not the code that you see in Linux, but the on-disk format, which I think is unchanged since its original release. (You can use the modern XFS code in Linux to mount a filesystem from an older SGI machine that used XFS.) The code in Linux is based on the code that SGI released some time in 2000, which worked at that time very well for the SGI machine. At the time that SGI came up with XFS, they had realtime in mind. They added specific features to the filesystem to guarantee IO at a specific rate, this was intended for database and other realtime applications. I have not looked at the Linux version to see if it contains these extensions. I will be doing this soon, however as my next big project will require a true realtime system. -Neil-
On Wednesday 13 October 2010 06:27:34 you wrote: > On Wed, Oct 13, 2010 at 4:38 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > > There seems to be allot of discussion about VACUUM FULL, and its > > problems. The overall buzz seems to be that VACUUM FULL is a bad idea (I > > could be wrong here). It has been some time since I have read the > > changelogs, but I seem to remember that there have been some major > > changes to VACUUM FULL recently. Maybe this needs to be re-visited in > > the documentation. > > In 9.0, VACUUM FULL does something similar to what CLUSTER does. This > is a much better idea than what it did in 8.4 and prior. > > > crash:~# time psql -U test test -c "VACUUM FULL log;" > > VACUUM > > > > real 4m49.055s > > user 0m0.000s > > sys 0m0.000s > > > > crash:~# time psql -U test test -c "SELECT count(*) FROM log;" > > count > > ---------- > > 10050886 > > (1 row) > > > > real 0m9.665s > > user 0m0.000s > > sys 0m0.004s > > > > A huge improvement from the minute and a half before the VACUUM FULL. > > This is a very surprising result that I would like to understand > better. Let's assume that your UPDATE statement bloated the table by > 2x (you could use pg_relation_size to find out exactly; the details > probably depend on fillfactor which you might want to lower if you're > going to do lots of updates). That ought to mean that count(*) has to > grovel through twice as much data, so instead of taking 9 seconds it > ought to take 18 seconds. Where the heck is the other 1:12 going? > This might sort of make sense if the original table was laid out > sequentially on disk and the updated table was not, but how and why > would that happen? This is likely due to the table not fitting in memory before the VACUUM FULL. I am glad that you suggested using pg_relation_size, I somehow didn't think of it at the time. I will redo the test and publish the results of pg_relation_size. -Neil-
On 13/10/10 21:44, Mladen Gogala wrote: > On 10/13/2010 3:19 AM, Mark Kirkwood wrote: >> I think that major effect you are seeing here is that the UPDATE has >> made the table twice as big on disk (even after VACUUM etc), and it has >> gone from fitting in ram to not fitting in ram - so cannot be >> effectively cached anymore. >> > In the real world, tables are larger than the available memory. I have > tables of several hundred gigabytes in size. Tables shouldn't be > "effectively cached", the next step would be to measure "buffer cache > hit ratio", tables should be effectively used. > Sorry Mladen, I didn't mean to suggest that all tables should fit into ram... but was pointing out (one reason) why Neil would expect to see a different sequential scan speed after the UPDATE. I agree that in many interesting cases, tables are bigger than ram [1]. Cheers Mark [1] Having said that, these days 64GB of ram is not unusual for a server... and we have many real customer databases smaller than this where I work.
On 13/10/10 23:16, Neil Whelchel wrote: > > The good ol' brute force approach! I knew I'd see this one sooner or later. > Though I was not sure if I was going to see the 16TB of RAM suggestion first. > Seriously though, as the title of this thread suggests, everything is > relative. Sure count(*) and everything else will work faster with more system > power. It just seems to me that count(*) is slower than it could be given a > set of conditions.... > > Since the thread has started, I have had people ask about different system > configurations, especially the filesystem (XFS, ext4...). I have never tested > ext4, and since we are all involved here, I thought that I could do so and > share my results for others, that is why I got into time testing stuff. > Time testing count(*) in my later postings is really not the point as count is > simply dragging the entire table off of the RAID through RAM, I can use any > other function like max()... No that can narrow down its scan with an INDEX... > Ok, sum(), there we go! > >

Well, in some (quite common) use cases the queries cannot be known in advance, and the tables are considerably bigger than RAM... this makes fast I/O a good option - sometimes better (and in the end cheaper) than trying to maintain every conceivable covering index. Of course it would be great if Postgres could use the indexes alone to execute certain queries - we may see some of that capability in the next few releases (keep an eye on messages concerning the 'Visibility Map').

regards Mark
On Wed, Oct 13, 2010 at 1:59 PM, Jesper Krogh <jesper@krogh.cc> wrote: > On 2010-10-13 15:28, Robert Haas wrote: >> On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel<neil.whelchel@gmail.com> >> wrote: >>> I might go as far as to rattle the cage of the developers to see if it >>> makes any sense to add some column oriented storage capability to Postgres. >>> That would be the hot ticket to be able to specify an attribute on a column so >>> that the back end could shadow or store a column in a column oriented table so >>> aggregate functions could work on them with good efficiency, or is that >>> an INDEX? >> I'd love to work on that, but without funding it's tough to find the >> time. It's a big project. > Is it hugely different from just getting the visibility map suitable > for doing index-only scans and extracting values from the index > directly as Heikki has explained?

I think that there's a lot more to a real column-oriented database than index-only scans, although, of course, index-only scans are very important. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Could this be an interesting test use of https://www.fossexperts.com/ ? 'Community' driven proposal - multiple people / orgs agree to pay various portions? Maybe with multiple funders a reasonable target fund amount could be reached. Just throwing around ideas here. Mark -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Robert Haas Sent: Wednesday, October 13, 2010 7:29 AM To: Neil Whelchel Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Slow count(*) again... On Wed, Oct 13, 2010 at 6:16 AM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > I might go as far as to rattle the cage of the developers to see if it makes > any sense to add some column oriented storage capability to Postgres. That > would be the hot ticket to be able to specify an attribute on a column so that > the back end could shadow or store a column in a column oriented table so > aggregate functions could work on them with good efficiency, or is that an > INDEX? I'd love to work on that, but without funding it's tough to find the time. It's a big project. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
On 2010-10-14 06:22, mark wrote: > Could this be an interesting test use of https://www.fossexperts.com/ ? > > 'Community' driven proposal - multiple people / orgs agree to pay various > portions? Maybe with multiple funders a reasonable target fund amount could > be reached. > I might convince my boss to chip in... but how do we get the task up there.. should we find one to give an estimate first? -- Jesper
On Thu, Oct 14, 2010 at 12:22 AM, mark <dvlhntr@gmail.com> wrote: > Could this be an interesting test use of https://www.fossexperts.com/ ? > > 'Community' driven proposal - multiple people / orgs agree to pay various > portions? Maybe with multiple funders a reasonable target fund amount could > be reached. > > Just throwing around ideas here. This is a bit off-topic, but as of now, they're only accepting proposals for projects to be performed by CommandPrompt itself. So that doesn't help me much (note the sig). But in theory it's a good idea. Of course, when and if they open it up, then what? If more than one developer or company is interested in a project, who determines who gets to do the work and get paid for it? If that determination is made by CommandPrompt itself, or if it's just a free-for-all to see who can get their name on the patch that ends up being committed, it's going to be hard to get other people/companies to take it very seriously. Another problem is that even when they do open it up, they apparently intend to charge 7.5 - 15% of the contract value as a finder's fee. That's a lot of money. For a $100 project it's totally reasonable, but for a $10,000 project it's far more expensive than the value of the service they're providing can justify. (Let's not even talk about a $100,000 project.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2010-10-14 21:56, Robert Haas wrote: > On Thu, Oct 14, 2010 at 12:22 AM, mark<dvlhntr@gmail.com> wrote: >> Could this be an interesting test use of https://www.fossexperts.com/ ? >> 'Community' driven proposal - multiple people / orgs agree to pay various >> portions? Maybe with multiple funders a reasonable target fund amount could >> be reached. >> Just throwing around ideas here. > This is a bit off-topic, but as of now, they're only accepting > proposals for projects to be performed by CommandPrompt itself. So > that doesn't help me much (note the sig). > But in theory it's a good idea. Of course, when and if they open it > up, then what? If more than one developer or company is interested in > a project, who determines who gets to do the work and get paid for it? > If that determination is made by CommandPrompt itself, or if it's > just a free-for-all to see who can get their name on the patch that > ends up being committed, it's going to be hard to get other > people/companies to take it very seriously.

Couldn't you open up a dialog about it?

> Another problem is that even when they do open it up, they apparently > intend to charge 7.5 - 15% of the contract value as a finder's fee. > That's a lot of money. For a $100 project it's totally reasonable, > but for a $10,000 project it's far more expensive than the value of > the service they're providing can justify. (Let's not even talk about > a $100,000 project.)

Hi Robert. I can definitely see your arguments, but you haven't described a "better" way. Many of us rely heavily on PostgreSQL and would like to get "this feature", but sponsoring it all alone does not seem like a viable option (just a guess), considering we don't even have an estimate of how big it is; but I saw the estimate of 15K USD on the "ALTER column position" description, and the visibility map is most likely in the "same ballpark" (from my perspective).

So in order to get something like a visibility map (insert your favorite big feature here), you have these options:

* Sponsor it all by yourself (where it's most likely going to be too big; or, if it is the center of your applications, you turn to an RDBMS that has supported it for a long time, if you can).
* Wait for someone else to sponsor it all by themselves (that happens occasionally, but for a particular feature it is hard to see when and what, and the actual sponsor would still face the dilemma in the first point).
* Hack it yourselves (many of us don't have the time nor the skills to do it, and my employer actually wants me to focus on the stuff that brings the most direct value for my time, which is not a category hacking PG falls into when the business is about something totally else).
* A kind of micro-sponsoring like the above?
* Your proposal here?

To me, the 4th bullet point looks like the most viable so far. To be honest, whether it is EDB, Redpill, Command Prompt, 2ndQuadrant or whoever that ends up doing the job is, seen from this perspective, not important; just that it ends up in the hands of someone "capable" of doing it. ... although Heikki has done some work on this task already.

Preferably I would like to get it coordinated by the PG project itself. But I can see that it is really hard to do that kind of stuff. And you would still face the challenge of who should end up doing the thing.

Jesper .. dropped Joshua Drake on CC, he might have given all of this some seconds of thought already. -- Jesper
On Wed, 2010-10-13 at 09:02 -0400, Greg Smith wrote:
> XFS support is available as an optional module starting in RHEL 5.5.
> In CentOS, you just grab it, so that's what I've been doing. My
> understanding is that you may have to ask your sales rep to enable
> access to it under the official RedHat Network channels if you're
> using a subscription from them. I'm not sure exactly what the support
> situation is with it, but it's definitely available as an RPM from
> RedHat.

Right. It is called "Red Hat Scalable File System", and once paid, it is available via RHN.

--
Devrim GÜNDÜZ
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
PostgreSQL RPM Repository: http://yum.pgrpms.org
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz
Neil Whelchel wrote:
> That is why I suggested an estimate(*) that works like (a faster) count(*)
> except that it may be off a bit. I think that is what he was talking about
> when he wrote this.

The main problem with "select count(*)" is that it gets seriously mis-used. Using "select count(*)" to establish existence is bad for performance and for code readability.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
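[To make the existence-check point concrete, here is a small sketch; the table and column names (orders, customer_id) are invented for illustration. The count(*) form has to visit every matching row just to produce a number that gets compared against zero, while EXISTS lets the executor stop at the first matching row.]

-- Anti-pattern: counts every matching row just to test existence
SELECT count(*) > 0 AS found
FROM orders
WHERE customer_id = 42;

-- Better: EXISTS can stop as soon as one matching row is found
SELECT EXISTS (
    SELECT 1 FROM orders WHERE customer_id = 42
) AS found;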
Jon Nelson wrote:
> Well, I didn't quite mean that - having no familiarity with Oracle I
> don't know what the alter system statement does, but I was talking
> specifically about the linux buffer and page cache.

Those are not utilized by Oracle. This is a RAC instance, running on top of ASM, which is an Oracle volume manager, using raw devices. There is no file system on those disks:

SQL> select file_name from dba_data_files
  2  where tablespace_name='ADBASE_DATA';

FILE_NAME
--------------------------------------------------------------------------------
+DGDATA/stag3/datafile/adbase_data.262.727278257
+DGDATA/stag3/datafile/adbase_data.263.727278741
+DGDATA/stag3/datafile/adbase_data.264.727280145
+DGDATA/stag3/datafile/adbase_data.265.727280683

[oracle@lpo-oracle-30 ~]$ $ORA_CRS_HOME/bin/crs_stat -l
NAME=ora.STAG3.STAG31.inst                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.STAG3.STAG32.inst                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.STAG3.db                                   TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.lpo-oracle-30.ASM1.asm                     TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.lpo-oracle-30.LISTENER_LPO-ORACLE-30.lsnr  TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.lpo-oracle-30.gsd                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.lpo-oracle-30.ons                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.lpo-oracle-30.vip                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-30
NAME=ora.lpo-oracle-31.ASM2.asm                     TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.lpo-oracle-31.LISTENER_LPO-ORACLE-31.lsnr  TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.lpo-oracle-31.gsd                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.lpo-oracle-31.ons                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31
NAME=ora.lpo-oracle-31.vip                          TYPE=application  TARGET=ONLINE  STATE=ONLINE on lpo-oracle-31

The only way to flush cache is the aforementioned "alter system" command. AFAIK, Postgres doesn't have anything like that. Oracle uses raw devices precisely to avoid double buffering.

--
Mladen Gogala
Sr. Oracle DBA
1500 Broadway
New York, NY 10036
(212) 329-5251
http://www.vmsinfo.com
The Leader in Integrated Media Intelligence Solutions
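[For readers who haven't used Oracle: the command being referred to is, as far as I know, the following one, available in Oracle 10g and later. PostgreSQL has no equivalent; the usual workaround is to restart the server and drop the OS page cache.]

-- Oracle only; presumably what "alter system" refers to above
ALTER SYSTEM FLUSH BUFFER_CACHE;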
Jesper Krogh wrote:
> To be honest, if it is EDB, Redpill, Command Prompt, 2nd Quadrant or
> whoever end up doing the job is, seen from this perspective not
> important, just it ends in the hands of someone "capable" of doing
> it. ... although Heikki has done some work on this task already.

Now you're closing in on why this is a touchy subject. Heikki has already done work here funded by EDB. As such, the idea of anyone else being put in charge of fund raising and allocation for this particular feature would be a political mess. While it would be nice if there was a completely fair sponsorship model for developing community PostgreSQL features, overseen by a benevolent, free, and completely unaffiliated overlord, we're not quite there yet. In cases like these, where there's evidence a company with a track record of delivering features is already involved, you're probably better off contacting someone from there directly--rather than trying to fit that into the public bounty model some PostgreSQL work is getting done via lately.

The visibility map is a particularly troublesome one, because the list of "capable" people who could work on that, but who aren't already working at a company having some relations with EDB, is rather slim. I know that's kind of frustrating to hear, for people who would like to get a feature done but can't finance the whole thing themselves. But look on the bright side--the base price is free, and when you give most PostgreSQL companies money to work on something it's at least possible to get what you want done. You'd have to pay a whole lot more than the $15K number you threw out there before any of the commercial database vendors would pay any attention to your particular feature request.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com    Baltimore, MD
PostgreSQL Training, Services and Support    www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
bricklen wrote: > On Sat, Oct 9, 2010 at 4:26 PM, Neil Whelchel <neil.whelchel@gmail.com> wrote: > > Maybe an > > estimate(*) that works like count but gives an answer from the index without > > checking visibility? I am sure that this would be good enough to make a page > > list, it is really no big deal if it errors on the positive side, maybe the > > list of pages has an extra page off the end. I can live with that. What I > > can't live with is taking 13 seconds to get a page of results from 850,000 > > rows in a table. > > -Neil- > > > > FWIW, Michael Fuhr wrote a small function to parse the EXPLAIN plan a > few years ago and it works pretty well assuming your stats are up to > date. > > http://markmail.org/message/gknqthlwry2eoqey What I recommend is to execute the query with EXPLAIN, and look at the estimated rows and costs. If the row number is large, just round it to the nearest thousand and return it to the application as a count --- this is what Google does for searches (just try it). If the row count/cost are low, run the query and return an exact count. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
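[For reference, a function along the lines Bruce and bricklen describe looks roughly like the sketch below. This is not Michael Fuhr's original; the function and variable names are made up. It runs EXPLAIN on the query text and pulls the planner's row estimate out of the first plan line, so the result is only as good as the table's statistics.]

-- Rough row-count estimate from the planner, for building page lists.
-- Pass a complete SELECT statement as text.
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- the top plan node's line contains "rows=NNN"
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql STRICT;

-- usage: SELECT count_estimate('SELECT * FROM foo WHERE bar = 1');

[Since the number is approximate anyway, rounding it to the nearest thousand before showing it to the user, as suggested above, hides most of the estimation error.]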
On Oct 12, 2010, at 11:58 AM, Tom Lane wrote:
> Jesper Krogh <jesper@krogh.cc> writes:
>> On 2010-10-12 19:07, Tom Lane wrote:
>>> Anyway, if anyone is hot to make COUNT(*) faster, that's where to look.
>
>> Just having 32 bytes bytes of "payload" would more or less double
>> you time to count if I read you test results correctly?. .. and in the
>> situation where diskaccess would be needed .. way more.
>
>> Dividing by pg_relation_size by the amout of tuples in our production
>> system I end up having no avg tuple size less than 100bytes.
>
> Well, yeah. I deliberately tested with a very narrow table so as to
> stress the per-row CPU costs as much as possible. With any wider table
> you're just going to be I/O bound.

On a wimpy disk, I/O bound for sure. But my disks go 1000MB/sec. No query can go fast enough for them. The best I've gotten is 800MB/sec, on a wide row (average 800 bytes). Most tables go 300MB/sec or so. And with 72GB of RAM, many scans are in-memory anyway.

A single SSD with supercapacitor will go about 500MB/sec by itself next spring. I will easily be able to build a system with 2GB/sec I/O for under $10k.
On 2010-10-21 06:47, Scott Carey wrote:
> On a wimpy disk, I/O bound for sure. But my disks go 1000MB/sec.
> No query can go fast enough for them. The best I've gotten is
> 800MB/sec, on a wide row (average 800 bytes). Most tables go
> 300MB/sec or so. And with 72GB of RAM, many scans are in-memory
> anyway.
Is it CPU or I/O bound while doing it? Can you scan it faster using "time cat relation-oid.* > /dev/null"?
> A single SSD with supercapacitor will go about 500MB/sec by itself
> next spring. I will easily be able to build a system with 2GB/sec
> I/O for under $10k.
What filesystem are you using? Readahead? Can you try to check the filesystem fragmentation of the table using filefrag?

--
Jesper
On Oct 21, 2010, at 11:13 AM, Jesper Krogh wrote:
> On 2010-10-21 06:47, Scott Carey wrote:
>> On a wimpy disk, I/O bound for sure. But my disks go 1000MB/sec.
>> No query can go fast enough for them. The best I've gotten is
>> 800MB/sec, on a wide row (average 800 bytes). Most tables go
>> 300MB/sec or so. And with 72GB of RAM, many scans are in-memory
>> anyway.
>
> Is it CPU or I/O bound while doing it?

I/O bound with the fio benchmark tool if 16K blocks or greater, CPU bound with 8K blocks or smaller. CentOS 5.5. CPU bound with postgres.

> Can you scan it faster using "time cat relation-oid.* > /dev/null"?

I'm not sure what you mean. In psql, select * piped to /dev/null is VERY CPU bound because of all the formatting. I haven't toyed with COPY. Do you mean the actual files? 'dd' tests from the actual files are similar to fio, but not as consistent and harder to run concurrently. That is faster than postgres.

>> A single SSD with supercapacitor will go about 500MB/sec by itself
>> next spring. I will easily be able to build a system with 2GB/sec
>> I/O for under $10k.
>
> What filesystem are you using? Readahead?
> Can you try to check the filesystem fragmentation of the table using filefrag?

XFS, defragmented once a day. Readahead 40960 (20MB, 1MB per spindle). Two raid 10 arrays of 10 discs each (2 hot spare), software raid-0 tying those together (md, 1MB blocks). Two Adaptec 5805 (or 5085, the external SAS one). A third raid card for the OS/xlog with 4x 10krpm SAS drives internal.

Fragmentation quickly takes this down a lot, as do small files and concurrent activity, since it's only enough spindles for ~2000 iops. But it's almost all large reporting queries on partitioned tables (500,000 partitions). A few smaller tables are starting to cause too many seeks, so those might end up on a smaller, high-iops tablespace later.

Over time the disks have filled up and there is a significant slowdown in sequential transfer at the end of the partition -- 600MB/sec max. That is still CPU bound on most scans, but postgres can go that fast on some scans.

Off topic: another interesting feature is how this setup causes the system tables to bloat by factors of 2x to 8x each week, and requires frequent vacuum full + reindex on several of them or else they grow to 1.5GB in size. Nothing like lots of temp table work + hour-long concurrent transactions to make the system catalog bloat. I suppose with 8.4 many temp tables could be replaced using WITH queries, but in other cases analyzing a temp table is the only way to get a sane query plan.
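[As an aside on that last point, the trade-off being described is roughly the following; the table and column names are invented for illustration. A WITH query keeps everything in one statement, while a temp table can be ANALYZEd so the planner has real statistics for the intermediate result before the second query is planned.]

-- One-statement version (8.4+): no statistics on the intermediate result
WITH recent AS (
    SELECT customer_id, sum(amount) AS total
    FROM orders
    WHERE order_date > now() - interval '30 days'
    GROUP BY customer_id
)
SELECT * FROM recent WHERE total > 1000;

-- Temp-table version: ANALYZE gives the planner real row counts,
-- at the cost of catalog churn from creating and dropping the table
CREATE TEMP TABLE recent AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    WHERE order_date > now() - interval '30 days'
    GROUP BY customer_id;
ANALYZE recent;
SELECT * FROM recent WHERE total > 1000;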