Thread: Yet Another COUNT(*)...WHERE...question
I'm grappling with a lot of reporting code for our app that relies on queries such as:

SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...

And I still do not find, from the discussions on this thread, any truly viable solution for this. The one suggestion is to have a separate counts table, which is fine for total aggregates related to, say, an ID. E.g., a table with:

trader_id, trade_count

But this is an overall count for the trader (in my example). What if I need a count of all his trades in the last week? Then I need a timestamp condition in there as well. The number of possible combinations of WHERE conditions is infinite... how should we account for all these avenues?

Would love to hear the experiences of others and what compromises they have made. From a reporting perspective, waiting for 10 minutes for a simple count to return seems untenable. TIA!
"Phoenix Kiula" <phoenix.kiula@gmail.com> writes: > I'm grappling with a lot of reporting code for our app that relies on > queries such as: > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)... >... > The number of such possibilities for multiple WHERE conditions is > infinite... Depends on the "conditions" bit. You can't solve all of the infinite possibilities -- well you can, just run the query above -- but if you want to do better it's all about understanding your data. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
> "Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
>
> > I'm grappling with a lot of reporting code for our app that relies on
> > queries such as:
> >
> > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> >...
> > The number of such possibilities for multiple WHERE conditions is
> > infinite...
>
> Depends on the "conditions" bit. You can't solve all of the infinite
> possibilities -- well you can, just run the query above -- but if you want
> to do better it's all about understanding your data.

I am not sure what the advice here is. The WHERE conditions are all on indexed columns. So if the query selects just a couple of columns instead of "COUNT(*)", it executes in less than a second; it's only COUNT(*) that becomes horribly slow. And since the file-system-based query caching behaviour of PG is unclear to me (I am just moving from MySQL, where the query cache is quite powerful), I don't quite know what to do to speed up these queries!
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > I'm grappling with a lot of reporting code for our app that relies on > queries such as: > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)... > > And I still do not find, from the discussions on this thread, any > truly viable solution for this. The one suggestion is to have a > separate counts table, which is fine for total aggregates related to, > say, an ID. E.g., a table with: > > trader_id, trade_count > > But this is an overall count for the trader (in my example). What if I > need a count of all his trades in the last one week. Then I need a > timestamp condition in there as well. The number of such possibilities > for multiple WHERE conditions is infinite...how should we account for > all these avenues? > > Would love to hear experiences of others and what compromises they > have made. From a reporting perspective, waiting for 10 minutes for a > simple count to return seems untenable. Generally, for these kinds of things it's often best to use materialized views / rollup tables so that you aren't re-aggregating the same data over and over.
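A rollup table along these lines might look like the following. This is only a sketch: the trades table and its column names are assumptions for illustration, and a real rollup would be refreshed on a schedule or maintained by triggers.

```sql
-- Hypothetical daily rollup of the trades table:
CREATE TABLE trade_counts_daily AS
SELECT trader_id,
       date_trunc('day', trade_time) AS day,
       count(*) AS trade_count
FROM trades
GROUP BY trader_id, date_trunc('day', trade_time);

-- "Trades in the last week for one trader" then becomes a cheap
-- sum over a handful of small rows instead of a scan of trades:
SELECT sum(trade_count)
FROM trade_counts_daily
WHERE trader_id = 42
  AND day >= date_trunc('day', now()) - interval '7 days';
```

The same idea generalizes: bucket by whatever dimensions the reports actually filter on (trader, day, category), and answer range queries by summing buckets.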
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
> > "Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
> >
> > > I'm grappling with a lot of reporting code for our app that relies on
> > > queries such as:
> > >
> > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > >...
> > > The number of such possibilities for multiple WHERE conditions is
> > > infinite...
> >
> > Depends on the "conditions" bit. You can't solve all of the infinite
> > possibilities -- well you can, just run the query above -- but if you
> > want to do better it's all about understanding your data.
>
> I am not sure what the advice here is. The WHERE condition comes from
> the indices. So if the query was not "COUNT(*)" but just a couple of
> columns, the query executes in less than a second. Just that COUNT(*)
> becomes horribly slow.

Sorry, but I don't believe you. If you're doing a select on the same dataset that returns in < 1 second, then a count(*) with the same where clause will run in < 1 second. I haven't seen pgsql do anything else.

> And since the file system based query caching
> feature of PG is unclear to me

There is no "query caching" in pgsql. There is data caching. Each query has to get planned and executed though (unless prepared, then just executed).

> (I am just moving from MySQL where the
> cache is quite powerful)

As long as nothing is changing behind the query and invalidating the query cache. It is useful for reporting apps, but in a constantly updating db it's pretty much useless.

> I don't quite know what to do to speed up
> these queries!

Post them with explain analyze output, i.e.

explain analyze yourqueryhere

Cut and paste the query and the output, as well as the table schema.
Phoenix Kiula wrote: >>> >>> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)... > > I am not sure what the advice here is. The WHERE condition comes from > the indices. So if the query was not "COUNT(*)" but just a couple of > columns, the query executes in less than a second. Just that COUNT(*) > becomes horribly slow. The count(*) shouldn't slow things down compared to running the query to fetch columns. It should be at least as fast, or faster if the columns you fetch are large. 1. Do you have an example? 2. You're not running a query to get the columns, then a separate count(*) to get a rowcount are you? > And since the file system based query caching > feature of PG is unclear to me (I am just moving from MySQL where the > cache is quite powerful) I don't quite know what to do to speed up > these queries! There isn't a "file system based query caching" feature, there's your operating-systems file-cache and PG's buffers. Neither of which cache query-results, but cache disk pages instead. -- Richard Huxton Archonet Ltd
--- Scott Marlowe <scott.marlowe@gmail.com> wrote: > Generally, for these kinds of things it's often best to use > materialized views / rollup tables so that you aren't re-aggregating > the same data over and over. I don't know if this was already mentioned, but here is one of the links that describe the method of implementing a materialized view. http://www.jonathangardner.net/PostgreSQL/materialized_views/matviews.html other useful docs like this one can be found here: http://www.postgresql.org/docs/techdocs.2 Regards, Richard Broersma Jr.
On 15/08/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> > On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
> > > "Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
> > >
> > > > I'm grappling with a lot of reporting code for our app that relies on
> > > > queries such as:
> > > >
> > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > > >...
> > > > The number of such possibilities for multiple WHERE conditions is
> > > > infinite...
> > >
> > > Depends on the "conditions" bit. You can't solve all of the infinite
> > > possibilities -- well you can, just run the query above -- but if you
> > > want to do better it's all about understanding your data.
> >
> > I am not sure what the advice here is. The WHERE condition comes from
> > the indices. So if the query was not "COUNT(*)" but just a couple of
> > columns, the query executes in less than a second. Just that COUNT(*)
> > becomes horribly slow.
>
> Sorry, but I don't believe you. If you're doing a select on the same
> dataset that returns in < 1 second, then a count(*) with the same
> where clause will run in < 1 second. I haven't seen pgsql do
> anything else.

Sorry I was not clear. Imagine an Amazon.com search results page. It has about 15 results on Page 1, then it shows "Page 1 of 190".

To show each page, the query probably has a "LIMIT 15 OFFSET 0" for Page 1. However, to calculate the total number of pages, they probably do a separate counts query, because doing a "select *" and then counting the number of rows returned would be even more inefficient than a count(*).

So, in reporting, two queries are fairly common I would think, unless I am missing something?
On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> On 15/08/07, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> > On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> > > On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote:
> > > > "Phoenix Kiula" <phoenix.kiula@gmail.com> writes:
> > > >
> > > > > I'm grappling with a lot of reporting code for our app that relies on
> > > > > queries such as:
> > > > >
> > > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > > > >...
> > > > > The number of such possibilities for multiple WHERE conditions is
> > > > > infinite...
> > > >
> > > > Depends on the "conditions" bit. You can't solve all of the infinite
> > > > possibilities -- well you can, just run the query above -- but if you
> > > > want to do better it's all about understanding your data.
> > >
> > > I am not sure what the advice here is. The WHERE condition comes from
> > > the indices. So if the query was not "COUNT(*)" but just a couple of
> > > columns, the query executes in less than a second. Just that COUNT(*)
> > > becomes horribly slow.
> >
> > Sorry, but I don't believe you. If you're doing a select on the same
> > dataset that returns in < 1 second, then a count(*) with the same
> > where clause will run in < 1 second. I haven't seen pgsql do
> > anything else.
>
> Sorry I was not clear. Imagine an Amazon.com search results page. It
> has about 15 results on Page 1, then it shows "Page 1 of 190".
>
> To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
> Page 1. However, to calculate the total number of pages, they probably
> do a separate counts query, because doing a "select *" and then
> counting the number of rows returned would be even more inefficient
> than a count(*).

When I go to amazon.com I only ever get three pages of results. Ever. Because they know that returning 190 pages is not that useful, as hardly anyone is going to wander through that many pages.
Google, you'll notice, says "Results 1 - 10 of about 5,610,000 for blacksmith", i.e. it's guesstimating as well. No reason for google to look at every single row for blacksmith to know that there's about 5.6 million.

> So, in reporting, two queries are fairly common I would think, unless
> I am missing something?

Yes, optimization. :) You don't need an exact count to tell someone that there's more data and they can go to it.

Note that if you are planning on doing things google-sized, you'll need to do what they did: invent your own specialized database. For us mere mortals, it's quite likely that you can do something like:

explain select * from table where field like 'abc%';

and then parse the explain output for an approximate number.
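Scraping the planner's estimate can be wrapped up in a small PL/pgSQL function. This is only a sketch, not something from the thread: the function name is made up, and it assumes PL/pgSQL is installed in the database.

```sql
-- Return the planner's row estimate for an arbitrary query by running
-- EXPLAIN and pulling the "rows=" figure out of the top plan node.
CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        -- substring() returns the parenthesized capture group
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql;

-- e.g.: SELECT count_estimate('SELECT * FROM trades WHERE t_id = ''kXjha''');
```

The result is only as good as the planner's statistics, which is exactly the accuracy trade-off being discussed here.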
--- Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> Sorry I was not clear. Imagine an Amazon.com search results page. It
> has about 15 results on Page 1, then it shows "Page 1 of 190".

I don't think that amazon or google really need to give an accurate count in determining an estimated number of pages... Could you determine the number of pages quickly from postgresql:

[ row count estimate ] / [ number of rows you want per page ]

The estimated row count is updated every time you vacuum your tables, and getting the estimate takes very little time.

> To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
> Page 1.

The "LIMIT 15 OFFSET 1500" technique can be a performance killer, since OFFSET still has to step past all the skipped rows even when an index is used. It is better to use the last entry of each page in the query for the next page, so you can write your query this way:

SELECT *
FROM your_table
WHERE item_nbr > [: last item on previous page :]
ORDER BY item_nbr
LIMIT 15;

This method was discussed on the list a couple of months ago.

Regards,
Richard Broersma Jr.
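The row-count estimate mentioned above can be read straight from the system catalogs, and is kept reasonably current by VACUUM and ANALYZE (the table name here is illustrative):

```sql
-- Planner's estimate of the table's total row count; effectively
-- instant, but only approximate between vacuums/analyzes.
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'your_table';
```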
> Yes, optimization. :) You don't need an exact count to tell someone
> that there's more data and they can go to it.

In general, I agree. But my example of Amazon was only to illustrate the point about two queries and why they may be needed. I seem to see many more pages than you do, but in any case, Google and Amazon can afford to be less precise.

Thanks for the suggestion of using EXPLAIN and parsing an approximation, but when you need to show a trader how many trades he has made, for instance, then approximation is not a possibility at all. Especially not if the numbers sway so wildly --

FIRSTDB=# explain select * from trades where t_id = 'kXjha';
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on trades  (cost=15.77..1447.12 rows=374 width=224)
   Recheck Cond: ((t_id)::text = 'kXjha'::text)
   ->  Bitmap Index Scan on trades_tid_date  (cost=0.00..15.67 rows=374 width=0)
         Index Cond: ((t_id)::text = 'kXjha'::text)
(4 rows)

FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
 count
-------
  3891
(1 row)

Could I do something so that the EXPLAIN showed up with slightly more close-to-accurate stats? The above query is just after a "vacuum analyze"!

Much appreciate the suggestions.
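One knob that can help with estimates this far off is the per-column statistics target (a hedged suggestion; the table and column names are taken from the example above, and the useful target value depends on the data's skew):

```sql
-- Collect a larger sample for t_id so the planner's row estimates
-- for individual values like 'kXjha' get closer to reality.
ALTER TABLE trades ALTER COLUMN t_id SET STATISTICS 1000;
ANALYZE trades;
```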
In response to "Phoenix Kiula" <phoenix.kiula@gmail.com>: > > Yes, optimization. :) You don't need an exact count to tell someone > > that there's more data and they can go to it. > > > In general, I agree. But my example of Amazon was only to illustrate > the point about two queries and why they may be needed. I seem to see > many more pages than you do, but in any case, Google and Amazon can > afford to be less precise. > > Thanks for the suggestion of using EXPLAIN and parsing an > approximation, but when you need to show a trader how many trades he > has made, for instance, then approximation is not a possibility at > all. Especially not if the numbers sway so wildly -- > > > FIRSTDB=# explain select * from trades where t_id = 'kXjha'; > QUERY PLAN > ----------------------------------------------------------------------------------- > Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224) > Recheck Cond: ((t_id)::text = 'kXjha'::text) > -> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0) > Index Cond: ((t_id)::text = 'kXjha'::text) > (4 rows) > > FIRSTDB=# select count(*) from trades where t_id = 'kXjha'; > count > ------- > 3891 > (1 row) > > > > Could I do something so that the EXPLAIN showed up with slightly more > close-to-accurate stats? The above query is just after a "vacuum > analyze"! In the above case, you could probably materialize the data with a trigger that updates a counter in a separate table every time a new trade is added. This will give you 100% accurate results with _very_ fast response time. Part of the problem is that there's no one answer to your question, there are multiple approaches to solving it, depending on the details of the problem and the acceptable time/accuracy of the answers. Some basic approaches: 1) Materialize the data. MySQL actually does this automatically for you with MyISAM tables, which is why count(*) is so fast. 
But if you absolutely need fast, accurate counts, you can build your own triggers in PG. This is unlikely to be practical with all queries.

2) Estimate. The accuracy of estimates can vary wildly by query and by how often the database is analyzed, etc. For something like "show results 1 - 10 of about 50,000", estimates are great and fast, but for other cases, not acceptable. The good news is you can get a fast estimate from any query with no up-front work.

3) Accept that sometimes, to get accurate answers, it's going to take time. Around here, we call it the "Orbitz" technique, because when we discuss it, everyone thinks of the "please wait while I process your query" page you get from orbitz.com. You'd be surprised how willing your users are to wait, as long as they know they have to wait.

4) Throw more hardware at it. If you absolutely _must_have_ super-accurate results faster, then you may need to buy more RAM, faster disks and faster CPUs to accomplish it.

5) Come up with something revolutionary that nobody's ever thought of before. Good luck with this one.

Of course, all of these ideas are only practical if you've already ensured that your system is properly tuned. Crappy values for shared_buffers and other settings will lead you to waste time trying to redesign something that should work just fine, so verify all your configuration first. You may be able to get more acceptable estimates by increasing your statistics targets, for example.

-- 
Bill Moran
http://www.potentialtech.com
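The trigger-maintained counter from point 1 might be sketched like this. All names here are illustrative, and a real version would also need to insert the counter row for a new trader and consider contention on the counter rows under heavy concurrent writes:

```sql
-- Side table holding one pre-aggregated count per trader.
CREATE TABLE trade_counts (
    trader_id   integer PRIMARY KEY,
    trade_count bigint NOT NULL DEFAULT 0
);

-- Keep the counter in step with inserts and deletes on trades.
CREATE FUNCTION bump_trade_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE trade_counts SET trade_count = trade_count + 1
         WHERE trader_id = NEW.trader_id;
    ELSE  -- DELETE
        UPDATE trade_counts SET trade_count = trade_count - 1
         WHERE trader_id = OLD.trader_id;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trades_count_trg
    AFTER INSERT OR DELETE ON trades
    FOR EACH ROW EXECUTE PROCEDURE bump_trade_count();

-- The 100%-accurate count is now a single-row primary-key lookup:
-- SELECT trade_count FROM trade_counts WHERE trader_id = 42;
```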
On Aug 15, 2007, at 9:36 AM, Phoenix Kiula wrote:
> I'm grappling with a lot of reporting code for our app that relies on
> queries such as:
>
> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
>
> And I still do not find, from the discussions on this thread, any
> truly viable solution for this. The one suggestion is to have a
> separate counts table, which is fine for total aggregates related to,
> say, an ID. E.g., a table with:
>
> trader_id, trade_count
>
> But this is an overall count for the trader (in my example). What if I
> need a count of all his trades in the last one week. Then I need a
> timestamp condition in there as well. The number of such possibilities
> for multiple WHERE conditions is infinite...how should we account for
> all these avenues?

There is no general solution. While theoretically the multiple WHERE conditions are infinite, in reality they're limited to your actual use cases, and the solutions are thereby dictated by those. Using a separate cache table is a viable option often used in situations where constantly up-to-date, realtime values are needed. Another common option is smart usage of indexes, i.e. remember that you can index on the result of a function applied to row values, and that you can create partial indexes. Another is table partitioning.

Asking how to optimize "SELECT COUNT(*) FROM TABLE WHERE ...(conditions)" is not a good question, as the solution is dependent on those conditions. Pick your most common conditions and optimize for those.

Also, in many cases for reporting apps, 10 minutes is not long at all. If you have reports that you can't make happen faster, schedule and automate them.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
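Partial and expression indexes can make the common conditions cheap. A sketch, with table and column names that are assumptions rather than anything from this thread:

```sql
-- Partial index: only index the rows the common reports actually
-- touch, keeping the index small and hot in cache.
CREATE INDEX trades_open_idx ON trades (trader_id)
    WHERE status = 'open';

-- Expression index: index a function of the row so per-day
-- conditions can use an index scan instead of a seq scan.
CREATE INDEX trades_day_idx ON trades (date_trunc('day', trade_time));
```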
I don't know how PSQL does it, but MySQL has an SQL_CALC_FOUND_ROWS extension which allows the query to also return how many rows exist without the LIMIT clause. Perhaps there is similar for PSQL (check LIMIT docs?) - Andrew -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Scott Marlowe Sent: Thursday, 16 August 2007 1:24 AM To: Phoenix Kiula Cc: Gregory Stark; Postgres General Subject: Re: [GENERAL] Yet Another COUNT(*)...WHERE...question On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > On 15/08/07, Scott Marlowe <scott.marlowe@gmail.com> wrote: > > On 8/15/07, Phoenix Kiula <phoenix.kiula@gmail.com> wrote: > > > On 15/08/07, Gregory Stark <stark@enterprisedb.com> wrote: > > > > "Phoenix Kiula" <phoenix.kiula@gmail.com> writes: > > > > > > > > > I'm grappling with a lot of reporting code for our app that relies on > > > > > queries such as: > > > > > > > > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)... > > > > >... > > > > > The number of such possibilities for multiple WHERE conditions is > > > > > infinite... > > > > > > > > Depends on the "conditions" bit. You can't solve all of the infinite > > > > possibilities -- well you can, just run the query above -- but if you want > to do better it's all about understanding your data. > > > > > > > > > I am not sure what the advice here is. The WHERE condition comes from > > > the indices. So if the query was not "COUNT(*)" but just a couple of > > > columns, the query executes in less than a second. Just that COUNT(*) > > > becomes horribly slow. > > > > Sorry, but I don't believe you. if you're doing a count(*) on the > > same dataset that returns in < 1 second, then the count(*) with the > > same where clause will run in < 1 second. I haven't seen pgsql do > > anything else. > > > > Sorry I was not clear. Imagine an Amazon.com search results page. It > has about 15 results on Page 1, then it shows "Page 1 of 190". 
> > To show each page, the query probably has a "LIMIT 15 OFFSET 0" for > Page 1. However, to calculate the total number of pages, they probably > do a separate counts query, because doing a "select *" and then > counting the number of rows returned would be even more inefficient > than a count(*). When I go to amazon.com I only ever get three pages of results. ever. Because they know that returning 190 pages is not that useful, as hardly anyone is going to wander through that many pages. Google, you'll notice says "Results 1 - 10 of about 5,610,000 for blacksmith" i.e. it's guesstimating as well. no reason for google to look at every single row for blacksmith to know that there's about 5.6 million. > So, in reporting, two queries are fairly common I would think, unless > I am missing something? Yes, optimization. :) You don't need an exact count to tell someone that there's more data and they can go to it. Note that if you are planning on doing things google sized, you'll need to do what they did, invent your own specialized database. For us mere mortals, it's quite likely that you can do something like: explain select * from table where field like 'abc%'; and then parse the explain output for an approximate number. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
"Scott Marlowe" wrote: >When I go to amazon.com I only ever get three pages of results. ever. > Because they know that returning 190 pages is not that useful, as >hardly anyone is going to wander through that many pages. > >Google, you'll notice says "Results 1 - 10 of about 5,610,000 for >blacksmith" i.e. it's guesstimating as well. no reason for google to >look at every single row for blacksmith to know that there's about 5.6 >million. But if you go to eBay, they always give you an accurate count. Even if the no. of items found is pretty large (example: <http://search.ebay.com/new>). Rainer
On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote: > "Scott Marlowe" wrote: > > >When I go to amazon.com I only ever get three pages of results. ever. > > Because they know that returning 190 pages is not that useful, as > >hardly anyone is going to wander through that many pages. > > > >Google, you'll notice says "Results 1 - 10 of about 5,610,000 for > >blacksmith" i.e. it's guesstimating as well. no reason for google to > >look at every single row for blacksmith to know that there's about 5.6 > >million. > > But if you go to eBay, they always give you an accurate count. Even if the no. > of items found is pretty large (example: <http://search.ebay.com/new>). And I'd bet money that they're using a full text search of some kind to get those results, which isn't remotely close to the same thing as a generic SELECT count(*). -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Decibel! wrote:
>On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
>> "Scott Marlowe" wrote:
>>
>> >When I go to amazon.com I only ever get three pages of results. ever.
>> > Because they know that returning 190 pages is not that useful, as
>> >hardly anyone is going to wander through that many pages.
>> >
>> >Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
>> >blacksmith" i.e. it's guesstimating as well. no reason for google to
>> >look at every single row for blacksmith to know that there's about 5.6
>> >million.
>>
>> But if you go to eBay, they always give you an accurate count. Even if the no.
>> of items found is pretty large (example: <http://search.ebay.com/new>).
>
>And I'd bet money that they're using a full text search of some kind to
>get those results, which isn't remotely close to the same thing as a
>generic SELECT count(*).

Without text search (but with a category restriction):
<http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>

I only wanted to show a counter-example of a big site which uses pagination to display result sets and still reports accurate counts.

Anyway, what Phoenix is trying to say is that 2 queries are required: one to get the total count and one to get the tuples for the current page. I reckon it would help if the query returning the result set could also report the total no. of tuples found. Something like

SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Or is there a way to do that?

Rainer
"Rainer Bauer" <usenet@munnin.com> writes: > Anyway, what Phoenix is trying to say is that 2 queries are required: One to > get the total count and one to get the tuples for the current page. I reckon > it would help, if the query returning the result set could also report the > total no. of tuples found. Somthing like > SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l> > > Or is there a way to do that? Well anything like the above would just report l as the count. The only way to do it in Postgres currently is to create a temporary table. Then you can populate it once, then select the count from the temporary table in one query and the required page from it in the second query. But temporary tables in Postgres are not really designed for this. In particular they count as DDL so you have to grant privileges to create tables to the application and it has to create and delete entries in pg_class for every use. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote: > >> But if you go to eBay, they always give you an accurate count. Even if the no. > >> of items found is pretty large (example: <http://search.ebay.com/new>). > > > >And I'd bet money that they're using a full text search of some kind to > >get those results, which isn't remotely close to the same thing as a > >generic SELECT count(*). > > Without text search (but with a category restriction): > <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList> > > I only wanted to show a counter-example for a big site which uses pagination > to display result sets and still reports accurate counts. Categories are still finite state: you can simply store a count for each category. Again it's just a case of knowing your data and queries; it's not trying to solve a general infinite-possibilities situation. For instance, the OP mentioned wanting to get data on a particular trader for the last week. Maintain a summary table that keeps counts of each trader for each week, and ID bounds for the actual data table. When you need to query the last 4 weeks, sum(). When you need to query the last 30 days, sum() 4 weeks + a query on the master table bounded by timestamp and ID range for the 5th week from the summary table. I'm sure there are sites out there that provide precise counts quickly for extremely complex queries on gigantic datasets, but all the common stuff is about specifics, not arbitrary queries. There are also systems other than SQL RDBMS that can be used to drive such reporting.
On Thu, Aug 16, 2007 at 01:09:32PM +0200, Rainer Bauer wrote:
> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
> get the total count and one to get the tuples for the current page. I reckon
> it would help, if the query returning the result set could also report the
> total no. of tuples found. Something like
> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Well, there is another possibility: use cursors.

DECLARE cur CURSOR FOR <query>;
FETCH 30 FROM cur;        -- or however many you want now
MOVE FORWARD ALL IN cur;  -- the command tag reports how many rows remain

Hope this helps,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
"Trevor Talbot" wrote: >On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote: > >> >> But if you go to eBay, they always give you an accurate count. Even if the no. >> >> of items found is pretty large (example: <http://search.ebay.com/new>). >> > >> >And I'd bet money that they're using a full text search of some kind to >> >get those results, which isn't remotely close to the same thing as a >> >generic SELECT count(*). >> >> Without text search (but with a category restriction): >> <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList> >> >> I only wanted to show a counter-example for a big site which uses pagination >> to display result sets and still reports accurate counts. > >Categories are still finite state: you can simply store a count for >each category. Again it's just a case of knowing your data and >queries; it's not trying to solve a general infinite-possibilities >situation. Consider this query with multiple WHERE conditions: <http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* numberof items found. Before this drifts off: * I do know *why* count(*) is slow using Postgres. * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>). * I realize that pagination for multi-million tuple results does not make sense. Rainer
Gregory Stark wrote:
>"Rainer Bauer" <usenet@munnin.com> writes:
>
>> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
>> get the total count and one to get the tuples for the current page. I reckon
>> it would help, if the query returning the result set could also report the
>> total no. of tuples found. Something like
>> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>>
>> Or is there a way to do that?
>
>Well anything like the above would just report l as the count.

True, but what about this:

SELECT (SELECT COUNT(*) FROM <table> WHERE <cond>), *
FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

I just tested this on a query and it's about 5-10% faster than issuing both commands separately (caching effects?). I wonder whether there would be any chance that Postgres could detect that the "count" select and the "data" select result in the same query plan? In my example (which is included below) the hash join is executed twice.

>The only way to do it in Postgres currently is to create a temporary table.
>Then you can populate it once, then select the count from the temporary table
>in one query and the required page from it in the second query.
>
>But temporary tables in Postgres are not really designed for this. In
>particular they count as DDL so you have to grant privileges to create tables
>to the application and it has to create and delete entries in pg_class for
>every use.

Well, I don't think populating a temporary table with possibly millions of rows is faster than executing the query twice. Remember that a performance problem only occurs if there are a lot of tuples returned.
Rainer

======================================================================
This is the count query:

SELECT COUNT(*)
FROM "tblItem" AS i
INNER JOIN (SELECT "intItemID" FROM "tblItem2Category"
            WHERE "intCategoryID"=88869805) AS vrtChild
        ON i."intItemIDCnt"=vrtChild."intItemID"
WHERE ("intTimeEnd" < 1187273177)

This is the EXPLAIN ANALYZE output:

"Aggregate  (cost=356098.46..356098.47 rows=1 width=0) (actual time=29411.570..29411.570 rows=1 loops=1)"
"  ->  Hash Join  (cost=177545.23..350137.60 rows=2384343 width=0) (actual time=17382.286..28864.851 rows=2383740 loops=1)"
"        Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
"        ->  Bitmap Heap Scan on "tblItem2Category"  (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1984.006..11048.762 rows=2513204 loops=1)"
"              Recheck Cond: ("intCategoryID" = 88869805)"
"              ->  Bitmap Index Scan on ccitem2categorycategoryidix  (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1980.614..1980.614 rows=2513204 loops=1)"
"                    Index Cond: ("intCategoryID" = 88869805)"
"        ->  Hash  (cost=95316.41..95316.41 rows=2339583 width=4) (actual time=15024.827..15024.827 rows=2383832 loops=1)"
"              ->  Seq Scan on "tblItem" i  (cost=0.00..95316.41 rows=2339583 width=4) (actual time=8.634..13763.878 rows=2383832 loops=1)"
"                    Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 29411.668 ms"

======================================================================
This is the data query:

SELECT i."intItemIDCnt"
FROM "tblItem" AS i
INNER JOIN (SELECT "intItemID" FROM "tblItem2Category"
            WHERE "intCategoryID"=88869805) AS vrtChild
        ON i."intItemIDCnt"=vrtChild."intItemID"
WHERE ("intTimeEnd" < 1187273177)
ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt"
OFFSET 500 LIMIT 50

This is the EXPLAIN ANALYZE output:

"Limit  (cost=733011.30..733011.42 rows=50 width=12) (actual time=37852.007..37852.058 rows=50 loops=1)"
"  ->  Sort  (cost=733010.05..738970.91 rows=2384343 width=12) (actual time=37851.581..37851.947 rows=550 loops=1)"
"        Sort Key: i."intFlagListingFeatured", i."intTimeEnd", i."intItemIDCnt""
"        ->  Hash Join  (cost=179830.23..354707.60 rows=2384343 width=12) (actual time=17091.753..29040.425 rows=2383740 loops=1)"
"              Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
"              ->  Bitmap Heap Scan on "tblItem2Category"  (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1976.599..10970.394 rows=2513204 loops=1)"
"                    Recheck Cond: ("intCategoryID" = 88869805)"
"                    ->  Bitmap Index Scan on ccitem2categorycategoryidix  (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1973.160..1973.160 rows=2513204 loops=1)"
"                          Index Cond: ("intCategoryID" = 88869805)"
"              ->  Hash  (cost=95316.41..95316.41 rows=2339583 width=12) (actual time=14758.256..14758.256 rows=2383832 loops=1)"
"                    ->  Seq Scan on "tblItem" i  (cost=0.00..95316.41 rows=2339583 width=12) (actual time=8.592..13373.179 rows=2383832 loops=1)"
"                          Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 38247.533 ms"

======================================================================
This is the combined count/data query:

SELECT (SELECT COUNT(*)
        FROM "tblItem" AS i
        INNER JOIN (SELECT "intItemID" FROM "tblItem2Category"
                    WHERE "intCategoryID"=88869805) AS vrtChild
                ON i."intItemIDCnt"=vrtChild."intItemID"
        WHERE ("intTimeEnd" < 1187273177)),
       i."intItemIDCnt"
FROM "tblItem" AS i
INNER JOIN (SELECT "intItemID" FROM "tblItem2Category"
            WHERE "intCategoryID"=88869805) AS vrtChild
        ON i."intItemIDCnt"=vrtChild."intItemID"
WHERE ("intTimeEnd" < 1187273177)
ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt"
OFFSET 500 LIMIT 50

"Limit  (cost=1089109.77..1089109.90 rows=50 width=12) (actual time=62547.673..62547.727 rows=50 loops=1)"
"  InitPlan"
"    ->  Aggregate  (cost=356098.46..356098.47 rows=1 width=0) (actual time=16385.927..16385.927 rows=1 loops=1)"
"          ->  Hash Join  (cost=177545.23..350137.60 rows=2384343 width=0) (actual time=4320.749..15843.759 rows=2383740 loops=1)"
"                Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
"                ->  Bitmap Heap Scan on "tblItem2Category"  (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=293.025..9211.673 rows=2513204 loops=1)"
"                      Recheck Cond: ("intCategoryID" = 88869805)"
"                      ->  Bitmap Index Scan on ccitem2categorycategoryidix  (cost=0.00..40919.69 rows=2561397 width=0) (actual time=289.602..289.602 rows=2513204 loops=1)"
"                            Index Cond: ("intCategoryID" = 88869805)"
"                ->  Hash  (cost=95316.41..95316.41 rows=2339583 width=4) (actual time=3461.292..3461.292 rows=2383832 loops=1)"
"                      ->  Seq Scan on "tblItem" i  (cost=0.00..95316.41 rows=2339583 width=4) (actual time=3.658..1692.979 rows=2383832 loops=1)"
"                            Filter: ("intTimeEnd" < 1187273177)"
"  ->  Sort  (cost=733010.05..738970.91 rows=2384343 width=12) (actual time=62547.242..62547.616 rows=550 loops=1)"
"        Sort Key: i."intFlagListingFeatured", i."intTimeEnd", i."intItemIDCnt""
"        ->  Hash Join  (cost=179830.23..354707.60 rows=2384343 width=12) (actual time=38625.024..51768.452 rows=2383740 loops=1)"
"              Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
"              ->  Bitmap Heap Scan on "tblItem2Category"  (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1968.052..2893.092 rows=2513204 loops=1)"
"                    Recheck Cond: ("intCategoryID" = 88869805)"
"                    ->  Bitmap Index Scan on ccitem2categorycategoryidix  (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1964.642..1964.642 rows=2513204 loops=1)"
"                          Index Cond: ("intCategoryID" = 88869805)"
"              ->  Hash  (cost=95316.41..95316.41 rows=2339583 width=12) (actual time=19915.284..19915.284 rows=2383832 loops=1)"
"                    ->  Seq Scan on "tblItem" i  (cost=0.00..95316.41 rows=2339583 width=12) (actual time=8.622..18369.696 rows=2383832 loops=1)"
"                          Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 63042.165 ms"
======================================================================
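[Editor's note: in the plans above, the same hash join is executed twice — once for the count, once for the page. A window function lets a single scan produce both. This is only a sketch: `count(*) OVER ()` requires PostgreSQL 8.4 or later, so it was not available on the servers discussed in this thread, and it still visits every qualifying row — it saves the second scan but does not make the count itself cheap. Table and column names are taken from the plans above.]

```sql
-- Sketch only: needs PostgreSQL 8.4+ (window functions).
SELECT count(*) OVER () AS total_rows,   -- same total repeated on every returned row
       i."intItemIDCnt"
FROM "tblItem" AS i
INNER JOIN "tblItem2Category" AS c ON i."intItemIDCnt" = c."intItemID"
WHERE c."intCategoryID" = 88869805
  AND i."intTimeEnd" < 1187273177
ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt"
OFFSET 500 LIMIT 50;
```

Because the window aggregate is computed before OFFSET/LIMIT are applied, `total_rows` is the count of all WHERE-qualified rows, not just the 50 on the page.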
Martijn van Oosterhout wrote:

>On Thu, Aug 16, 2007 at 01:09:32PM +0200, Rainer Bauer wrote:
>> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
>> get the total count and one to get the tuples for the current page. I reckon
>> it would help if the query returning the result set could also report the
>> total no. of tuples found. Something like
>> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>
>Well, there is another possibility: use cursors:
>
>DECLARE CURSOR ... AS <query>;
>FETCH 30  -- or however many you want now
>MOVE TO END  -- or whatever the command is; this gives you the number of rows
>
>Hope this helps,

Thanks! I will give this a try.

Rainer
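[Editor's note: for reference, the cursor suggestion spelled out — a sketch using the table names from earlier in the thread; the actual commands are DECLARE, FETCH and MOVE, and MOVE's command tag reports how many rows it skipped over.]

```sql
-- Sketch: a cursor only lives inside a transaction unless declared WITH HOLD.
BEGIN;

DECLARE items_cur SCROLL CURSOR FOR
    SELECT i."intItemIDCnt"
    FROM "tblItem" AS i
    INNER JOIN "tblItem2Category" AS c ON i."intItemIDCnt" = c."intItemID"
    WHERE c."intCategoryID" = 88869805
      AND i."intTimeEnd" < 1187273177
    ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt";

FETCH 50 FROM items_cur;        -- the first page of results

-- The command tag "MOVE n" reports the rows skipped,
-- so total = rows fetched so far + n.
MOVE FORWARD ALL IN items_cur;

COMMIT;
```

Note that the total is only known after the server has walked the entire result set, so this is not free either; it just avoids planning and executing the query twice.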
On 16/08/07, Rainer Bauer <usenet@munnin.com> wrote:
> Gregory Stark wrote:
>
> >"Rainer Bauer" <usenet@munnin.com> writes:
> >
> >> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
> >> get the total count and one to get the tuples for the current page. I reckon
> >> it would help if the query returning the result set could also report the
> >> total no. of tuples found. Something like
> >> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
> >>
> >> Or is there a way to do that?
> >
> >Well anything like the above would just report l as the count.
>
> True, but what about this:
>
> SELECT (SELECT COUNT(*) FROM <table> WHERE <cond>), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Whoa, this may not please SQL purists but I love it! And yes, it is cached. I find the idea of temporary tables and storing counts for different 'slices' of my data untenable with all the complex mishmash of triggers and such. The count(*) query seems to take a bit in the beginning but works ok thereafter because it seems to be auto-cached. Sweet. Thanks for sharing!!
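[Editor's note: for completeness, the trigger-maintained counts table dismissed above usually amounts to something like the following sketch, using the hypothetical trader/trade example from the start of the thread. Every insert or delete pays for one extra row update — that is the trade-off.]

```sql
-- Hypothetical schema: one pre-aggregated count per trader.
-- Assumes a trade_counts row is created whenever a trader is created.
CREATE TABLE trade_counts (
    trader_id   integer PRIMARY KEY,
    trade_count bigint  NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION maintain_trade_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE trade_counts SET trade_count = trade_count + 1
         WHERE trader_id = NEW.trader_id;
    ELSE  -- DELETE
        UPDATE trade_counts SET trade_count = trade_count - 1
         WHERE trader_id = OLD.trader_id;
    END IF;
    RETURN NULL;  -- return value of an AFTER ... FOR EACH ROW trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trades_count_trg
AFTER INSERT OR DELETE ON trades
FOR EACH ROW EXECUTE PROCEDURE maintain_trade_count();

-- The count becomes a primary-key lookup instead of a scan:
SELECT trade_count FROM trade_counts WHERE trader_id = 42;
```

As noted in the thread, this only covers counts planned for in advance; an ad-hoc WHERE clause (e.g. "trades in the last week") still needs its own aggregate.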
In response to Rainer Bauer <usenet@munnin.com>:

> "Trevor Talbot" wrote:
>
> >On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote:
> >
> >> >> But if you go to eBay, they always give you an accurate count. Even if the no.
> >> >> of items found is pretty large (example: <http://search.ebay.com/new>).
> >> >
> >> >And I'd bet money that they're using a full text search of some kind to
> >> >get those results, which isn't remotely close to the same thing as a
> >> >generic SELECT count(*).
> >>
> >> Without text search (but with a category restriction):
> >> <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>
> >>
> >> I only wanted to show a counter-example for a big site which uses pagination
> >> to display result sets and still reports accurate counts.
> >
> >Categories are still finite state: you can simply store a count for
> >each category. Again it's just a case of knowing your data and
> >queries; it's not trying to solve a general infinite-possibilities
> >situation.
>
> Consider this query with multiple WHERE conditions:
> <http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0>
>
> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.

While I don't _want_ to argue with you ... I can't seem to help myself.

How do you _know_ that's the exact number of items? There are 50 items on that page, the paginator at the bottom shows 97,686 pages, but there's no way (that I can find) to go to the _last_ page to ensure that said numbers are correct. It could simply be estimating the number of items and calculating the # of pages based on that. With 4mil items, a few 1000 off isn't anything anyone would notice.

> Before this drifts off:
> * I do know *why* count(*) is slow using Postgres.
> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).

That could be possible, but it's still speculation at this point. If someone with Oracle-fu could say for sure one way or the other, that would be interesting ... Unless there's data on that sun.com page that provides more detail. It doesn't seem to be willing to load for me at this point ...

> * I realize that pagination for multi-million tuple results does not make sense.

Then what is the point of this thread? Are we just shooting the breeze at this point?

-- 
Bill Moran
http://www.potentialtech.com
Rainer Bauer <usenet@munnin.com> writes:
> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.

And exactly how do you know that that's true?

			regards, tom lane
On Aug 16, 5:19 am, deci...@decibel.org (Decibel!) wrote:
> On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
> > "Scott Marlowe" wrote:
> > But if you go to eBay, they always give you an accurate count. Even if the no.
> > of items found is pretty large (example: <http://search.ebay.com/new>).
>
> And I'd bet money that they're using a full text search of some kind to
> get those results, which isn't remotely close to the same thing as a
> generic SELECT count(*).

http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

Search for the text "Scaling Search". Interesting stuff.
Bill Moran wrote:

>> Consider this query with multiple WHERE conditions:
>> <http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0>
>>
>> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.
>
>While I don't _want_ to argue with you ... I can't seem to help myself.
>
>How do you _know_ that's the exact number of items? There are 50 items on
>that page, the paginator at the bottom shows 97,686 pages, but there's no
>way (that I can find) to go to the _last_ page to ensure that said numbers
>are correct. It could simply be estimating the number of items and
>calculating the # of pages based on that. With 4mil items, a few 1000 off
>isn't anything anyone would notice.

No, those numbers are correct. You can go to the last page using the input box at the end of the listing ("Go to page"). However, with a 4 million result set, the count will have changed by the time the last page is retrieved (new items have been listed and others have ended in the meantime). You will have to check this out with a search yielding fewer results (a couple of hundred thousand) and load the first and last pages simultaneously.

>> * I realize that pagination for multi-million tuple results does not make sense.
>
>Then what is the point of this thread? Are we just shooting the breeze at
>this point?

I just wanted to show an example where accurate counts are reported for large search results.

Rainer
Tom Lane wrote:

>Rainer Bauer <usenet@munnin.com> writes:
>> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.
>
>And exactly how do you know that that's true?

Five years' experience developing a browser for eBay?

Rainer
On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote:

> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.
>
> Before this drifts off:
> * I do know *why* count(*) is slow using Postgres.
> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
> * I realize that pagination for multi-million tuple results does not make sense.

You got me curious, so I went hunting for more hints on what eBay actually does, and found these slides from a presentation given by two eBay engineers last year:
http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

It's, er, a whole different ballgame there. Database behavior is barely involved in their searching; they do joins and RI across database clusters within the _application_. I knew eBay was big, but wow...
"Trevor Talbot" wrote: >On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote: > >> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate*number of items found. >> >> Before this drifts off: >> * I do know *why* count(*) is slow using Postgres. >> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>). >> * I realize that pagination for multi-million tuple results does not make sense. > >You got me curious, so I went hunting for more hints on what eBay >actually does, and found these slides from a presentation given by two >eBay engineers last year: >http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf Quite interesting. >It's, er, a whole different ballgame there. Database behavior is >barely involved in their searching; they do joins and RI across >database clusters within the _application_. I knew eBay was big, but >wow... Well then: forget the Oracle count(*) argument :-( Rainer
On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote:
> "Trevor Talbot" wrote:
>
> >On 8/16/07, Rainer Bauer <usenet@munnin.com> wrote:
> >
> >> My point is that whatever search criteria are involved and however many items are found, eBay always returns the *accurate* number of items found.
> >>
> >> Before this drifts off:
> >> * I do know *why* count(*) is slow using Postgres.
> >> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
> >> * I realize that pagination for multi-million tuple results does not make sense.
> >
> >You got me curious, so I went hunting for more hints on what eBay
> >actually does, and found these slides from a presentation given by two
> >eBay engineers last year:
> >http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf
>
> Quite interesting.
>
> >It's, er, a whole different ballgame there. Database behavior is
> >barely involved in their searching; they do joins and RI across
> >database clusters within the _application_. I knew eBay was big, but
> >wow...
>
> Well then: forget the Oracle count(*) argument :-(

FYI, I went to the eBay page you posted, which listed something like 98011 pages, and asked for page 96000. It searched for about a minute and timed out with the error message:

There was a problem executing your request. Please try again.

Tried it again, twice, about 5 minutes apart, and got the same error each time.

So I'm guessing that eBay is better at making you THINK it has the exact count than actually having the exact count.
"Scott Marlowe" wrote: >FYI, I went to the ebay page you posted, which listed something like >98011 pages, and asked for page 96000. It searched for about a minute >and timed out with the error message > >There was a problem executing your request. Please try again. > >Tried it again, twice, about 5 minutes apart, and got the same error each time. > >So I'm guessing that ebay is better at making your THINK it has the >exact count than actually having the exact count. Well that depends on the current traffic on eBay. It worked allright this afternoon (GMT), but I get the same error message now (btw, a few years ago you only got an error message if a search retruned more than 1M items). Try it with this search: <http://search.ebay.com/new> which gives about 2.2M items. Rainer