Thread: browsing table with 2 million records
I am running PostgreSQL 7.4 on FreeBSD. The main table has 2 million records (we would like to handle at least 10 million or more). It is mainly a FIFO structure, with maybe 200,000 new records coming in each day that displace the older records.
We have a GUI that lets the user browse through the records page by page, about 25 records at a time. (Don't ask me why, but we have to have this GUI.) This translates to something like:
select count(*) from table <-- to give feedback about the DB size
select * from table order by date limit 25 offset 0
The tables seem properly indexed, and vacuum and analyze are run regularly. Still, these very basic SQL statements take up to a minute to run.
I have read some recent messages saying that select count(*) needs a full table scan in PostgreSQL. That's disappointing, but I can accept an approximation if there is some way to get one. But how can I optimize select * from table order by date limit x offset y? A one-minute response time is not acceptable.
Any help would be appreciated.
Wy
Do you have an index on the date column? Can you post an EXPLAIN ANALYZE for the slow query?

-- Mark Lewis

On Wed, 2005-10-26 at 13:41 -0700, aurora wrote:
> select count(*) from table <-- to give feedback about the DB size
> select * from table order by date limit 25 offset 0
>
> Still, these very basic SQL statements take up to a minute to run.
On Wed, 2005-10-26 at 15:41, aurora wrote:
> select count(*) from table <-- to give feedback about the DB size
> select * from table order by date limit 25 offset 0
>
> Still, these very basic SQL statements take up to a minute to run.

Have you run your script without the select count(*) part and timed it? What does "explain analyze select * from table order by date limit 25 offset 0" say? Is date indexed?
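(Not from the thread, just the diagnostic the two replies above are asking for, written out as a sketch; mytable and date are placeholders standing in for the poster's real table and sort column.)

-- shows the chosen plan plus actual row counts and timings
explain analyze select * from mytable order by date limit 25 offset 0;
-- and, for comparison, the count query on its own
explain analyze select count(*) from mytable;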
> select count(*) from table <-- to give feedback about the DB size

Do you have an integer field that is an ID that increments, e.g. serial?

> select * from table order by date limit 25 offset 0

You could use a cursor.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
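(Not part of Joshua's mail, just a sketch of the cursor approach he suggests, assuming the poster's table is called mytable and the indexed column is called date. A cursor keeps its position on the server, so each page is a cheap FETCH instead of an ever-growing OFFSET, at the cost of holding the transaction, and thus the cursor, open between pages.)

begin;
declare page_cur scroll cursor for
    select * from mytable order by date;
fetch 25 from page_cur;           -- page 1
fetch 25 from page_cur;           -- page 2: the cursor remembers where it stopped
move backward 50 in page_cur;     -- step back two pages' worth of rows...
fetch 25 from page_cur;           -- ...so this re-reads page 1 ("previous page")
commit;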
>> select * from table order by date limit 25 offset 0
> Do you have an index on the date column? Can you post an EXPLAIN
> ANALYZE for the slow query?
Wow! Now that I look again, there are actually two date fields. One is indexed and one is not, and the order by was being done on the column without the index. Using the indexed column turns the seq scan into an index scan, and the query performance is totally fine now.
It would still be helpful if select count(*) could perform well.
Thanks!
Wy
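(Editorial aside, not part of the mail above: since the slowdown came from sorting on the un-indexed date column, the other common fix is simply to index that column as well, so either one can drive an index scan. The names below are placeholders for the real table and second date column.)

-- hypothetical: cover the second date column too, then refresh planner statistics
create index mytable_other_date_idx on mytable (other_date);
analyze mytable;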
You could also create your own index, so to speak: a table that simply contains a list of primary keys and an order-value field that you can use as your offset. This can be kept in sync with the master table using triggers pretty easily. 2 million rows is not very much; if you only have an integer pkey and an integer order value, you can join it against the main table:

create table my_index_table (
    primary_key_value int,
    order_val int,
    primary key (primary_key_value)
);

create index my_index_table_order_val_i on my_index_table (order_val);

select *
from main_table a, my_index_table b
where b.order_val >= 25 and b.order_val < 50
  and a.primary_key_value = b.primary_key_value;

If the data updates a lot, then this won't work as well, though, as the index table will require frequent updates to a potentially large number of records (although a small number of pages, so it still won't be horrible).

Alex Turner
NetEconomist

On 10/26/05, Joshua D. Drake <jd@commandprompt.com> wrote:
> Do you have an integer field that is an ID that increments, e.g. serial?
>
> You could use a cursor.
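(Not from Alex's mail, just one way the trigger-based syncing he mentions might look, assuming the main table is mytable with an integer primary key id, and keeping order_val as a plain copy of that key for simplicity; a real version would renumber order_val, or derive it from the date, so that offsets stay gap-free as old rows are deleted. The single-quoted function body is written 7.4-style, since dollar quoting arrived later.)

create or replace function my_index_table_sync() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        insert into my_index_table (primary_key_value, order_val)
        values (NEW.id, NEW.id);
    elsif TG_OP = ''DELETE'' then
        delete from my_index_table where primary_key_value = OLD.id;
    end if;
    return null;  -- return value is ignored for AFTER triggers
end;
' language plpgsql;

create trigger my_index_table_sync_trg
    after insert or delete on mytable
    for each row execute procedure my_index_table_sync();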
aurora <aurora00@gmail.com> writes:
> It would still be helpful if select count(*) could perform well.

If you can settle for an approximate count, pg_class.reltuples might help you.

regards, tom lane
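(A concrete form of Tom's suggestion, as a sketch: reltuples holds whatever the last VACUUM or ANALYZE counted, so it is an estimate rather than an exact figure, which is exactly the trade-off asked about above. The table name is a placeholder.)

-- approximate row count maintained by VACUUM / ANALYZE
select reltuples::bigint as approx_rows
from pg_class
where relname = 'mytable';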
> The main table has 2 million records (we would like to handle at least
> 10 million or more). It is mainly a FIFO structure, with maybe 200,000
> new records coming in each day that displace the older records.

I'm so sorry, but I have to rant XDDD

People who present a list of 100 items, paginated with 10 items per page so that it fits on half an 800x600 screen, should be shot. I can scroll with my mousewheel and use text search in my browser...

People who present a paginated view with 100,000 pages, where you have to apply bisection search by hand to discover that the records starting with "F" are on page 38651, should be forced to use a keyboard with just one key and type in Morse code.

The problem with pagination is that the page number is meaningless and rather useless to the user. It is also meaningless to the database, which means you have to use slow kludges like count() and limit/offset. And as people insert stuff into the table while you browse, when you hit "next page" you will see, at the top, half of what was on the previous page, because it was pushed down by new records. Or you might miss records.

So, rather than using a meaningless "record offset" as a page number, you can use something meaningful, like a date, the first letter of a name, a region, etc.

Of course MySQL, always eager to encourage sucky-sucky practices, provides a neat SQL_CALC_FOUND_ROWS hack which, while not exactly SQL standard compliant, lets you retrieve the number of rows the query would have returned had you not used limit, so you can compute the number of pages and grab one page with only one query. So people use paginators instead of intelligent solutions, like xmlhttp+javascript-enabled autocompletion in forms, etc. And you have to scroll to page 38651 to find the letter "F".

So if you need to paginate on your site: CHEAT !!!!

Who needs a paginated view with 100,000 pages?

- Select min(date) and max(date) from your table
- Present a nifty date selector to choose the records from any day, hour, minute, second
- Show them, with "next day" and "previous day" buttons

- It's more useful to the user (most likely he wants to know what happened on 01/05/2005 rather than view page 2857)
- It's faster (no more limit/offset! just "date BETWEEN a AND b", indexed of course)
- No more new items pushing old ones to the next page while you browse
- You can pretend to your boss it's just like a paginated list
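(A sketch of the "cheat" described above, assuming a table mytable with an indexed date column: the first query feeds the date selector, the second serves one "page" as a day's worth of rows, straight off the index and with no OFFSET at all.)

-- bounds for the date selector
select min(date), max(date) from mytable;

-- one "page" = one day
select *
from mytable
where date >= '2005-05-01' and date < '2005-05-02'
order by date;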
> select count(*) from table <-- to give feedback about the DB size
> select * from table order by date limit 25 offset 0

Heh, sounds like phpPgAdmin... I really should do something about that.

> The tables seem properly indexed, and vacuum and analyze are run regularly.
> Still, these very basic SQL statements take up to a minute to run.

Yes, COUNT(*) on a large table is always slow in PostgreSQL. Search the mailing lists for countless discussions about it.

Chris
> Who needs a paginated view with 100,000 pages?
>
> - Select min(date) and max(date) from your table
> - Present a nifty date selector to choose the records from any day,
>   hour, minute, second
> - Show them, with "next day" and "previous day" buttons
>
> - It's more useful to the user (most likely he wants to know what
>   happened on 01/05/2005 rather than view page 2857)
> - It's faster (no more limit/offset! just "date BETWEEN a AND b",
>   indexed of course)
> - No more new items pushing old ones to the next page while you browse
> - You can pretend to your boss it's just like a paginated list

All very well and good, but now do it generically...
Christopher wrote:
> > - Present a nifty date selector to choose the records from any day,
> >   hour, minute, second
> > - Show them, with "next day" and "previous day" buttons
>
> All very well and good, but now do it generically...

I've done it...

First of all, I totally agree with PFC's rant regarding absolute positioning while browsing datasets. Among other things, it has serious problems if you have multiple users updating your table. Also, it's kind of silly to be doing this in a set-based data paradigm.

The 'SQL' way to browse a dataset is by key. If your key has multiple parts, or you are trying to sort on two or more fields, you are supposed to use the row constructor:

select * from t where (x, y) > (xc, yc) order by x, y;

Unfortunately, this gives the wrong answer in PostgreSQL :(. The alternative is to use Boolean logic. Here is a log snippet from my ISAM driver (in ISAM, you are *always* browsing datasets):

prepare system_read_next_menu_item_favorite_file_0 (character varying, int4, int4, int4) as
    select * from system.menu_item_favorite_file
    where mif_user_id >= $1
      and (mif_user_id > $1 or mif_menu_item_id >= $2)
      and (mif_user_id > $1 or mif_menu_item_id > $2 or mif_sequence_no > $3)
    order by mif_user_id, mif_menu_item_id, mif_sequence_no
    limit $4;

This is a Boolean-based 'get next record' on a 3-part key plus a parameterized limit. You can do this without using prepared statements, of course, but with the prepared version you can at least do:

execute system_read_next_menu_item_favorite_file_0('abc', 1, 2, 1);

Merlin
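(Applied to the original poster's table, the same keyset idea is simpler because the key is a single column; this sketch assumes mytable with an indexed date column and passes in the last date seen on the previous page. If dates can repeat, a unique tie-breaker column should be added to both the WHERE clause and the ORDER BY, as in Merlin's multi-part example.)

-- first page
select * from mytable order by date limit 25;

-- next page: resume after the last value already shown
select * from mytable
where date > '2005-10-26 13:41:00'
order by date
limit 25;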
> I've done it...
>
> First of all, I totally agree with PFC's rant regarding absolute
> positioning while browsing datasets. Among other things, it has serious
> problems if you have multiple users updating your table. Also, it's kind
> of silly to be doing this in a set-based data paradigm.

Recently I was browsing some site and it had this problem: as users kept adding new entries while I was browsing the list page after page, when I hit "next page" I got, at the top of the next page, half of what I had already seen on the previous page. Of course the webmaster had set the "visited links" color the same as "unvisited links", so I couldn't tell and had to use my brain, which was quite upsetting XDDD And bookmarking a page to resume browsing at some later time does not work either, because if I bookmark page 15, then when I come back users have added 10 pages of content and what I bookmarked is now on page 25...

>> All very well and good, but now do it generically...

Hehe. I like ranting... It is not possible to do it in a generic way that works in all cases. For instance, in the forum topic case:

- posts are added at the bottom and not at the top
- the page number is relevant and meaningful

However, in most cases you can use a multipart key and get it right. Suppose, for instance, that you have a base of several million records, organized according to:

- date (like the original poster), or
- country, region, city, customer last name, first name.

You could ask for the first three, but then you'll get 50,000 Smiths in New York and 1 Van Bliezinsky.

Or you could precalculate, once a week, a key interval distribution that creates reasonably sized intervals (for instance, 100 values in each), maybe asking that each interval contain only one city. So you would get:

Country  Region  City  LastName, FirstName
USA      NYC     NY    Smith, ''
USA      NYC     NY    Smith, Albert
USA      NYC     NY    Smith, Bernard
.....
USA      NYC     NY    Smith, William
...
USA      NYC     NY    Von Braun
...

So you'd predetermine your "page breaks" ahead of time and recompute them once in a while. You won't get identically sized pages, but if the statistical distribution of the data plays nice, you should get evenly sized pages.

The interesting part is that you can present the user with a selector which presents meaningful and useful data, AND is fast to compute, AND is fast to use. In this case, it would amount to "select country, region, city", then display a list like this:

Smith, ...Albert
Smith, Albus...Bernard
...
Smith, William...
...
Von Braun...Von Schwarts
...

So Jeannette Smith would be easy to find, being in the link "Smith, Jean...John" for instance.

If the aim is to quickly locate a particular record, I like javascript-powered autocompletion better; but for browsing, this pagination method is cool.

Regards!
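(A purely illustrative sketch of the precomputed "page break" idea: it keeps the first key of every 100th row in a small side table that is rebuilt periodically. It uses row_number(), which needs a far newer PostgreSQL than the 7.4 in this thread; on 7.4 the same table could be built by a script in one sequential pass. Table and column names are invented for the example.)

-- rebuilt once a week: one boundary key per 100 rows
create table page_breaks as
select country, region, city, last_name, first_name, rn
from (
    select country, region, city, last_name, first_name,
           row_number() over (order by country, region, city, last_name, first_name) as rn
    from customers
) s
where (rn - 1) % 100 = 0;

The selector is then just "select ... from page_breaks order by rn", and fetching one page means selecting the rows between two consecutive boundary keys, using the same multi-part key comparison shown in the previous message.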