Thread: browsing table with 2 million records

browsing table with 2 million records

From
aurora
Date:
I am running PostgreSQL 7.4 on FreeBSD. The main table has 2 million records (we would like to go to at least 10 million or more). It is mainly a FIFO structure, with maybe 200,000 new records coming in each day that displace the older records.

We have a GUI that lets users browse through the records page by page, about 25 records at a time. (Don't ask me why, but we have to have this GUI.) This translates to something like

  select count(*) from table   <-- to give feedback about the DB size
  select * from table order by date limit 25 offset 0

The tables seem properly indexed, and vacuum and analyze are run regularly. Still, these very basic SQL statements take up to a minute to run.

I have read some recent messages saying that select count(*) needs a table scan in PostgreSQL. That's disappointing, but I can accept an approximation if there is some way to get one. How can I optimize select * from table order by date limit x offset y, though? A one-minute response time is not acceptable.

Any help would be appreciated.

Wy


Re: browsing table with 2 million records

From
Mark Lewis
Date:
Do you have an index on the date column?  Can you post an EXPLAIN
ANALYZE for the slow query?

-- Mark Lewis

On Wed, 2005-10-26 at 13:41 -0700, aurora wrote:
> I am running PostgreSQL 7.4 on FreeBSD. The main table has 2 million
> records (we would like to go to at least 10 million or more). It is
> mainly a FIFO structure, with maybe 200,000 new records coming in each
> day that displace the older records.
>
> We have a GUI that lets users browse through the records page by page,
> about 25 records at a time. (Don't ask me why, but we have to have this
> GUI.) This translates to something like
>
>   select count(*) from table   <-- to give feedback about the DB size
>   select * from table order by date limit 25 offset 0
>
> The tables seem properly indexed, and vacuum and analyze are run
> regularly. Still, these very basic SQL statements take up to a minute
> to run.
>
> I have read some recent messages saying that select count(*) needs a
> table scan in PostgreSQL. That's disappointing, but I can accept an
> approximation if there is some way to get one. How can I optimize
> select * from table order by date limit x offset y, though? A
> one-minute response time is not acceptable.
>
> Any help would be appreciated.
>
> Wy
>
>


Re: browsing table with 2 million records

From
Scott Marlowe
Date:
On Wed, 2005-10-26 at 15:41, aurora wrote:
> I am running PostgreSQL 7.4 on FreeBSD. The main table has 2 million
> records (we would like to go to at least 10 million or more). It is
> mainly a FIFO structure, with maybe 200,000 new records coming in each
> day that displace the older records.
>
> We have a GUI that lets users browse through the records page by page,
> about 25 records at a time. (Don't ask me why, but we have to have this
> GUI.) This translates to something like
>
>   select count(*) from table   <-- to give feedback about the DB size
>   select * from table order by date limit 25 offset 0
>
> The tables seem properly indexed, and vacuum and analyze are run
> regularly. Still, these very basic SQL statements take up to a minute
> to run.
>
> I have read some recent messages saying that select count(*) needs a
> table scan in PostgreSQL. That's disappointing, but I can accept an
> approximation if there is some way to get one. How can I optimize
> select * from table order by date limit x offset y, though? A
> one-minute response time is not acceptable.

Have you run your script without the select count(*) part and timed it?

What does

explain analyze select * from table order by date limit 25 offset 0

say?

Is date indexed?

Re: browsing table with 2 million records

From
"Joshua D. Drake"
Date:
> We have a GUI that lets users browse through the records page by page,
> about 25 records at a time. (Don't ask me why, but we have to have this
> GUI.) This translates to something like
>
>   select count(*) from table   <-- to give feedback about the DB size

Do you have an integer ID field that increments, e.g. a serial column?

>   select * from table order by date limit 25 offset 0

You could use a cursor.
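
A minimal sketch of the cursor approach, assuming placeholder names "main_table" and "date_col"; without WITH HOLD the cursor only lives inside a transaction:

  begin;

  -- Cursor over the full ordered result; rows are fetched 25 at a time on demand.
  declare page_cur cursor for
      select * from main_table order by date_col;

  fetch forward 25 from page_cur;   -- page 1
  fetch forward 25 from page_cur;   -- page 2

  close page_cur;
  commit;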

Sincerely,

Joshua D. Drake


>
> The tables seem properly indexed, and vacuum and analyze are run
> regularly. Still, these very basic SQL statements take up to a minute
> to run.
>
> I have read some recent messages saying that select count(*) needs a
> table scan in PostgreSQL. That's disappointing, but I can accept an
> approximation if there is some way to get one. How can I optimize
> select * from table order by date limit x offset y, though? A
> one-minute response time is not acceptable.
>
> Any help would be appreciated.
>
> Wy
>
>
--
The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/


Re: browsing table with 2 million records

From
aurora
Date:
>>   select * from table order by date limit 25 offset 0

> Do you have an index on the date column?  Can you post an EXPLAIN
> ANALYZE for the slow query?

Wow! Now that I look again, there are actually two date fields. One is indexed and one is not, and the order by was being done on the column without the index. Using the indexed column turns the seq scan into an index scan, and query performance is totally fine now.
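
If the GUI ever does need to sort by the second, unindexed date column, the direct fix would be to index it as well; a sketch, where "main_table" and "date2" are assumed names:

  -- Hypothetical names: "main_table" and "date2" stand in for the real ones.
  create index main_table_date2_idx on main_table (date2);
  analyze main_table;   -- refresh statistics so the planner sees the new index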

It would still be helpful if select count(*) could perform well.

Thanks!

Wy

Re: browsing table with 2 million records

From
Alex Turner
Date:
You could also create your own index, so to speak: a table that simply
contains a list of primary keys and an order-value field that you can use
as your offset.  This can be kept in sync with the master table using
triggers pretty easily.  2 million rows is not very much if you only have
an integer pkey and an integer order value, and you can then join it
against the main table.

create table my_index_table (
    primary_key_value int,
    order_val int,
    primary key (primary_key_value));

create index my_index_table_order_val_i on my_index_table (order_val);

select * from main_table a, my_index_table b where b.order_val >= 25 and
b.order_val < 50 and a.primary_key_id = b.primary_key_value;
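
A rough sketch of the trigger sync Alex mentions, covering inserts only; the function name, the main table name "main_table", and its pkey column "id" are assumptions, and delete/renumbering handling is omitted:

  -- Append each new main_table row to the end of the browse order.
  -- Requires the plpgsql language to be installed in the database.
  create or replace function my_index_table_sync() returns trigger as '
  begin
      insert into my_index_table (primary_key_value, order_val)
      values (new.id, (select coalesce(max(order_val), 0) + 1 from my_index_table));
      return new;
  end;
  ' language plpgsql;

  create trigger my_index_table_sync_trg
      after insert on main_table
      for each row execute procedure my_index_table_sync();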

If the data updates a lot, though, this won't work as well, since the
index table will require frequent updates to a potentially large number
of records (although only a small number of pages, so it still won't be
horrible).

Alex Turner
NetEconomist

On 10/26/05, Joshua D. Drake <jd@commandprompt.com> wrote:
>
> > We have a GUI that lets users browse through the records page by page,
> > about 25 records at a time. (Don't ask me why, but we have to have this
> > GUI.) This translates to something like
> >
> >   select count(*) from table   <-- to give feedback about the DB size
>
> Do you have an integer ID field that increments, e.g. a serial column?
>
> >   select * from table order by date limit 25 offset 0
>
> You could use a cursor.
>
> Sincerely,
>
> Joshua D. Drake
>
>
> >
> > The tables seem properly indexed, and vacuum and analyze are run
> > regularly. Still, these very basic SQL statements take up to a minute
> > to run.
> >
> > I have read some recent messages saying that select count(*) needs a
> > table scan in PostgreSQL. That's disappointing, but I can accept an
> > approximation if there is some way to get one. How can I optimize
> > select * from table order by date limit x offset y, though? A
> > one-minute response time is not acceptable.
> >
> > Any help would be appreciated.
> >
> > Wy
> >
> >
> --
> The PostgreSQL Company - Command Prompt, Inc. 1.503.667.4564
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
> Managed Services, Shared and Dedicated Hosting
> Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/
>
>

Re: browsing table with 2 million records

From
Tom Lane
Date:
aurora <aurora00@gmail.com> writes:
> It would still be helpful if select count(*) could perform well.

If you can settle for an approximate count, pg_class.reltuples might
help you.
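
A minimal sketch of the approximate count Tom suggests; reltuples is only as fresh as the last vacuum/analyze, and "main_table" is a placeholder table name:

  -- Planner's row-count estimate, maintained by vacuum/analyze; cheap to read.
  select reltuples::bigint as approx_rows
  from pg_class
  where relname = 'main_table';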

            regards, tom lane

Re: browsing table with 2 million records

From
PFC
Date:
> I am running PostgreSQL 7.4 on FreeBSD. The main table has 2 million
> records (we would like to go to at least 10 million or more). It is
> mainly a FIFO structure, with maybe 200,000 new records coming in each
> day that displace the older records.

    I'm so sorry, but I have to rant XDDD

    People who present a list of 100 items, paginated with 10 items per page
so that it fits on half an 800x600 screen, should be shot.
    I can scroll with my mousewheel and use text search in my browser...

    People who present a paginated view with 100,000 pages, where you have to
apply bisection search by hand to find that records starting with "F" are on
page 38651, should be forced to use a keyboard with just one key and type in
Morse code.

    The problem with pagination is that the page number is meaningless and
rather useless to the user. It is also meaningless to the database, which
means you have to use slow kludges like count() and limit/offset. And as
people insert stuff into the table while you browse, when you hit next page
you will see, at the top, half of what was on the previous page, because it
was pushed down by new records. Or you might miss records.

    So, rather than using a meaningless "record offset" as a page number, you
can use something meaningful, like a date, the first letter of a name, a
region, etc.

    Of course MySQL, always eager to encourage sucky-sucky practices,
provides a neat SQL_CALC_FOUND_ROWS hack which, while not exactly SQL
standard compliant, lets you retrieve the number of rows the query would
have returned if you hadn't used LIMIT, so you can compute the number of
pages and grab one page with only one query.

    So people use paginators instead of intelligent solutions, like
xmlhttp+javascript-enabled autocompletion in forms, etc. And you have to
scroll to page 38651 to find the letter "F".

    So if you need to paginate on your site:

    CHEAT!!!!

    Who needs a paginated view with 100,000 pages?

    - Select min(date) and max(date) from your table
    - Present a nifty date selector to choose the records from any day, hour,
minute, second
    - Show them, with "next day" and "previous day" buttons

    - It's more useful to the user (most likely he wants to know what
happened on 01/05/2005 rather than view page 2857)
    - It's faster (no more limit/offset! just "date BETWEEN a AND b",
indexed of course)
    - No more new items pushing old ones to the next page while you browse
    - You can pretend to your boss it's just like a paginated list
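
A sketch of the queries this "browse by day" approach boils down to; "main_table" and "date_col" are placeholder names:

  -- Range for the date selector.
  select min(date_col), max(date_col) from main_table;

  -- One day's worth of records; with an index on date_col this is a range
  -- scan, and no limit/offset is needed.
  select *
  from main_table
  where date_col >= '2005-05-01'
    and date_col <  '2005-05-02'
  order by date_col;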

Re: browsing table with 2 million records

From
Christopher Kings-Lynne
Date:
> We have a GUI that lets users browse through the records page by page,
> about 25 records at a time. (Don't ask me why, but we have to have this
> GUI.) This translates to something like
>
>   select count(*) from table   <-- to give feedback about the DB size
>   select * from table order by date limit 25 offset 0

Heh, sounds like phpPgAdmin...I really should do something about that.

> The tables seem properly indexed, and vacuum and analyze are run
> regularly. Still, these very basic SQL statements take up to a minute to run.

Yes, COUNT(*) on a large table is always slow in PostgreSQL.  Search the
mailing lists for countless discussions about it.

Chris


Re: browsing table with 2 million records

From
Christopher Kings-Lynne
Date:
>     Who needs a paginated view with 100,000 pages?
>
>     - Select min(date) and max(date) from your table
>     - Present a nifty date selector to choose the records from any day,
> hour, minute, second
>     - Show them, with "next day" and "previous day" buttons
>
>     - It's more useful to the user (most likely he wants to know what
> happened on 01/05/2005 rather than view page 2857)
>     - It's faster (no more limit/offset! just "date BETWEEN a AND b",
> indexed of course)
>     - No more new items pushing old ones to the next page while you browse
>     - You can pretend to your boss it's just like a paginated list

All very well and good, but now do it generically...


Re: browsing table with 2 million records

From
"Merlin Moncure"
Date:
Christopher
> >     - Present a nifty date selector to choose the records from any day,
> > hour, minute, second
> >     - Show them, with "next day" and "previous day" buttons
> >
> >     - It's more useful to the user (most likely he wants to know what
> > happened on 01/05/2005 rather than view page 2857)
> >     - It's faster (no more limit/offset! just "date BETWEEN a AND b",
> > indexed of course)
> >     - No more new items pushing old ones to the next page while you browse
> >     - You can pretend to your boss it's just like a paginated list
>
> All very well and good, but now do it generically...

I've done it...
First of all, I totally agree with PFC's rant regarding absolute
positioning while browsing datasets.  Among other things, it has serious
problems if you have multiple users updating your table.  Also, it's kind
of silly to be doing this in a set-based data paradigm.

The 'SQL' way to browse a dataset is by key.  If your key has multiple
parts or you are trying to sort on two or more fields, you are supposed
to use the row constructor:

select * from t where (x, y) > (xc, yc) order by x, y;

Unfortunately, this gives the wrong answer in PostgreSQL :(.

The alternative is to use Boolean logic.  Here is a log snippet from my
ISAM driver (in ISAM, you are *always* browsing datasets):

prepare system_read_next_menu_item_favorite_file_0 (character varying,
int4, int4, int4)
    as select * from system.menu_item_favorite_file
    where mif_user_id >= $1 and
        (mif_user_id >  $1 or  mif_menu_item_id >= $2) and
        (mif_user_id >  $1 or  mif_menu_item_id >  $2 or
         mif_sequence_no >  $3)
    order by mif_user_id, mif_menu_item_id, mif_sequence_no
    limit $4

This is a Boolean-based 'get next record' on a three-part key plus a
parameterized limit.  You can do this without prepared statements, of
course, but with the prepared version you can at least do

execute system_read_next_menu_item_favorite_file_0('abc', 1, 2, 1);
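
For the two-column case from the row-constructor example above, the same Boolean rewrite would look like this (a sketch; t, x, y, xc and yc are just the placeholder names used earlier):

  -- Rows strictly after the last-seen key (xc, yc) in (x, y) order,
  -- i.e. the Boolean expansion of (x, y) > (xc, yc).
  select *
  from t
  where x >= xc
    and (x > xc or y > yc)
  order by x, y
  limit 25;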

Merlin


Re: browsing table with 2 million records

From
PFC
Date:

> I've done it...
> First of all, I totally agree with PFC's rant regarding absolute
> positioning while browsing datasets.  Among other things, it has serious
> problems if you have multiple users updating your table.  Also, it's kind
> of silly to be doing this in a set-based data paradigm.

    Recently I was browsing some site that had this problem: as users kept
adding new entries while I was browsing the list page after page, when I hit
"next page" I got, on the next page, half of what I had already seen on the
previous page. Of course the webmaster had set the "visited links" color the
same as "unvisited links", so I couldn't tell, and had to use my brain, which
was quite upsetting XDDD

    And bookmarking a page to resume browsing at some later time does not
work either, because if I bookmark page 15, then when I come back, users
have added 10 pages of content and what I bookmarked is now on page 25...

>> All very well and good, but now do it generically...

    Hehe. I like ranting...
    It is not possible to do it in a generic way that works in all cases. For
instance:

    Forum topic case:
    - posts are added at the bottom and not at the top
    - page number is relevant and meaningful

    However, in most cases, you can use a multipart key and get it right.
    Suppose that, for instance, you have a base of several million records,
organized according to:

    - date (like the original poster)
    or:
    - country, region, city, customer last name, first name.

    You could ask for the first three, but then you'll get 50000 Smiths in
New York and 1 Van Bliezinsky.

    Or you could precalculate, once a week, a key interval distribution that
creates reasonably sized intervals (for instance, 100 values in each), maybe
requiring that each interval contain only one city. So you would get:

    Country  Region  City  LastName    FirstName
    USA      NYC     NY    Smith       ''
    USA      NYC     NY    Smith       Albert
    USA      NYC     NY    Smith       Bernard
    ...
    USA      NYC     NY    Smith       William
    ...
    USA      NYC     NY    Von Braun
    ...

    So you'd predetermine your "page breaks" ahead of time, and recompute
them once in a while. You won't get identically sized pages, but if the
statistical distribution of the data plays nice, you should get reasonably
even page sizes.

    The interesting part is that you can present the user with a selector
that shows meaningful and useful data, AND is fast to compute, AND is
fast to use.
    In this case, it would amount to "Select country, region, city", then
display a list like this:
    Smith, ...Albert
    Smith, Albus...Bernard
    ...
    Smith, William...
    ...
    Von Braun...Von Schwarts
    ...

    So Jeannette Smith would be easy to find, being in the link "Smith,
Jean...John", for instance.

    If the aim is to quickly locate a particular record, I like
javascript-powered autocompletion better; but for browsing, this
pagination method is cool.

    Regards!