Thread: INDEX suggestion needed

INDEX suggestion needed

From

Thomas Beutin

Date:

11 December 2002, 13:45:04

Hi,

i need some help or suggestions for performance increasing on my queries.
My version: PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.3

My table is:

CREATE TABLE "stat_pages" (
    "visit" timestamp with time zone,
    "script_id" integer,
    "a_id" character(30),
    "p_id" character(30),
    "m_id" smallint,
    "s_id" smallint,
    "session_id" character(50),
    "action" character(20)
);
This table contains 343554 rows and i have the following index:
CREATE INDEX "stat_pages_m_id_idx" on "stat_pages" using btree ( "m_id" "int2_ops" );

i cannot create an index like this:
CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
The error is about the »iscachable« tag of the index function.

The table is vacuumed full analyzed.

My typical queries are like that:
SELECT count(a_id) AS count
  FROM (
    SELECT DISTINCT a_id FROM stat_pages
      WHERE m_id = '35'
        AND visit >= '2002-09-01'
        AND visit <= '2002-09-30'
  ) AS foo;

The explain shows only sequence scans on stat_pages:
EXPLAIN SELECT count(a_id) FROM ( SELECT DISTINCT a_id FROM stat_pages  WHERE m_id = '35' AND visit >= '2002-09-01' AND
visit<= '2002-09-30' ) AS foo; 

Aggregate  (cost=41479.21..41479.21 rows=1 width=34)
  ->  Subquery Scan foo  (cost=40947.80..41430.90 rows=19324 width=34)
        ->  Unique  (cost=40947.80..41430.90 rows=19324 width=34)
              ->  Sort  (cost=40947.80..40947.80 rows=193241 width=34)
                    ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=193241 width=34

How can i improve the speed? What kind of index could be usefull in this case?
How can query this table to get the result fast?

Any help, documentation pointers or suggestions welcome!

Greetings,
-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Jeff Eckermann

Date:

11 December 2002, 13:49:13

--- Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> wrote:
> My typical queries are like that:
> SELECT count(a_id) AS count
>   FROM (
>     SELECT DISTINCT a_id FROM stat_pages
>       WHERE m_id = '35'
>         AND visit >= '2002-09-01'
>         AND visit <= '2002-09-30'
>   ) AS foo;
>

You can do SELECT count(DISTINCT a_id) FROM ...

__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com

Re: INDEX suggestion needed

From

Frank Bax

Date:

11 December 2002, 14:00:36

At 01:43 PM 12/11/02, Thomas Beutin wrote:

>Hi,
>
>i need some help or suggestions for performance increasing on my queries.
>My version: PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.3
>
>My table is:
>
>CREATE TABLE "stat_pages" (
>         "visit" timestamp with time zone,
>         "script_id" integer,
>         "a_id" character(30),
>         "p_id" character(30),
>         "m_id" smallint,
>         "s_id" smallint,
>         "session_id" character(50),
>         "action" character(20)
>);
>This table contains 343554 rows and i have the following index:
>CREATE INDEX "stat_pages_m_id_idx" on "stat_pages" using btree ( "m_id"
>"int2_ops" );
>
>i cannot create an index like this:
>CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
>The error is about the »iscachable« tag of the index function.
>
>The table is vacuumed full analyzed.
>
>My typical queries are like that:
>SELECT count(a_id) AS count
>   FROM (
>     SELECT DISTINCT a_id FROM stat_pages
>       WHERE m_id = '35'
>         AND visit >= '2002-09-01'
>         AND visit <= '2002-09-30'
>   ) AS foo;


Does this trigger use of index?
         visit >= '2002-09-01'::timestamp AND visit <= '2002-09-30'::timestamp

Recovery Mode

From

"Samuel J. Sutjiono"

Date:

11 December 2002, 14:08:13

Hi,

My database crashed and I received the following message.....   psql: FATAL
1:  The database system is in recovery mode
Can anyone advise me on what I need to do to restore the database ? and also
tell me what might have caused the crash ?

I appreciate any help.

Regards,
Sam

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

11 December 2002, 14:15:19

On Wed, Dec 11, 2002 at 01:58:59PM -0500, Frank Bax wrote:
> At 01:43 PM 12/11/02, Thomas Beutin wrote:
>
> >Hi,
> >
> >i need some help or suggestions for performance increasing on my queries.
> >My version: PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.3
> >
> >My table is:
> >
> >CREATE TABLE "stat_pages" (
> >         "visit" timestamp with time zone,
> >         "script_id" integer,
> >         "a_id" character(30),
> >         "p_id" character(30),
> >         "m_id" smallint,
> >         "s_id" smallint,
> >         "session_id" character(50),
> >         "action" character(20)
> >);
> >This table contains 343554 rows and i have the following index:
> >CREATE INDEX "stat_pages_m_id_idx" on "stat_pages" using btree ( "m_id"
> >"int2_ops" );
> >
> >i cannot create an index like this:
> >CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
> >The error is about the »iscachable« tag of the index function.
> >
> >The table is vacuumed full analyzed.
> >
> >My typical queries are like that:
> >SELECT count(a_id) AS count
> >   FROM (
> >     SELECT DISTINCT a_id FROM stat_pages
> >       WHERE m_id = '35'
> >         AND visit >= '2002-09-01'
> >         AND visit <= '2002-09-30'
> >   ) AS foo;
>
>
> Does this trigger use of index?
>          visit >= '2002-09-01'::timestamp AND visit <= '2002-09-30'::timestamp

no, does not :-( but there is no index on visit.

-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

11 December 2002, 14:20:42

On Wed, Dec 11, 2002 at 10:49:11AM -0800, Jeff Eckermann wrote:
> --- Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> wrote:
> > My typical queries are like that:
> > SELECT count(a_id) AS count
> >   FROM (
> >     SELECT DISTINCT a_id FROM stat_pages
> >       WHERE m_id = '35'
> >         AND visit >= '2002-09-01'
> >         AND visit <= '2002-09-30'
> >   ) AS foo;
> >
>
> You can do SELECT count(DISTINCT a_id) FROM ...
It looks little bit better, but the inner select is a sequence scan as well :-(
Explain:
EXPLAIN SELECT count(DISTINCT a_id) FROM stat_pages  WHERE m_id = '35' AND visit >= '2002-09-01' AND visit <=
'2002-09-30';
NOTICE:  QUERY PLAN:

Aggregate  (cost=14304.30..14304.30 rows=1 width=34)
  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=193241 width=34)

Thanks,
-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Frank Bax

Date:

11 December 2002, 14:37:35

At 02:09 PM 12/11/02, Thomas Beutin wrote:

>On Wed, Dec 11, 2002 at 01:58:59PM -0500, Frank Bax wrote:
> > At 01:43 PM 12/11/02, Thomas Beutin wrote:
> >
> > >Hi,
> > >
> > >i need some help or suggestions for performance increasing on my queries.
> > >My version: PostgreSQL 7.2.1 on i686-pc-linux-gnu, compiled by GCC 2.95.3
> > >
> > >My table is:
> > >
> > >CREATE TABLE "stat_pages" (
> > >         "visit" timestamp with time zone,
> > >         "script_id" integer,
> > >         "a_id" character(30),
> > >         "p_id" character(30),
> > >         "m_id" smallint,
> > >         "s_id" smallint,
> > >         "session_id" character(50),
> > >         "action" character(20)
> > >);
> > >This table contains 343554 rows and i have the following index:
> > >CREATE INDEX "stat_pages_m_id_idx" on "stat_pages" using btree ( "m_id"
> > >"int2_ops" );
> > >
> > >i cannot create an index like this:
> > >CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
> > >The error is about the »iscachable« tag of the index function.
> > >
> > >The table is vacuumed full analyzed.
> > >
> > >My typical queries are like that:
> > >SELECT count(a_id) AS count
> > >   FROM (
> > >     SELECT DISTINCT a_id FROM stat_pages
> > >       WHERE m_id = '35'
> > >         AND visit >= '2002-09-01'
> > >         AND visit <= '2002-09-30'
> > >   ) AS foo;
> >
> >
> > Does this trigger use of index?
> >          visit >= '2002-09-01'::timestamp AND visit <=
> '2002-09-30'::timestamp
>
>no, does not :-( but there is no index on visit.


So add an index on "visit" - but forget trying to make it a 'date' index.
         CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( "visit" );
Instead of trying to make the index match the constants in your query, just
cast the constants in your query match the index.

On second glance, I've noticed something else.  I seem to remember seeing
question like this one before... and the suggested fix... create an index
on two fields (m_id, visit) and rewrite the query:

WHERE m_id >= '35' AND visit >= timestamp('2002-09-01') AND m_id <= '35'
AND visit <= timestamp('2002-09-30')

Frank

Re: Recovery Mode

From

Justin Clift

Date:

11 December 2002, 14:58:49

Samuel J. Sutjiono wrote:
> Hi,
>
> My database crashed and I received the following message.....   psql: FATAL
> 1:  The database system is in recovery mode
> Can anyone advise me on what I need to do to restore the database ?

Hi Samuel,

It means the database is repairing itself before going live.  If everything comes back together fine, then you
shouldn't 
have lost anything.

If there are further problems, it will let you know and will require you to fix them before it will start.

> and also tell me what might have caused the crash ?

Hopefully there is info about it in the log file.

You'd probably want to look there first and see if there is anything obvious.

Any ideas of what was accessing the database at the time of the crash?

:-)

Regards and best wishes,

Justin Clift

> I appreciate any help.
>
> Regards,
> Sam
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

Re: Recovery Mode

From

"Samuel J. Sutjiono"

Date:

11 December 2002, 15:16:34

Justin,

I think I found out the cause of the crash.  When I restarted the db, I
specified invalid data directory as indicated in the logfile.

The database has been in recovery mode for about three hours.  Why is it
doing that ?  Is is safe to restart the database ?

Thanks for your help.

Regards,
Sam
----- Original Message -----
From: "Justin Clift" <justin@postgresql.org>
To: "Samuel J. Sutjiono" <ssutjiono@wc-group.com>
Cc: <pgsql-general@postgresql.org>
Sent: Wednesday, December 11, 2002 2:57 PM
Subject: Re: [GENERAL] Recovery Mode


> Samuel J. Sutjiono wrote:
> > Hi,
> >
> > My database crashed and I received the following message.....   psql:
FATAL
> > 1:  The database system is in recovery mode
> > Can anyone advise me on what I need to do to restore the database ?
>
> Hi Samuel,
>
> It means the database is repairing itself before going live.  If
everything comes back together fine, then you shouldn't
> have lost anything.
>
> If there are further problems, it will let you know and will require you
to fix them before it will start.
>
>
> > and also tell me what might have caused the crash ?
>
> Hopefully there is info about it in the log file.
>
> You'd probably want to look there first and see if there is anything
obvious.
>
> Any ideas of what was accessing the database at the time of the crash?
>
> :-)
>
> Regards and best wishes,
>
> Justin Clift
>
>
> > I appreciate any help.
> >
> > Regards,
> > Sam
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: you can get off all lists at once with the unregister command
> >     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>
>
> --
> "My grandfather once told me that there are two kinds of people: those
> who work and those who take the credit. He told me to try to be in the
> first group; there was less competition there."
> - Indira Gandhi
>
>
>

Re: INDEX suggestion needed

From

Tom Lane

Date:

12 December 2002, 00:57:06

Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> writes:
> i cannot create an index like this:
> CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );

You are overthinking the problem.  A plain index on "visit" would work
fine for the example query you gave.

            regards, tom lane

Re: Recovery Mode

From

Tom Lane

Date:

12 December 2002, 01:02:17

"Samuel J. Sutjiono" <ssutjiono@wc-group.com> writes:
> The database has been in recovery mode for about three hours.

In that case you probably have a problem :-(.  Ordinarily I'd expect the
recovery process runtime to not exceed the checkpoint interval, which is
5 minutes unless you changed it.  It sounds like the thing may be in
some kind of infinite loop.  What is showing up in the postmaster log?

            regards, tom lane

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

12 December 2002, 09:36:31

On Thu, Dec 12, 2002 at 12:56:38AM -0500, Tom Lane wrote:
> Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> writes:
> > i cannot create an index like this:
> > CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
>
> You are overthinking the problem.  A plain index on "visit" would work
> fine for the example query you gave.
Unfortunally it doesn't work, i'm still running in sequence
scans all the time:
tb=# CREATE INDEX tb5 ON stat_pages (visit);
CREATE
tb=# VACUUM FULL ANALYZE stat_pages;
VACUUM
tb=# EXPLAIN select count(distinct a_id) from stat_pages WHERE m_id = 35 AND ( visit >= '2002-12-01' OR visit <=
'2002-12-11');
NOTICE:  QUERY PLAN:

Aggregate  (cost=14679.99..14679.99 rows=1 width=34)
  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=343520 width=34)

Is there any hope left? ;-)

-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Tom Lane

Date:

12 December 2002, 10:25:50

Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> writes:
> tb=# EXPLAIN select count(distinct a_id) from stat_pages WHERE m_id = 35 AND ( visit >= '2002-12-01' OR visit <=
'2002-12-11');

Do you really mean OR there, and not AND?  As given, the query looks
mighty unselective to me --- I'd think it should *not* use an index.
The planner evidently thinks so too.  (How accurate is the rowcount
estimate, anyway?)

            regards, tom lane

Re: INDEX suggestion needed

From

Jean-Luc Lachance

Date:

12 December 2002, 10:32:43

Thomas,

Are you sure about ( visit >= '2002-12-01' OR visit <= '2002-12-11')?
Shouldn't it be AND?
If visit is sorted that mean all record as in if a > 10 or a < 20...
Or maybe you comparator are wrong.

JLL

Thomas Beutin wrote:
>
> On Thu, Dec 12, 2002 at 12:56:38AM -0500, Tom Lane wrote:
> > Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> writes:
> > > i cannot create an index like this:
> > > CREATE INDEX "stat_pages_datum_idx" on "stat_pages" ( date ("visit") );
> >
> > You are overthinking the problem.  A plain index on "visit" would work
> > fine for the example query you gave.
> Unfortunally it doesn't work, i'm still running in sequence
> scans all the time:
> tb=# CREATE INDEX tb5 ON stat_pages (visit);
> CREATE
> tb=# VACUUM FULL ANALYZE stat_pages;
> VACUUM
> tb=# EXPLAIN select count(distinct a_id) from stat_pages WHERE m_id = 35 AND ( visit >= '2002-12-01' OR visit <=
'2002-12-11');
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=14679.99..14679.99 rows=1 width=34)
>   ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=343520 width=34)
>
> Is there any hope left? ;-)
>
> -tb
> --
> Thomas Beutin                             tb@laokoon.IN-Berlin.DE
> Beam me up, Scotty. There is no intelligent live down in Redmond.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

Re: INDEX suggestion needed

From

Manfred Koizar

Date:

12 December 2002, 10:50:51

On Thu, 12 Dec 2002 15:33:11 +0100, Thomas Beutin
<tyrone@laokoon.IN-Berlin.DE> wrote:
>   AND ( visit >= '2002-12-01' OR visit <= '2002-12-11');
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is logically equivalent to (visit IS NOT NULL), I guess you want
AND, not OR.

>  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=343520 width=34)
>
>Is there any hope left? ;-)

Please show us the output of
    VACUUM VERBOSE ANALYZE stat_pages;

        EXPLAIN ANALYZE
    SELECT COUNT(DISTINCT a_id)
      FROM stat_pages
     WHERE m_id = 35::smallint
       AND (visit >= '2002-12-01' AND visit <= '2002-12-11');

    SELECT attname, null_frac, avg_width, n_distinct,
           most_common_vals, histogram_bounds, correlation
      FROM pg_stats
     WHERE tablename = 'stat_pages'
       AND attname IN ('m_id', 'visit');

Servus
 Manfred

Re: Recovery Mode

From

Bruce Momjian

Date:

12 December 2002, 13:29:47

Tom Lane wrote:
> "Samuel J. Sutjiono" <ssutjiono@wc-group.com> writes:
> > The database has been in recovery mode for about three hours.
>
> In that case you probably have a problem :-(.  Ordinarily I'd expect the
> recovery process runtime to not exceed the checkpoint interval, which is
> 5 minutes unless you changed it.  It sounds like the thing may be in
> some kind of infinite loop.  What is showing up in the postmaster log?

Also, can you try attaching to the running process and give us a
backtrace of that (bt)?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

12 December 2002, 14:15:07

On Thu, Dec 12, 2002 at 04:50:48PM +0100, Manfred Koizar wrote:
> On Thu, 12 Dec 2002 15:33:11 +0100, Thomas Beutin
> <tyrone@laokoon.IN-Berlin.DE> wrote:
> >   AND ( visit >= '2002-12-01' OR visit <= '2002-12-11');
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This is logically equivalent to (visit IS NOT NULL), I guess you want
> AND, not OR.
Yes, You are right, this is not my expected result =:-/
This must be an "AND".

> Please show us the output of
>     VACUUM VERBOSE ANALYZE stat_pages;
tb=# VACUUM VERBOSE ANALYZE stat_pages;
NOTICE:  --Relation stat_pages--
NOTICE:  Pages 7809: Changed 0, Empty 0; Tup 343554: Vac 0, Keep 0, UnUsed 0.
    Total CPU 0.66s/0.12u sec elapsed 4.95 sec.
NOTICE:  Analyzing stat_pages
VACUUM

>         EXPLAIN ANALYZE
>     SELECT COUNT(DISTINCT a_id)
>       FROM stat_pages
>      WHERE m_id = 35::smallint
>        AND (visit >= '2002-12-01' AND visit <= '2002-12-11');
tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-12-01'
ANDvisit <= '2002-12-11'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=3.19..3.19 rows=1 width=34) (actual time=52.89..52.89 rows=1 loops=1)
  ->  Index Scan using tb5 on stat_pages  (cost=0.00..3.18 rows=1 width=34) (actual time=52.74..52.74 rows=0 loops=1)
Total runtime: 53.11 msec

EXPLAIN

This looks good, but look at this (first date changed):

tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-06-01'
ANDvisit <= '2002-12-11'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=14679.91..14679.91 rows=1 width=34) (actual time=76204.22..76204.22 rows=1 loops=1)
  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=343486 width=34) (actual time=375.61..5197.26 rows=343554
loops=1)
Total runtime: 76204.40 msec

EXPLAIN

i got the following index/sequence scans by date ranges:
(visit >= '2002-06-01' AND visit <= '2002-06-30')    index scan
(visit >= '2002-06-01' AND visit <= '2002-07-31')    index scan
(visit >= '2002-06-01' AND visit <= '2002-08-31')    sequence scan
(visit >= '2002-07-01' AND visit <= '2002-07-31')    index scan
(visit >= '2002-08-01' AND visit <= '2002-08-31')    sequence scan
(visit >= '2002-09-01' AND visit <= '2002-09-30')    sequence scan
(visit >= '2002-10-01' AND visit <= '2002-10-31')    index scan, but long (>5sec)

And: The date range in my table is from 2002-07-10 11:36:53+02 up to
2002-10-29 23:31:47+01.



>
>     SELECT attname, null_frac, avg_width, n_distinct,
>            most_common_vals, histogram_bounds, correlation
>       FROM pg_stats
>      WHERE tablename = 'stat_pages'
>        AND attname IN ('m_id', 'visit');
tb=# SELECT attname, null_frac, avg_width, n_distinct, most_common_vals, histogram_bounds, correlation FROM pg_stats
WHEREtablename = 'stat_pages' AND attname IN ('m_id', 'visit'); 
 attname | null_frac | avg_width | n_distinct |
                                            most_common_vals
                                                          |
                                                                     histogram_bounds
                                                                                               | correlation  

---------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 visit   |         0 |         8 |  -0.543682 | {"2002-08-21 10:29:10+02","2002-08-21 15:19:22+02","2002-08-29
17:29:41+02","2002-09-0414:10:47+02","2002-09-11 09:45:48+02","2002-09-17 14:06:05+02","2002-09-17
16:24:59+02","2002-09-2015:53:47+02","2002-09-23 13:46:59+02","2002-09-23 22:44:21+02"} | {"2002-07-25
16:37:12+02","2002-08-1512:36:18+02","2002-08-23 12:36:15+02","2002-08-29 17:30:54+02","2002-09-05
12:54:31+02","2002-09-1018:03:54+02","2002-09-16 15:44:56+02","2002-09-20 14:34:40+02","2002-09-24
13:59:29+02","2002-09-2909:09:31+02","2002-10-29 23:25:13+01"} |   -0.972118 
 m_id    |         0 |         2 |          1 | {35}

                                                          |

                                                                                               |           1 
(2 rows)

??? Is this output ok?


Regards,
-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

"Magnus Naeslund(f)"

Date:

12 December 2002, 14:32:04

Thomas Beutin <tyrone@laokoon.IN-Berlin.DE> wrote:
[snip all]

Does it help anything if you cast to the relevant type before executing,
i.e.:

( visit >= '2002-12-01'::timestamp AND visit <=
'2002-12-11'::timestamp )

Magnus

Re: INDEX suggestion needed

From

Manfred Koizar

Date:

12 December 2002, 16:00:54

On Thu, 12 Dec 2002 20:13:24 +0100, Thomas Beutin
<tyrone@laokoon.IN-Berlin.DE> wrote:
>tb=# VACUUM VERBOSE ANALYZE stat_pages;
>NOTICE:  Pages 7809: Changed 0, Empty 0; Tup 343554: Vac 0, Keep 0, UnUsed 0.

>Aggregate  (cost=3.19..3.19 rows=1 width=34) (actual time=52.89..52.89 rows=1 loops=1)
>  ->  Index Scan using tb5 on stat_pages  (cost=0.00..3.18 rows=1 width=34) (actual time=52.74..52.74 rows=0 loops=1)
>Total runtime: 53.11 msec

>tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-06-01'
ANDvisit <= '2002-12-11'); 

This selects (almost) all rows.  An index cannot help.

>i got the following index/sequence scans by date ranges:
>(visit >= '2002-06-01' AND visit <= '2002-06-30')    index scan

0 rows satisfy this condition, index scan is ok

>(visit >= '2002-06-01' AND visit <= '2002-07-31')    index scan

Ca. 10000 rows, 3% of the whole table, index scan ok

>(visit >= '2002-06-01' AND visit <= '2002-08-31')    sequence scan

32%, sequence scan is expected to be faster, unless tuples are almost
perfectly ordered by visit or most of the table (~ 8000 pages) fits
into the cache.  How much physical memory is installed?  What are your
shared_buffers and effective_cache_size settings?  You might want to
experiment with
    SET enable_seqscan = off;

>(visit >= '2002-07-01' AND visit <= '2002-07-31')    index scan

Same as 06-01 to 07-31.

>(visit >= '2002-08-01' AND visit <= '2002-08-31')    sequence scan

Ca. 29%

>(visit >= '2002-09-01' AND visit <= '2002-09-30')    sequence scan

More than 50%

>(visit >= '2002-10-01' AND visit <= '2002-10-31')    index scan, but long (>5sec)

Ca. 3%; interesting that it takes so long; I'd like to see EXPLAIN
ANALYZE output for enable_seqscan on and off.

>And: The date range in my table is from 2002-07-10 11:36:53+02 up to
>2002-10-29 23:31:47+01.

Yes, this is approximately reflected by the histogram bounds.

> attname | null_frac | avg_wi | n_distinct | correlation
>---------+-----------+--------+------------+-------------
> visit   |         0 |      8 |  -0.543682 |   -0.972118

The negative correlation looks strange.  How did you insert your data?

> m_id    |         0 |      2 |          1 |           1
                                         ^^^
Only one distinct value in m_id?  This explains why your m_id index is
never used.

> attname | most_common_vals
>---------+--------------------------
> visit   | "2002-08-21 10:29:10+02", ...
> m_id    | 35

> attname | histogram_bounds
>---------+--------------------------
> visit   | "2002-07-25 16:37:12+02"
>           "2002-08-15 12:36:18+02"
>           "2002-08-23 12:36:15+02"
>           "2002-08-29 17:30:54+02"
>           "2002-09-05 12:54:31+02"
>           "2002-09-10 18:03:54+02"
>           "2002-09-16 15:44:56+02"
>           "2002-09-20 14:34:40+02"
>           "2002-09-24 13:59:29+02"
>           "2002-09-29 09:09:31+02"
>           "2002-10-29 23:25:13+01"
> m_id    |
>

>??? Is this output ok?

Almost.  I forgot to ask for most_common_freqs (cut'n'paste error).
But I don't expect any value of visit to occur much more than twice,
so these values should be irrelevant to our estimations.

Servus
 Manfred

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

13 December 2002, 10:50:04

Hi,

at first thanks to all people for help!

On Thu, Dec 12, 2002 at 10:00:48PM +0100, Manfred Koizar wrote:
> On Thu, 12 Dec 2002 20:13:24 +0100, Thomas Beutin
> <tyrone@laokoon.IN-Berlin.DE> wrote:
[...]

> >tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >=
'2002-06-01'AND visit <= '2002-12-11'); 
>
> This selects (almost) all rows.  An index cannot help.
>
> >i got the following index/sequence scans by date ranges:
> >(visit >= '2002-06-01' AND visit <= '2002-06-30')    index scan
>
> 0 rows satisfy this condition, index scan is ok
>
> >(visit >= '2002-06-01' AND visit <= '2002-07-31')    index scan
>
> Ca. 10000 rows, 3% of the whole table, index scan ok
Ahh, now i get a clear look into index usability ;-)

> >(visit >= '2002-06-01' AND visit <= '2002-08-31')    sequence scan
>
> 32%, sequence scan is expected to be faster, unless tuples are almost
> perfectly ordered by visit or most of the table (~ 8000 pages) fits
> into the cache.  How much physical memory is installed?  What are your
> shared_buffers and effective_cache_size settings?  You might want to
> experiment with
>     SET enable_seqscan = off;
Ok, this queries are running on my developer workstation (notebook):
550 MHz mobile celeron, 128 MB RAM, 512MB swap space, IDE
running a apache/php, postgres 7.2.3, X11/gnome and mozilla 1.2.1

This are my entries in postgresql.conf:
#max_connections = 32
#shared_buffers = 64        # 2*max_connections, min 16
#effective_cache_size = 1000  # default in 8k pages
so i'm guessing i run the default values. Playing around with
enable_seqscan = on/off shows the planner's right choices.

Actually i cannot connect to the production server for stats, but the
hardware is:
2x iPIII 850MHz, 2GB RAM, 1GB swap, SCSI RAID5 for database
running a apache/php and postgres 7.2.?
Should be more reasonable...


[...]

> >(visit >= '2002-10-01' AND visit <= '2002-10-31')    index scan, but long (>5sec)
>
> Ca. 3%; interesting that it takes so long; I'd like to see EXPLAIN
> ANALYZE output for enable_seqscan on and off.
tb=# set enable_seqscan=on;
SET VARIABLE
itb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01'
ANDvisit <= '2002-10-31'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=8788.75..8788.75 rows=1 width=34) (actual time=4663.69..4663.70 rows=1 loops=1)
  ->  Index Scan using tb5 on stat_pages  (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..468.35 rows=29937
loops=1)
Total runtime: 4663.99 msec

EXPLAIN

tb=# set enable_seqscan=off;
SET VARIABLE
tb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01'
ANDvisit <= '2002-10-31'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=8788.75..8788.75 rows=1 width=34) (actual time=4788.35..4788.35 rows=1 loops=1)
  ->  Index Scan using tb5 on stat_pages  (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..439.44 rows=29937
loops=1)
Total runtime: 4788.65 msec

EXPLAIN

There is no difference in cost.

> >And: The date range in my table is from 2002-07-10 11:36:53+02 up to
> >2002-10-29 23:31:47+01.
>
> Yes, this is approximately reflected by the histogram bounds.
>
> > attname | null_frac | avg_wi | n_distinct | correlation
> >---------+-----------+--------+------------+-------------
> > visit   |         0 |      8 |  -0.543682 |   -0.972118
>
> The negative correlation looks strange.  How did you insert your data?
It is a dump from the production system, and the production system gets
the data once a day from webserver logs line by line.

> > m_id    |         0 |      2 |          1 |           1
>                                          ^^^
> Only one distinct value in m_id?  This explains why your m_id index is
> never used.
Yes, because i copied only a part of the data, it's still a lot for
my developer system ;-)
I will playing around on the production system next week, may be i give
combined indicees (m_id, visit) with enable_seqscan=on/off a try.

> > attname | most_common_vals
> >---------+--------------------------
> > visit   | "2002-08-21 10:29:10+02", ...
> > m_id    | 35
>
> > attname | histogram_bounds
> >---------+--------------------------
> > visit   | "2002-07-25 16:37:12+02"
> >           "2002-08-15 12:36:18+02"
> >           "2002-08-23 12:36:15+02"
> >           "2002-08-29 17:30:54+02"
> >           "2002-09-05 12:54:31+02"
> >           "2002-09-10 18:03:54+02"
> >           "2002-09-16 15:44:56+02"
> >           "2002-09-20 14:34:40+02"
> >           "2002-09-24 13:59:29+02"
> >           "2002-09-29 09:09:31+02"
> >           "2002-10-29 23:25:13+01"
> > m_id    |
> >
>
> >??? Is this output ok?
>
> Almost.  I forgot to ask for most_common_freqs (cut'n'paste error).
Here is the result:
tb=# SELECT attname, null_frac, avg_width, n_distinct, most_common_vals, most_common_freqs, histogram_bounds,
correlationFROM pg_stats WHERE tablename = 'stat_pages' AND attname IN ('m_id', 'visit'); 
 attname | null_frac | avg_width | n_distinct |
                                            most_common_vals
                                                          |
most_common_freqs                                                    |
                                                                                histogram_bounds
                                                                                                          | correlation


---------+-----------+-----------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 visit   |         0 |         8 |  -0.465972 | {"2002-08-21 10:31:18+02","2002-08-28 15:28:04+02","2002-09-02
08:50:08+02","2002-09-0213:48:49+02","2002-09-04 13:00:03+02","2002-09-06 18:55:19+02","2002-09-12
15:24:14+02","2002-09-1313:12:39+02","2002-09-18 12:55:07+02","2002-09-18 15:01:52+02"} |
{0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
|{"2002-07-25 07:04:05+02","2002-08-15 07:52:47+02","2002-08-22 11:00:35+02","2002-08-29 11:59:47+02","2002-09-05
13:56:08+02","2002-09-1108:08:52+02","2002-09-16 10:48:37+02","2002-09-20 11:50:46+02","2002-09-23
22:25:32+02","2002-09-2713:01:03+02","2002-10-29 23:31:18+01"} |   -0.964541 
 m_id    |         0 |         2 |          1 | {35}

                                                          | {1}
                                                             |

                                                                                                  |           1 
(2 rows)


> But I don't expect any value of visit to occur much more than twice,
> so these values should be irrelevant to our estimations.
The probability isn't high, but it could happen to see the same value multiple
times. These are web server log data in a one second grid.



Regards,
-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Alvaro Herrera

Date:

13 December 2002, 11:00:22

On Fri, Dec 13, 2002 at 04:41:38PM +0100, Thomas Beutin wrote:

> itb=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >=
'2002-10-01'AND visit <= '2002-10-31'); 
> NOTICE:  QUERY PLAN:
>
> Aggregate  (cost=8788.75..8788.75 rows=1 width=34) (actual time=4663.69..4663.70 rows=1 loops=1)
>   ->  Index Scan using tb5 on stat_pages  (cost=0.00..8712.60 rows=30459 width=34) (actual time=0.41..468.35
rows=29937loops=1) 
> Total runtime: 4663.99 msec

Now this catched my attention (in the questions' side, sorry, not the
answers').  Why the aggregate takes 10 times the time needed for the
indexscan?  One would think that a function like count() should be
pretty cheap, and the planner seems to think so (total cost for the
Aggregate node is about the same as total cost for IndexScan node), but
the executor has a completely different view...

Can that be a cut'n paste error?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"¿Qué importan los años?  Lo que realmente importa es comprobar que
a fin de cuentas la mejor edad de la vida es estar vivo"  (Mafalda)

Re: INDEX suggestion needed

From

Manfred Koizar

Date:

13 December 2002, 12:14:35

On Fri, 13 Dec 2002 16:41:38 +0100, Thomas Beutin
<tyrone@laokoon.IN-Berlin.DE> wrote:
>> >(visit >= '2002-10-01' AND visit <= '2002-10-31')    index scan, but long (>5sec)
>>
>> Ca. 3%; interesting that it takes so long; I'd like to see EXPLAIN

Oops!  Should be 10% according to the last two histogram bounds:
>> >           "2002-09-29 09:09:31+02"
>> >           "2002-10-29 23:25:13+01"

>> ANALYZE output for enable_seqscan on and off.
>There is no difference in cost.

Oops again!  If it's already using an index scan, switching seqscan
off won't change anything.  I should have meant "for enable_indexscan
on and off".

>> The negative correlation looks strange.  How did you insert your data?
>It is a dump from the production system, and the production system gets
>the data once a day from webserver logs line by line.

Did you insert in reverse order (newest first)?

| most_common_freqs = {0.000666667,...,0.000666667}

These values occurred exactly twice in the analyzer's sample of 3000
values, so for the accuracy we need here it's ok to ignore them.

Servus
 Manfred

Re: INDEX suggestion needed

From

Manfred Koizar

Date:

13 December 2002, 12:37:56

On Fri, 13 Dec 2002 13:00:14 -0300, Alvaro Herrera
<alvherre@dcc.uchile.cl> wrote:
>Now this catched my attention (in the questions' side, sorry, not the
>answers').  Why the aggregate takes 10 times the time needed for the
>indexscan?

Good point!

>  One would think that a function like count() should be
>pretty cheap,

COUNT is cheap.  But COUNT(DISTINCT something) is not trivial, it has
to keep a list of all values it has already counted.  I didn't look at
the implementation.  Do we have O(n^2) cost here?

Thomas, could you EXPLAIN ANALYZE some test cases with

    SELECT COUNT(*) FROM (
        SELECT DISTINCT a_id
          FROM stat_pages
         WHERE ...
    ) AS x;

and compare them to the results of SELECT COUNT(DISTINCT ...)?

So now you are back where you started.  At least you have an index on
"visit" now ;-)

Servus
 Manfred

Re: INDEX suggestion needed

From

Tom Lane

Date:

13 December 2002, 13:24:55

Manfred Koizar <mkoi-pg@aon.at> writes:
> COUNT is cheap.  But COUNT(DISTINCT something) is not trivial, it has
> to keep a list of all values it has already counted.  I didn't look at
> the implementation.  Do we have O(n^2) cost here?

No, more like O(n ln n) --- it's a sort/uniq implementation.

            regards, tom lane

Re: INDEX suggestion needed

From

Thomas Beutin

Date:

17 December 2002, 08:50:02

Hi,

now i'm back on track; had some trouble with the tax at the and of the
year, and this consumes lot of time, most waiting rigth bevor a boring
office :-(



On Fri, Dec 13, 2002 at 06:37:59PM +0100, Manfred Koizar wrote:
> On Fri, 13 Dec 2002 13:00:14 -0300, Alvaro Herrera
> <alvherre@dcc.uchile.cl> wrote:
> >Now this catched my attention (in the questions' side, sorry, not the
> >answers').  Why the aggregate takes 10 times the time needed for the
> >indexscan?
>
> Good point!
>
> >  One would think that a function like count() should be
> >pretty cheap,
>
> COUNT is cheap.  But COUNT(DISTINCT something) is not trivial, it has
> to keep a list of all values it has already counted.  I didn't look at
> the implementation.  Do we have O(n^2) cost here?
>
> Thomas, could you EXPLAIN ANALYZE some test cases with
>
>     SELECT COUNT(*) FROM (
>         SELECT DISTINCT a_id
>           FROM stat_pages
>          WHERE ...
>     ) AS x;
>
> and compare them to the results of SELECT COUNT(DISTINCT ...)?
Here we are:

EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-07-01' AND
visit<= '2002-07-31'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=3.21..3.21 rows=1 width=34) (actual time=8.67..8.67 rows=1 loops=1)
  ->  Index Scan using tb5 on stat_pages  (cost=0.00..3.21 rows=1 width=34) (actual time=0.08..2.00 rows=164 loops=1)
Total runtime: 8.86 msec


EXPLAIN ANALYZE SELECT COUNT(a_id) FROM ( SELECT DISTINCT a_id FROM stat_pages WHERE m_id = 35::smallint AND visit >=
'2002-07-01'AND visit <= '2002-07-31' ) AS foo; 
NOTICE:  QUERY PLAN:

Aggregate  (cost=3.22..3.22 rows=1 width=34) (actual time=9.42..9.42 rows=1 loops=1)
  ->  Subquery Scan foo  (cost=3.22..3.22 rows=1 width=34) (actual time=7.25..9.39 rows=8 loops=1)
        ->  Unique  (cost=3.22..3.22 rows=1 width=34) (actual time=7.24..9.34 rows=8 loops=1)
              ->  Sort  (cost=3.22..3.22 rows=1 width=34) (actual time=7.24..7.66 rows=164 loops=1)
                    ->  Index Scan using tb5 on stat_pages  (cost=0.00..3.21 rows=1 width=34) (actual time=0.08..1.92
rows=164loops=1) 
Total runtime: 9.67 msec




EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-08-01' AND
visit<= '2002-08-31'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=14103.16..14103.16 rows=1 width=34) (actual time=22864.95..22864.95 rows=1 loops=1)
  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=112784 width=34) (actual time=3664.51..5145.32 rows=113612
loops=1)
Total runtime: 22889.60 msec

EXPLAIN ANALYZE SELECT COUNT(a_id) FROM ( SELECT DISTINCT a_id FROM stat_pages WHERE m_id = 35::smallint AND visit >=
'2002-08-01'AND visit <= '2002-08-31' ) AS foo; 
NOTICE:  QUERY PLAN:

Aggregate  (cost=28034.99..28034.99 rows=1 width=34) (actual time=25055.61..25055.61 rows=1 loops=1)
  ->  Subquery Scan foo  (cost=27724.83..28006.79 rows=11278 width=34) (actual time=23467.83..25053.67 rows=555
loops=1)
        ->  Unique  (cost=27724.83..28006.79 rows=11278 width=34) (actual time=23467.81..25050.34 rows=555 loops=1)
              ->  Sort  (cost=27724.83..27724.83 rows=112784 width=34) (actual time=23467.80..23798.54 rows=113612
loops=1)
                    ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=112784 width=34) (actual
time=3936.11..5426.79rows=113612 loops=1) 
Total runtime: 25726.64 msec




EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-09-01' AND
visit<= '2002-09-30'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=14293.93..14293.93 rows=1 width=34) (actual time=37803.08..37803.08 rows=1 loops=1)
  ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=189093 width=34) (actual time=461.37..5648.66 rows=195265
loops=1)
Total runtime: 37803.25 msec


EXPLAIN ANALYZE SELECT COUNT(a_id) FROM ( SELECT DISTINCT a_id FROM stat_pages WHERE m_id = 35::smallint AND visit >=
'2002-09-01'AND visit <= '2002-09-30' ) AS foo; 
NOTICE:  QUERY PLAN:

Aggregate  (cost=40763.98..40763.98 rows=1 width=34) (actual time=42325.52..42325.52 rows=1 loops=1)
  ->  Subquery Scan foo  (cost=40243.98..40716.71 rows=18909 width=34) (actual time=39487.50..42323.18 rows=702
loops=1)
        ->  Unique  (cost=40243.98..40716.71 rows=18909 width=34) (actual time=39487.48..42319.02 rows=702 loops=1)
              ->  Sort  (cost=40243.98..40243.98 rows=189093 width=34) (actual time=39487.47..40079.07 rows=195265
loops=1)
                    ->  Seq Scan on stat_pages  (cost=0.00..13821.19 rows=189093 width=34) (actual time=624.73..5814.48
rows=195265loops=1) 
Total runtime: 44169.09 msec




EXPLAIN ANALYZE SELECT COUNT(DISTINCT a_id) FROM stat_pages WHERE m_id = 35::smallint AND (visit >= '2002-10-01' AND
visit<= '2002-10-31'); 
NOTICE:  QUERY PLAN:

Aggregate  (cost=8792.18..8792.18 rows=1 width=34) (actual time=5419.84..5419.84 rows=1 loops=1)
  ->  Index Scan using tb5 on stat_pages  (cost=0.00..8708.74 rows=33376 width=34) (actual time=62.75..1520.06
rows=29937loops=1) 
Total runtime: 5420.06 msec


EXPLAIN ANALYZE SELECT COUNT(a_id) FROM ( SELECT DISTINCT a_id FROM stat_pages WHERE m_id = 35::smallint AND visit >=
'2002-10-01'AND visit <= '2002-10-31' ) AS foo; 
NOTICE:  QUERY PLAN:

Aggregate  (cost=11812.87..11812.87 rows=1 width=34) (actual time=4755.42..4755.42 rows=1 loops=1)
  ->  Subquery Scan foo  (cost=11721.09..11804.53 rows=3338 width=34) (actual time=4325.68..4753.70 rows=662 loops=1)
        ->  Unique  (cost=11721.09..11804.53 rows=3338 width=34) (actual time=4325.66..4750.65 rows=662 loops=1)
              ->  Sort  (cost=11721.09..11721.09 rows=33376 width=34) (actual time=4325.64..4401.14 rows=29937 loops=1)
                    ->  Index Scan using tb5 on stat_pages  (cost=0.00..8708.74 rows=33376 width=34) (actual
time=0.21..473.04rows=29937 loops=1) 
Total runtime: 5242.39 msec



> So now you are back where you started.  At least you have an index on
> "visit" now ;-)
Yes, but this is on the production system not used (pgversion 7.2.1), so i'm
looking for a low usage time slot to upgrade the system to at least 7.2.3.
Then i can start playing with the indicees. I think, i got an idea about the
usage of indicees.

Thanks for Your help!

Greetings,
-tb
--
Thomas Beutin                             tb@laokoon.IN-Berlin.DE
Beam me up, Scotty. There is no intelligent live down in Redmond.

Re: INDEX suggestion needed

From

Manfred Koizar

Date:

18 December 2002, 06:00:44

On Tue, 17 Dec 2002 14:40:58 +0100, Thomas Beutin
<tyrone@laokoon.IN-Berlin.DE> wrote:
>On Fri, Dec 13, 2002 at 06:37:59PM +0100, Manfred Koizar wrote:
>> and compare them to the results of SELECT COUNT(DISTINCT ...)?
>Here we are:

SELECT COUNT(DISTINCT ...) vs. SELECT COUNT(*) FROM subselect
    8.86 :     9.67
22889.60 : 25726.64
37803.25 : 44169.09
 5420.06 :  5242.39

Not much of a difference.  From Tom's answer ("it's a sort/uniq
implementation") I conclude that what happens internally is basically
the same for both variants.  EXPLAIN is more informative for the
subselect version.  Looking at its output we see that most of the time
is spent in the sort step.

> ->  Sort  (actual time=39487.47..40079.07 rows=195265 loops=1)
> ->  Seq Scan (actual time=624.73..5814.48 rows=195265 loops=1)
                                    ^^^^^^^
                                 Compare these values!
Scan   vs. Sort   for N rows  rel t
   1.92 :     7.66     164   0.0068
5426.79 : 23798.54  113612   0.0139
5814.48 : 40079.07  195265   0.0144
 473.04 :  4401.14   29937   0.0127

rel t is (Sort - Scan) / (N * ln N)

Make sure you have configured enough sort_mem to perform the sort in
memory for a reasonable number of tuples.  The default is 1024 (1 MB)
which is not enough for 200000 tuples.

>Then i can start playing with the indicees. I think, i got an idea about the
>usage of indicees.

If on your production system WHERE clauses look like
    m_id = ... and visit BETWEEN ... AND ...

then an index on (m_id, visit) should help.  If the planner doesn't
choose that index while you think it should, post here again (after
you have verified with EXPLAIN ANALYZE with enable_seqscan on/off).

Servus
 Manfred