Thread: Why DISTINCT ... DESC is slow?


From
Anton
Date:
Hi. With this table (about 800 000 rows):

=# \d n_traffic
                         Table "public.n_traffic"
    Column    |            Type             |          Modifiers
--------------+-----------------------------+------------------------------
 login_id     | integer                     | not null
 traftype_id  | integer                     | not null
 collect_time | timestamp without time zone | not null default now()
 bytes_in     | bigint                      | not null default (0)::bigint
 bytes_out    | bigint                      | not null default (0)::bigint
Indexes:
    "n_traffic_collect_time" btree (collect_time)
    "n_traffic_login_id" btree (login_id)
    "n_traffic_login_id_collect_time" btree (login_id, collect_time)
Foreign-key constraints:
    "n_traffic_login_id_fkey" FOREIGN KEY (login_id) REFERENCES
n_logins(login_id) ON UPDATE CASCADE
    "n_traffic_traftype_id_fkey" FOREIGN KEY (traftype_id) REFERENCES
n_traftypes(traftype_id) ON UPDATE CASCADE


=# explain analyze SELECT DISTINCT ON (login_id) login_id,
collect_time AS dt FROM n_traffic ORDER BY login_id, collect_time DESC;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=91711.13..95669.41 rows=532 width=12) (actual
time=10698.860..13790.918 rows=798 loops=1)
   ->  Sort  (cost=91711.13..93690.27 rows=791656 width=12) (actual
time=10698.851..12140.496 rows=791656 loops=1)
         Sort Key: login_id, collect_time
         ->  Seq Scan on n_traffic  (cost=0.00..14150.56 rows=791656
width=12) (actual time=0.013..2776.572 rows=791656 loops=1)
 Total runtime: 14049.378 ms
(5 rows)

For me it is strange that postgres uses a Seq Scan, and setting
enable_seqscan = off doesn't change anything.

Without DESC the query runs faster... but still not that fast!
=# explain analyze SELECT DISTINCT ON (login_id) login_id,
collect_time AS dt FROM n_traffic ORDER BY login_id, collect_time;

     QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..29843.08 rows=532 width=12) (actual
time=0.045..5146.768 rows=798 loops=1)
   ->  Index Scan using n_traffic_login_id_collect_time on n_traffic
(cost=0.00..27863.94 rows=791656 width=12) (actual
time=0.037..3682.853 rows=791656 loops=1)
 Total runtime: 5158.735 ms
(3 rows)

Why? 798 rows is about 1000 times smaller than the entire n_traffic. And
why is an Index Scan used without DESC but not with DESC?
--
engineer

Re: Why DISTINCT ... DESC is slow?

From
Richard Huxton
Date:
Anton wrote:
> While without DESC query goes faster... But not so fast!
> =# explain analyze SELECT DISTINCT ON (login_id) login_id,
> collect_time AS dt FROM n_traffic ORDER BY login_id, collect_time;
>
>     QUERY PLAN
>
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

>
> Unique  (cost=0.00..29843.08 rows=532 width=12) (actual
> time=0.045..5146.768 rows=798 loops=1)
>   ->  Index Scan using n_traffic_login_id_collect_time on n_traffic
> (cost=0.00..27863.94 rows=791656 width=12) (actual
> time=0.037..3682.853 rows=791656 loops=1)
> Total runtime: 5158.735 ms
> (3 rows)
>
> Why? 798 rows is about 1000 times smaller than entire n_traffic. And
> why Index Scan used without DESC but with DESC is not?

For the DESC version to use the index try "login_id DESC, collect_time
DESC" - so both are reversed.
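
Spelled out, that would be the following (a sketch, untested here):

```sql
-- Sketch: both ORDER BY keys reversed, so a backward scan of
-- n_traffic_login_id_collect_time can satisfy the ordering.
SELECT DISTINCT ON (login_id) login_id, collect_time AS dt
FROM n_traffic
ORDER BY login_id DESC, collect_time DESC;
```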

I'm also not sure what this query is meant to do precisely. ORDER BY is
usually the last stage in a query, so it might be applied *after* the
DISTINCT ON.

If you want the most recent collect_time for each login I'd use
something like:

SELECT login_id, MAX(collect_time) AS most_recent
FROM n_traffic
GROUP BY login_id
ORDER BY login_id DESC, collect_time DESC

--
   Richard Huxton
   Archonet Ltd

Re: Why DISTINCT ... DESC is slow?

From
Michael Glaesemann
Date:
On Dec 12, 2006, at 16:43 , Richard Huxton wrote:

> Anton wrote:
>> While without DESC query goes faster... But not so fast!
>> =# explain analyze SELECT DISTINCT ON (login_id) login_id,
>> collect_time AS dt FROM n_traffic ORDER BY login_id, collect_time;
>>     QUERY PLAN
>> ---------------------------------------------------------------------
>> Unique  (cost=0.00..29843.08 rows=532 width=12) (actual
>> time=0.045..5146.768 rows=798 loops=1)
>>   ->  Index Scan using n_traffic_login_id_collect_time on n_traffic
>> (cost=0.00..27863.94 rows=791656 width=12) (actual
>> time=0.037..3682.853 rows=791656 loops=1)
>> Total runtime: 5158.735 ms
>> (3 rows)
>> Why? 798 rows is about 1000 times smaller than entire n_traffic. And
>> why Index Scan used without DESC but with DESC is not?
>
> For the DESC version to use the index try "login_id DESC,
> collect_time DESC" - so both are reversed.
>
> I'm also not sure what this query is meant to do precisely. ORDER
> BY is usually the last stage in a query, so it might be applied
> *after* the DISTINCT ON.

My understanding is that DISTINCT ON requires the ORDER BY, so I'd be
surprised if ORDER BY is applied after. (Though I'm happy to hear
more about this.)


Michael Glaesemann
grzm seespotcode net



Re: Why DISTINCT ... DESC is slow?

From
Richard Huxton
Date:
Michael Glaesemann wrote:
>
> On Dec 12, 2006, at 16:43 , Richard Huxton wrote:
>
>> Anton wrote:
>>> While without DESC query goes faster... But not so fast!
>>> =# explain analyze SELECT DISTINCT ON (login_id) login_id,
>>> collect_time AS dt FROM n_traffic ORDER BY login_id, collect_time;
>>>     QUERY PLAN
>>> -----------------------------------------------------------------
>>> Unique  (cost=0.00..29843.08 rows=532 width=12) (actual
>>> time=0.045..5146.768 rows=798 loops=1)
>>>   ->  Index Scan using n_traffic_login_id_collect_time on n_traffic
>>> (cost=0.00..27863.94 rows=791656 width=12) (actual
>>> time=0.037..3682.853 rows=791656 loops=1)
>>> Total runtime: 5158.735 ms
>>> (3 rows)
>>> Why? 798 rows is about 1000 times smaller than entire n_traffic. And
>>> why Index Scan used without DESC but with DESC is not?
>>
>> For the DESC version to use the index try "login_id DESC, collect_time
>> DESC" - so both are reversed.
>>
>> I'm also not sure what this query is meant to do precisely. ORDER BY
>> is usually the last stage in a query, so it might be applied *after*
>> the DISTINCT ON.
>
> My understanding is that DISTINCT ON requires the ORDER BY, so I'd be
> surprised if ORDER BY is applied after. (Though I'm happy to hear more
> about this.)

(goes away and tests) Ah, you're quite right. I was worried about
ill-defined results, but PostgreSQL prevents you from doing that.

--
   Richard Huxton
   Archonet Ltd

Re: Why DISTINCT ... DESC is slow?

From
Anton
Date:
> > Why? 798 rows is about 1000 times smaller than entire n_traffic. And
> > why Index Scan used without DESC but with DESC is not?
>
> For the DESC version to use the index try "login_id DESC, collect_time
> DESC" - so both are reversed.

Yes, it helps!

But
> If you want the most recent collect_time for each login I'd use
> something like:
> SELECT login_id, MAX(collect_time) AS most_recent
> FROM n_traffic
> GROUP BY login_id
> ORDER BY login_id DESC, collect_time DESC
is not so good:
=# SELECT login_id, MAX(collect_time) AS most_recent
-# FROM n_traffic
-# GROUP BY login_id
-# ORDER BY login_id DESC, collect_time DESC;
ERROR:  column "n_traffic.collect_time" must appear in the GROUP BY
clause or be used in an aggregate function

If I correct this error (add collect_time to GROUP BY) I'll just get
the full table, sorted. And I try not to use aggregate functions
because they force a full table scan...
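
(As an aside: for a single login, the latest row can come straight off
the composite index without any aggregate. A sketch, with a made-up
login_id value:)

```sql
-- Sketch: one backward probe of n_traffic_login_id_collect_time
-- instead of a scan; login_id = 42 is only for illustration.
SELECT collect_time
FROM n_traffic
WHERE login_id = 42
ORDER BY collect_time DESC
LIMIT 1;
```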

So,
=# explain analyze SELECT DISTINCT ON (login_id) login_id,
collect_time AS dt FROM n_traffic ORDER BY login_id DESC, collect_time
DESC;

         QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.00..29843.08 rows=532 width=12) (actual
time=60.656..9747.985 rows=796 loops=1)
  ->  Index Scan Backward using n_traffic_login_id_collect_time on
n_traffic  (cost=0.00..27863.94 rows=791656 width=12) (actual
time=60.645..8221.891 rows=789934 loops=1)
 Total runtime: 9750.189 ms
(3 rows)

Indexes are used, which is good, but the speed is still not great for
2x PIII 1GHz + 1GB RAM + RAID5 on SCSI...

Anyhow, thank you!
--
engineer

Re: Why DISTINCT ... DESC is slow?

From
Richard Huxton
Date:
Anton wrote:
>> SELECT login_id, MAX(collect_time) AS most_recent
>> FROM n_traffic
>> GROUP BY login_id
>> ORDER BY login_id DESC, collect_time DESC
> is not so good:
> =# SELECT login_id, MAX(collect_time) AS most_recent
> -# FROM n_traffic
> -# GROUP BY login_id
> -# ORDER BY login_id DESC, collect_time DESC;
> ERROR:  column "n_traffic.collect_time" must appear in the GROUP BY
> clause or be used in an aggregate function
>
> If I correct this error (add collect time to GROUP BY) I'll just get
> full table, sorted. And I tried to not use aggregate functions because
> they make to do full table scan...

Sorry - my typo. The order-by doesn't need "collect_time" of course.

--
   Richard Huxton
   Archonet Ltd

Re: Why DISTINCT ... DESC is slow?

From
Ron Johnson
Date:

On 12/12/06 01:28, Anton wrote:
> Hi. With this table (about 800 000 rows):
>
> =# \d n_traffic
>                         Table "public.n_traffic"
>    Column    |            Type             |          Modifiers
> --------------+-----------------------------+------------------------------
> login_id     | integer                     | not null
> traftype_id  | integer                     | not null
> collect_time | timestamp without time zone | not null default now()
> bytes_in     | bigint                      | not null default (0)::bigint
> bytes_out    | bigint                      | not null default (0)::bigint
> Indexes:
>    "n_traffic_collect_time" btree (collect_time)
>    "n_traffic_login_id" btree (login_id)
>    "n_traffic_login_id_collect_time" btree (login_id, collect_time)
> Foreign-key constraints:
>    "n_traffic_login_id_fkey" FOREIGN KEY (login_id) REFERENCES
> n_logins(login_id) ON UPDATE CASCADE
>    "n_traffic_traftype_id_fkey" FOREIGN KEY (traftype_id) REFERENCES
> n_traftypes(traftype_id) ON UPDATE CASCADE

Why do you have indexes on both LOGIN_ID *and* LOGIN_ID + COLLECT_TIME?

ISTM that you can drop the LOGIN_ID index.

--
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.

Re: Why DISTINCT ... DESC is slow?

From
Erik Jones
Date:
Ron Johnson wrote:
>
> On 12/12/06 01:28, Anton wrote:
>
>> Hi. With this table (about 800 000 rows):
>>
>> =# \d n_traffic
>>                         Table "public.n_traffic"
>>    Column    |            Type             |          Modifiers
>> --------------+-----------------------------+------------------------------
>> login_id     | integer                     | not null
>> traftype_id  | integer                     | not null
>> collect_time | timestamp without time zone | not null default now()
>> bytes_in     | bigint                      | not null default (0)::bigint
>> bytes_out    | bigint                      | not null default (0)::bigint
>> Indexes:
>>    "n_traffic_collect_time" btree (collect_time)
>>    "n_traffic_login_id" btree (login_id)
>>    "n_traffic_login_id_collect_time" btree (login_id, collect_time)
>> Foreign-key constraints:
>>    "n_traffic_login_id_fkey" FOREIGN KEY (login_id) REFERENCES
>> n_logins(login_id) ON UPDATE CASCADE
>>    "n_traffic_traftype_id_fkey" FOREIGN KEY (traftype_id) REFERENCES
>> n_traftypes(traftype_id) ON UPDATE CASCADE
>>
>
> Why do you have indexes on both LOGIN_ID *and* LOGIN_ID + COLLECT_TIME?
>
> ISTM that you can drop the LOGIN_ID index.
>
Hmm...   Will queries that use only login_id and not collect_time use
the (login_id, collect_time) index?

--
erik jones <erik@myemma.com>
software development
emma(r)


Re: Why DISTINCT ... DESC is slow?

From
"Brandon Aiken"
Date:
If you have, say, an index(x, y) then that index will often double as an
index(x).  It will generally not double as an index(y).

I'm not sure if that's how all RDBMSs work, but I'm pretty sure that's
how Oracle works.  It never surprises me when PostgreSQL mimics Oracle.

--
Brandon Aiken
CS/IT Systems Engineer
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Erik Jones
Sent: Tuesday, December 12, 2006 11:33 AM
To: Ron Johnson
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Why DISTINCT ... DESC is slow?

Ron Johnson wrote:
>
> On 12/12/06 01:28, Anton wrote:
>
>> Hi. With this table (about 800 000 rows):
>>
>> =# \d n_traffic
>>                         Table "public.n_traffic"
>>    Column    |            Type             |          Modifiers
>> --------------+-----------------------------+------------------------------
>> login_id     | integer                     | not null
>> traftype_id  | integer                     | not null
>> collect_time | timestamp without time zone | not null default now()
>> bytes_in     | bigint                      | not null default (0)::bigint
>> bytes_out    | bigint                      | not null default (0)::bigint
>> Indexes:
>>    "n_traffic_collect_time" btree (collect_time)
>>    "n_traffic_login_id" btree (login_id)
>>    "n_traffic_login_id_collect_time" btree (login_id, collect_time)
>> Foreign-key constraints:
>>    "n_traffic_login_id_fkey" FOREIGN KEY (login_id) REFERENCES
>> n_logins(login_id) ON UPDATE CASCADE
>>    "n_traffic_traftype_id_fkey" FOREIGN KEY (traftype_id) REFERENCES
>> n_traftypes(traftype_id) ON UPDATE CASCADE
>>
>
> Why do you have indexes on both LOGIN_ID *and* LOGIN_ID + COLLECT_TIME?
>
> ISTM that you can drop the LOGIN_ID index.
>
Hmm...   Will queries that use only login_id and not collect_time use
the (login_id, collect_time) index?

--
erik jones <erik@myemma.com>
software development
emma(r)



Re: Why DISTINCT ... DESC is slow?

From
Tom Lane
Date:
"Brandon Aiken" <BAiken@winemantech.com> writes:
> If you have, say, an index(x, y) then that index will often double as an
> index(x).  It will generally not double as an index(y).

It's not hard to understand why, if you think about the sort ordering of
a double-column index:

    x    y

    1    1
    1    2
    1    3
    2    1
    2    2
    2    3
    3    1
    ...

All similar values of x are brought together, so scanning the index for
x alone works just the same as it would in a one-column index ... the
index entries are bigger so it's marginally less efficient, but only
marginally.  On the other hand, the entries for a specific value or
range of y will be scattered all over the index, so it's almost useless
to use the index for a search on y alone.

As of PG 8.1 or 8.2 (I forget) the optimizer will *consider* using such
an index for a y-only query, but it'll nearly always decide it's a bad
idea.
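
This is easy to see with a toy table (a sketch; the names t, x, and y
are made up):

```sql
-- Sketch: a two-column index serves leading-column lookups,
-- but almost never trailing-column ones.
CREATE TABLE t (x integer, y integer);
CREATE INDEX t_x_y ON t (x, y);

EXPLAIN SELECT * FROM t WHERE x = 1;  -- can use t_x_y
EXPLAIN SELECT * FROM t WHERE y = 1;  -- will usually seq-scan
```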

            regards, tom lane

Re: Why DISTINCT ... DESC is slow?

From
Ron Johnson
Date:

On 12/12/06 11:30, Tom Lane wrote:
> "Brandon Aiken" <BAiken@winemantech.com> writes:
>> If you have, say, an index(x, y) then that index will often double as an
>> index(x).  It will generally not double as an index(y).
>
> It's not hard to understand why, if you think about the sort ordering of
> a double-column index:
>
>     x    y
>
>     1    1
>     1    2
>     1    3
>     2    1
>     2    2
>     2    3
>     3    1
>     ...
>
> All similar values of x are brought together, so scanning the index for
> x alone works just the same as it would in a one-column index ... the
> index entries are bigger so it's marginally less efficient, but only
> marginally.  On the other hand, the entries for a specific value or
> range of y will be scattered all over the index, so it's almost useless
> to use the index for a search on y alone.

Some DBMSs call this an "index scan".

> As of PG 8.1 or 8.2 (I forget) the optimizer will *consider* using such
> an index for a y-only query, but it'll nearly always decide it's a bad
> idea.

Scanning segment-2 of a 2-segment index seems like it would be
faster than scanning the table, if for no other reason than
"locality of data": the index will be smaller than the table, so
scanning it looking for record pointers should be faster than
scanning the table.

--
Ron Johnson, Jr.
Jefferson LA  USA

Re: Why DISTINCT ... DESC is slow?

From
Anton
Date:
> > =# \d n_traffic
> >                         Table "public.n_traffic"
> >    Column    |            Type             |          Modifiers
> > --------------+-----------------------------+------------------------------
> > login_id     | integer                     | not null
> > traftype_id  | integer                     | not null
> > collect_time | timestamp without time zone | not null default now()
> > bytes_in     | bigint                      | not null default (0)::bigint
> > bytes_out    | bigint                      | not null default (0)::bigint
> > Indexes:
> >    "n_traffic_collect_time" btree (collect_time)
> >    "n_traffic_login_id" btree (login_id)
> >    "n_traffic_login_id_collect_time" btree (login_id, collect_time)
> > Foreign-key constraints:
> >    "n_traffic_login_id_fkey" FOREIGN KEY (login_id) REFERENCES
> > n_logins(login_id) ON UPDATE CASCADE
> >    "n_traffic_traftype_id_fkey" FOREIGN KEY (traftype_id) REFERENCES
> > n_traftypes(traftype_id) ON UPDATE CASCADE
>
> Why do you have indexes on both LOGIN_ID *and* LOGIN_ID + COLLECT_TIME?
It is because I think that queries which use only the LOGIN_ID field
will use the (faster) LOGIN_ID-only index... For me, the speed of
insertions is not a primary concern here (the robot is not bothered by
delays...), but selects are. So I keep both indexes.

> ISTM that you can drop the LOGIN_ID index.
--
engineer

Re: Why DISTINCT ... DESC is slow?

From
Ron Johnson
Date:

On 12/12/06 23:13, Anton wrote:
[snip]
>> Why do you have indexes on both LOGIN_ID *and* LOGIN_ID +
>> COLLECT_TIME?
> It is because I think that queries which use only LOGIN_ID field
> will use the (faster) LOGIN_ID-only index... For me, speed of
> insertions is not a primary task here (robot is not confused by
> delays...), but select is. So I keep both indexes.

Figured.  Understandable thought, and valid for a *hashed* index.
Also valid for COLLECT_TIME, since it's the 2nd segment of the index.

Because of the nature of b-tree indexes, though, the optimizer
*will* use n_traffic_login_id_collect_time when you say WHERE
LOGIN_ID = 5;
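
(A quick way to verify, sketched:)

```sql
-- Sketch: check which index the planner picks for a leading-column
-- lookup; optionally drop the single-column index first and compare.
EXPLAIN SELECT * FROM n_traffic WHERE login_id = 5;
```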

>> ISTM that you can drop the LOGIN_ID index.


--
Ron Johnson, Jr.
Jefferson LA  USA