Thread: [BUGS] Query planner skipping index depending on DISTINCT parameter order

[BUGS] Query planner skipping index depending on DISTINCT parameter order

From
Дилян Палаузов
Date:
Hello,

I have this database in Pg 9.6.5, the schema comes from spamassassin:

┌────────────┬─────────┬──────────────────────────────┐
│   Column   │  Type   │          Modifiers           │
├────────────┼─────────┼──────────────────────────────┤
│ id         │ integer │ not null default 0           │
│ token      │ bytea   │ not null default '\x'::bytea │
│ spam_count │ integer │ not null default 0           │
│ ham_count  │ integer │ not null default 0           │
│ atime      │ integer │ not null default 0           │
└────────────┴─────────┴──────────────────────────────┘
Indexes:    "bayes_token_pkey" PRIMARY KEY, btree (id, token)    "bayes_token_idx1" btree (token)    "the_index" btree
(id,token)
 

with 261000 rows. According to my understanding, "SELECT DISTINCT ON(token, id) token FROM bayes_token;" and "SELECT
DISTINCTON(id, token) token FROM bayes_token;" are supposed to deliver the same results, as only the "token" and "id"
columnsare exchanged in DISTINCT, so the query optimizer is supposed to generate the same query plan.  But it does
not:

EXPLAIN ANALYZE SELECT DISTINCT ON(token, id) token  FROM bayes_token;

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN                                                           │

├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=32624.56..34588.73 rows=261889 width=10) (actual time=154.947..264.439 rows=261935 loops=1) │
│   ->  Sort  (cost=32624.56..33279.28 rows=261889 width=10) (actual time=154.945..215.378 rows=261935 loops=1)
        │
 
│         Sort Key: token, id │
│         Sort Method: external merge  Disk: 5624kB │
│         ->  Seq Scan on bayes_token  (cost=0.00..4579.89 rows=261889 width=10) (actual time=0.015..35.282 rows=261935
loops=1)│
 
│ Planning time: 0.085 ms │
│ Execution time: 285.303 ms │

└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

EXPLAIN ANALYZE SELECT DISTINCT ON(id, token) token  FROM bayes_token;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN                 │

├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=0.42..13819.07 rows=262023 width=10) (actual time=0.028..207.695 rows=262018 loops=1)
                           │
 
│   ->  Index Only Scan using the_index on bayes_token (cost=0.42..12508.95 rows=262023 width=10) (actual
time=0.026..147.882rows=262018 loops=1) │
 
│         Heap Fetches: 261729             │
│ Planning time: 0.086 ms             │
│ Execution time: 232.598 ms             │

└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The_index and the SELECTs are invented for the sake of this demonstration.

Regards  Diluan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

[BUGS] Query planner skipping index depending on DISTINCT parameter order(2)

From
Дилян Палаузов
Date:
Post Scriptum to my email from yesterday.

I deleted the_index, so that I have now

spamassassin=> \d bayes_token              Table "public.bayes_token"
┌────────────┬─────────┬──────────────────────────────┐
│   Column   │  Type   │          Modifiers           │
├────────────┼─────────┼──────────────────────────────┤
│ id         │ integer │ not null default 0           │
│ token      │ bytea   │ not null default '\x'::bytea │
│ spam_count │ integer │ not null default 0           │
│ ham_count  │ integer │ not null default 0           │
│ atime      │ integer │ not null default 0           │
└────────────┴─────────┴──────────────────────────────┘
Indexes:    "bayes_token_pkey" PRIMARY KEY, btree (id, token)    "bayes_token_idx1" btree (token)

and as you can see, there is still a btree index on (id, token).  I expect the same query plans, as before deleting
the_index,as bayes_token_pkey has the same information as the_index had.  This does not happen on my system and I have
nottweaked the default configuration files.
 

spamassassin=> EXPLAIN ANALYZE SELECT DISTINCT(id, token) token FROM bayes_token;

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                           QUERY PLAN
        │
 

├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Unique  (cost=21737.73..22555.11 rows=163476 width=32) (actual time=914.275..1206.239 rows=165523 loops=1)
        │
 
│   ->  Sort  (cost=21737.73..22146.42 rows=163476 width=32) (actual time=914.274..1125.674 rows=165523 loops=1)
        │
 
│         Sort Key: (ROW(id, token))
        │
 
│         Sort Method: external merge  Disk: 6616kB
        │
 
│         ->  Seq Scan on bayes_token  (cost=0.00..3668.76 rows=163476 width=32) (actual time=0.015..50.849 rows=165523
loops=1)│
 
│ Planning time: 0.079 ms
        │
 
│ Execution time: 1216.047 ms
        │
 

└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

What is the difference makes ROW between the one case here  Sort Key: (ROW(id, token))
and in the other below:  Sort Key: token, id

Regards  Dilyan

On 09/14/17 17:38, Дилян Палаузов wrote:
> Hello,
> 
> I have this database in Pg 9.6.5, the schema comes from spamassassin:
> 
> ┌────────────┬─────────┬──────────────────────────────┐
> │   Column   │  Type   │          Modifiers           │
> ├────────────┼─────────┼──────────────────────────────┤
> │ id         │ integer │ not null default 0           │
> │ token      │ bytea   │ not null default '\x'::bytea │
> │ spam_count │ integer │ not null default 0           │
> │ ham_count  │ integer │ not null default 0           │
> │ atime      │ integer │ not null default 0           │
> └────────────┴─────────┴──────────────────────────────┘
> Indexes:
>      "bayes_token_pkey" PRIMARY KEY, btree (id, token)
>      "bayes_token_idx1" btree (token)
>      "the_index" btree (id, token)
> 
> with 261000 rows. According to my understanding, "SELECT DISTINCT ON(token, id) token FROM bayes_token;" and "SELECT
DISTINCTON(id, token) token FROM bayes_token;" are supposed to deliver the same results, as only the "token" and "id"
columnsare exchanged in DISTINCT, so the query optimizer is supposed to generate the same query plan.  But it does
not:
> 
> EXPLAIN ANALYZE SELECT DISTINCT ON(token, id) token  FROM bayes_token;
>
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN                                                           │
>
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Unique  (cost=32624.56..34588.73 rows=261889 width=10) (actual time=154.947..264.439 rows=261935 loops=1) │
> │   ->  Sort  (cost=32624.56..33279.28 rows=261889 width=10) (actual time=154.945..215.378 rows=261935 loops=1)
          │
 
> │         Sort Key: token, id │
> │         Sort Method: external merge  Disk: 5624kB │
> │         ->  Seq Scan on bayes_token  (cost=0.00..4579.89 rows=261889 width=10) (actual time=0.015..35.282
rows=261935loops=1) │
 
> │ Planning time: 0.085 ms │
> │ Execution time: 285.303 ms │
>
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> 
> EXPLAIN ANALYZE SELECT DISTINCT ON(id, token) token  FROM bayes_token;
>
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │ QUERY PLAN                 │
>
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Unique  (cost=0.42..13819.07 rows=262023 width=10) (actual time=0.028..207.695 rows=262018 loops=1)
                             │
 
> │   ->  Index Only Scan using the_index on bayes_token (cost=0.42..12508.95 rows=262023 width=10) (actual
time=0.026..147.882rows=262018 loops=1) │
 
> │         Heap Fetches: 261729             │
> │ Planning time: 0.086 ms             │
> │ Execution time: 232.598 ms             │
>
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> 
> The_index and the SELECTs are invented for the sake of this demonstration.
> 
> Regards
>    Diluan


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)

From
Tomas Vondra
Date:
Hello Dilyan,

You're right - we're currently not able to use the index if it's not
consistent with the DISTINCT ordering. That is, if you have index on
(a,b) and DISTINCT ON (b,a) we fail to leverage the index.

The reason for this simple - if you look at create_distinct_paths [1],
which is where the Unique path comes from, you'll see it iterates over
all paths and compares the ordering using pathkeys_is_contained [2].

That however only ensures the path matches the expected Unique ordering
(determined by the column list in DISTINCT ON clause), we don't try to
re-shuffle the columns in any way at this point.

So this is more a missing optimization than a bug, I'd guess. But it
seems worthwhile and possibly not extremely difficult to implement, so I
may look into it - but that's PG11 at the earliest.

But, looking at the code in create_distinct_paths, ISTM you can easily
convince the planner to use the index by simply adding a matching ORDER
BY clause. That is
   SELECT DISTINCT ON(token, id) token  FROM bayes_token   ORDER BY id, token;

should be able to use the index on (id,token).


[1]
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725

[2]
https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)

From
Dilyan Palauzov
Date:
Hello Tomas,

thanks for your answer.

While adding an ORDER BY uses the index, it does not help, as the user 
has to take care of the order of provided columns.  Whether the care is 
is taken in DISTINCT ON or in ORDER BY is secondary.

The behaviour behind DISTINCT and indexes surprises  me, as the query 
planner does reorder the columns for SELECT to determine the most 
suitable index.

My proposal to reflect this:

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1069,7 +1069,10 @@ SELECT DISTINCT ON (location) location, time, report    <para>     Currently, <literal>FOR NO
KEYUPDATE</>, <literal>FOR UPDATE</>,     <literal>FOR SHARE</> and <literal>FOR KEY SHARE</> cannot be
 
-    specified with <literal>DISTINCT</literal>.
+    specified with <literal>DISTINCT</literal>.  Contrary to
+    <literal>SELECT</>, which reorders its parameters to find a best,
+    matching index, DISTINCT ON constructs an expression, e.g. from the
+    provided rows, and checks then if an index can serve the expression.    </para>   </refsect2>


Please ignore my email from Friday, I have forgotten an ON after DISTINCT.


Greetings  Dilyan


On 09/16/2017 10:23 AM, Tomas Vondra wrote:
> Hello Dilyan,
>
> You're right - we're currently not able to use the index if it's not
> consistent with the DISTINCT ordering. That is, if you have index on
> (a,b) and DISTINCT ON (b,a) we fail to leverage the index.
>
> The reason for this simple - if you look at create_distinct_paths [1],
> which is where the Unique path comes from, you'll see it iterates over
> all paths and compares the ordering using pathkeys_is_contained [2].
>
> That however only ensures the path matches the expected Unique ordering
> (determined by the column list in DISTINCT ON clause), we don't try to
> re-shuffle the columns in any way at this point.
>
> So this is more a missing optimization than a bug, I'd guess. But it
> seems worthwhile and possibly not extremely difficult to implement, so I
> may look into it - but that's PG11 at the earliest.
>
> But, looking at the code in create_distinct_paths, ISTM you can easily
> convince the planner to use the index by simply adding a matching ORDER
> BY clause. That is
>
>     SELECT DISTINCT ON(token, id) token  FROM bayes_token
>     ORDER BY id, token;
>
> should be able to use the index on (id,token).
>
>
> [1]
> https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4725
>
> [2]
> https://github.com/postgres/postgres/blob/master/src/backend/optimizer/plan/planner.c#L4811
>
>
> regards
>


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)

From
Tomas Vondra
Date:
On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:
> Hello Tomas,
> 
> thanks for your answer.
> 
> While adding an ORDER BY uses the index, it does not help, as the
> user has to take care of the order of provided columns.  Whether the
> care is is taken in DISTINCT ON or in ORDER BY is secondary.
> 
> The behaviour behind DISTINCT and indexes surprises  me, as the
> query planner does reorder the columns for SELECT to determine the
> most suitable index.
> 

Well, I agree it's somewhat reasonable optimization. The thing is, the
planner/optimizer does not start with all features on day 1, it gets
improved over time. And no one implemented this bit yet.


> My proposal to reflect this:
> 
> diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
> --- a/doc/src/sgml/ref/select.sgml
> +++ b/doc/src/sgml/ref/select.sgml
> @@ -1069,7 +1069,10 @@ SELECT DISTINCT ON (location) location, time, report
>     <para>
>      Currently, <literal>FOR NO KEY UPDATE</>, <literal>FOR UPDATE</>,
>      <literal>FOR SHARE</> and <literal>FOR KEY SHARE</> cannot be
> -    specified with <literal>DISTINCT</literal>.
> +    specified with <literal>DISTINCT</literal>.  Contrary to
> +    <literal>SELECT</>, which reorders its parameters to find a best,
> +    matching index, DISTINCT ON constructs an expression, e.g. from the
> +    provided rows, and checks then if an index can serve the expression.
>     </para>
>    </refsect2>
> 
I don't think we want to change the docs like this. Notice that the
SELECT documentation does not mention indexes at all, and I'm pretty
sure we don't want to start doing that. The docs are user-level, deal
only explaining properties of the output relation, and not with
implementation-level details like index optimizations.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:
>> The behaviour behind DISTINCT and indexes surprises  me, as the
>> query planner does reorder the columns for SELECT to determine the
>> most suitable index.

> Well, I agree it's somewhat reasonable optimization. The thing is, the
> planner/optimizer does not start with all features on day 1, it gets
> improved over time. And no one implemented this bit yet.

For the DISTINCT ON case, the user-visible semantics are actually pretty
tightly tied to ORDER BY, so that it would not be very reasonable to
consider any other orderings than the given column order anyway.

For plain DISTINCT, yeah we could consider other orderings ... but
we're rather unlikely to find an index that matches all the output
columns, regardless of what order they're in.  So it's just not that
exciting.

IOW, somebody might get around to this someday, but don't hold your
breath; there's lots of higher-value fruit to be reaching for.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)

From
Tomas Vondra
Date:

On 09/17/2017 07:15 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 09/17/2017 02:30 AM, Dilyan Palauzov wrote:
>>> The behaviour behind DISTINCT and indexes surprises  me, as the
>>> query planner does reorder the columns for SELECT to determine the
>>> most suitable index.
> 
>> Well, I agree it's somewhat reasonable optimization. The thing is, the
>> planner/optimizer does not start with all features on day 1, it gets
>> improved over time. And no one implemented this bit yet.
> 
> For the DISTINCT ON case, the user-visible semantics are actually pretty
> tightly tied to ORDER BY, so that it would not be very reasonable to
> consider any other orderings than the given column order anyway.
> 

Tied in what sense? In the docs we explicitly say this:
 https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct
 SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions
evaluateto equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note
thatthe “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
 

which in my understanding is that while we use the same rules as ORDER
BY, we don't guarantee any particular ordering (i.e. which row we keep)
unless an explicit ORDER BY clause is used.

So if an ORDER BY is not specified, why couldn't we pick an arbitrary
ordering matching based on available indexes?

> For plain DISTINCT, yeah we could consider other orderings ... but
> we're rather unlikely to find an index that matches all the output
> columns, regardless of what order they're in.  So it's just not that
> exciting.
> 

Not necessarily. For example if we get the incremental sort in, we might
pick from a much wider set of indexes.

> IOW, somebody might get around to this someday, but don't hold your
> breath; there's lots of higher-value fruit to be reaching for.
> 

Sure. But the perceived value really depends on the user - what's
worthless for one user may be quite valuable for another one.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On 09/17/2017 07:15 PM, Tom Lane wrote:
>> For the DISTINCT ON case, the user-visible semantics are actually pretty
>> tightly tied to ORDER BY, so that it would not be very reasonable to
>> consider any other orderings than the given column order anyway.

> Tied in what sense? In the docs we explicitly say this:

>   https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct

>   SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
>   each set of rows where the given expressions evaluate to equal. The
>   DISTINCT ON expressions are interpreted using the same rules as for
>   ORDER BY (see above). Note that the “first row” of each set is
>   unpredictable unless ORDER BY is used to ensure that the desired row
>   appears first.

Right, so the behavior is undefined unless you have an ORDER BY clause
that includes the DISTINCT ON columns plus some more columns.  That's
pretty tightly tied in my book.

> So if an ORDER BY is not specified, why couldn't we pick an arbitrary
> ordering matching based on available indexes?

The case is not of any real-world use, and so I'm unwilling to expend
the large amount of coding effort that would be needed to make the
planner behave this way.
        regards, tom lane


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

Re: [BUGS] Query planner skipping index depending on DISTINCTparameter order (2)

From
Tomas Vondra
Date:
On 09/19/2017 05:40 PM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> On 09/17/2017 07:15 PM, Tom Lane wrote:
>>> For the DISTINCT ON case, the user-visible semantics are actually pretty
>>> tightly tied to ORDER BY, so that it would not be very reasonable to
>>> consider any other orderings than the given column order anyway.
> 
>> Tied in what sense? In the docs we explicitly say this:
> 
>>   https://www.postgresql.org/docs/10/static/sql-select.html#sql-distinct
> 
>>   SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of
>>   each set of rows where the given expressions evaluate to equal. The
>>   DISTINCT ON expressions are interpreted using the same rules as for
>>   ORDER BY (see above). Note that the “first row” of each set is
>>   unpredictable unless ORDER BY is used to ensure that the desired row
>>   appears first.
> 
> Right, so the behavior is undefined unless you have an ORDER BY
> clause that includes the DISTINCT ON columns plus some more columns.
> That's pretty tightly tied in my book.
> 

Ah, OK. I thought you're suggesting we're required to produce the data
sorted by the DISTINCT ON columns. But you meant that ORDER BY clause is
required for well-defined result, which limits our options when picking
an index. Right?

>> So if an ORDER BY is not specified, why couldn't we pick an
>> arbitrary ordering matching based on available indexes?
> 
> The case is not of any real-world use, and so I'm unwilling to
> expend the large amount of coding effort that would be needed to make
> the planner behave this way.
> 

I don't think the "no real-world use" is so clear. Imagine for example a
denormalized table where the user knows that for a given ID, the other
columns match too. Then
   SELECT DISTINCT ON (id), x, y, z FROM t;   SELECT DISTINCT ON (id), x, y, z FROM t ORDER BY id, x, y, z;   SELECT
DISTINCTid, x, y, z, FROM t;
 

will all produce the same result, except that the first case only needs
compare values in "id". If sorts on "x", "y" and "z" are expensive (say,
because those are text columns with non-C collations), that may be quite
a difference.

That being said, I'm not sure how much code would this be, and how much
overhead would it mean.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> For plain DISTINCT, yeah we could consider other orderings ... but
> we're rather unlikely to find an index that matches all the output
> columns, regardless of what order they're in.  So it's just not that
> exciting.

I don't follow this part. Are you saying a simple "select distinct a,b
from table" is unlikely to find a matching index on <b,a>? Don't we
already do exactly this for "select a,b from table group by a,b"? I
would have expected the two equivalent SQL statements to use exactly
the same infrastructure and thought it was only a matter of historical
legacy that they didn't.

-- 
greg


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

[BUGS] Re: Query planner skipping index depending on DISTINCT parameterorder (2)

From
Дилян Палаузов
Date:
Hello,

for the record, in this table

spamassassin=> \d bayes_token           Table "public.bayes_token"  Column    Type            Modifiers
────────── ─────── ────────────────────────────
id         integer not null default 0
token      bytea   not null default '\x'::bytea
spam_count integer not null default 0
ham_count  integer not null default 0
atime      integer not null default 0
Indexes:    "bayes_token_pkey" PRIMARY KEY, btree (id, token)    "bayes_token_idx1" btree (token)


"SELECT id, token FROM bayes_token" and "SELECT id, token FROM bayes_token WHERE id > 0" do Seq Scan and need 35-50
Sec.

But "SELECT id, token FROM bayes_token WHERE id > 1" uses Index Only Scan and 0.04 seconds on pg 9.6.5 .

Greetings  Dilian

On 09/25/17 12:43, Greg Stark wrote:
> On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> For plain DISTINCT, yeah we could consider other orderings ... but
>> we're rather unlikely to find an index that matches all the output
>> columns, regardless of what order they're in.  So it's just not that
>> exciting.
> 
> I don't follow this part. Are you saying a simple "select distinct a,b
> from table" is unlikely to find a matching index on <b,a>? Don't we
> already do exactly this for "select a,b from table group by a,b"? I
> would have expected the two equivalent SQL statements to use exactly
> the same infrastructure and thought it was only a matter of historical
> legacy that they didn't.
> 


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

[BUGS] Re: Query planner skipping index depending on DISTINCT parameterorder (2)

From
Tomas Vondra
Date:

On 09/25/2017 08:10 PM, Дилян Палаузов wrote:
> Hello,
> 
> for the record, in this table
> 
> spamassassin=> \d bayes_token
>            Table "public.bayes_token"
>   Column    Type            Modifiers
> ────────── ─────── ────────────────────────────
> id         integer not null default 0
> token      bytea   not null default '\x'::bytea
> spam_count integer not null default 0
> ham_count  integer not null default 0
> atime      integer not null default 0
> Indexes:
>     "bayes_token_pkey" PRIMARY KEY, btree (id, token)
>     "bayes_token_idx1" btree (token)
> 
> 
> "SELECT id, token FROM bayes_token" and "SELECT id, token FROM
> bayes_token WHERE id > 0" do Seq Scan and need 35-50 Sec.
> 
> But "SELECT id, token FROM bayes_token WHERE id > 1" uses Index Only
> Scan and 0.04 seconds on pg 9.6.5 .
> 

I don't quite see how this is related to the original topic in this
thread, which was discussing DISTINCT ON vs. indexes.

But more importantly, it's really difficult to give you answers without
you showing us EXPLAIN (and ideally EXPLAIN ANALYZE) for the queries.

FWIW, my guess is that there are many rows with (id > 0), and scanning
them using index only scan would be expensive. While there are only very
few rows with (id > 1), so the database uses IOS.

You can try pushing the database towards IOS by disabling sequential and
bitmap scans.
set enable_seqscan = off;set enable_bitmapscan = off;


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs


On 09/25/2017 12:43 PM, Greg Stark wrote:
> On 17 September 2017 at 18:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 
>> For plain DISTINCT, yeah we could consider other orderings ... but
>> we're rather unlikely to find an index that matches all the output
>> columns, regardless of what order they're in.  So it's just not that
>> exciting.
> 
> I don't follow this part. Are you saying a simple "select distinct a,b
> from table" is unlikely to find a matching index on <b,a>? Don't we
> already do exactly this for "select a,b from table group by a,b"?

FWIW we don't do that:
   create table t (a int, b int);   insert into t select mod(i, 1000), mod(i, 1000)                  from
generate_series(1,1000000)S(i);   create index on t (a,b);   vacuum analyze t;
 
   explain (costs off) select a, b, count(*) from t group by a , b;                    QUERY PLAN
--------------------------------------------   GroupAggregate      Group Key: a, b      ->  Index Only Scan using
t_a_b_idxon t   (3 rows)
 
   explain (costs off) select a, b, count(*) from t group by b , a;           QUERY PLAN   ---------------------------
 GroupAggregate      Group Key: b, a      ->  Sort            Sort Key: b, a            ->  Seq Scan on t   (5 rows)
 

But yeah, fixing this is probably more worthwhile than the original
DISTINCT ON issue ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs