Thread: ToDo: KNN Search should to support DISTINCT clasuse?

ToDo: KNN Search should to support DISTINCT clasuse?

From
Pavel Stehule
Date:
Hello

I should to search distinct values based on similarity

postgres=# explain select  nazobce, nazobce <-> 'Benešov' from obce
order by nazobce <-> 'Benešov' limit 10
;                                       QUERY PLAN
-------------------------------------------------------------------------------------------Limit  (cost=0.00..0.86
rows=10width=10)  ->  Index Scan using obce_nazobce_idx on obce  (cost=0.00..1433.14 
rows=16644 width=10)        Order By: (nazobce <-> 'Benešov'::text)
(3 rows)

Time: 0.576 ms

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10
;                                QUERY PLAN
-----------------------------------------------------------------------------Limit  (cost=600.45..600.47 rows=10
width=10) ->  Sort  (cost=600.45..613.80 rows=5341 width=10)        Sort Key: ((nazobce <-> 'Benešov'::text))        ->
HashAggregate  (cost=418.27..485.03 rows=5341 width=10)              ->  Seq Scan on obce  (cost=0.00..335.05
rows=16644width=10) 
(5 rows)

Regards

Pavel Stehule



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> but using DISTINCT breaks KNN searching optimization

> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
> obce order by nazobce <-> 'Benešov' limit 10

Don't hold your breath.  There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input.  sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful.  You could get a plan using the index if you only
wanted the <-> output column, eg

contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
                 QUERY PLAN                                      
 
-------------------------------------------------------------------------------------Limit  (cost=0.00..0.87 rows=10
width=12) ->  Unique  (cost=0.00..86.75 rows=1000 width=12)        ->  Index Scan using ti on test_trgm
(cost=0.00..84.25rows=1000 width=12)              Order By: (t <-> 'foo'::text)
 
(4 rows)

Perhaps it would be close enough to what you want to use DISTINCT ON:

contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit
10;                                   QUERY PLAN                                      
 
-------------------------------------------------------------------------------------Limit  (cost=0.00..0.87 rows=10
width=12) ->  Unique  (cost=0.00..86.75 rows=1000 width=12)        ->  Index Scan using ti on test_trgm
(cost=0.00..84.25rows=1000 width=12)              Order By: (t <-> 'foo'::text)
 
(4 rows)
        regards, tom lane



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Pavel Stehule
Date:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> but using DISTINCT breaks KNN searching optimization
>
>> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
>> obce order by nazobce <-> 'Benešov' limit 10
>
> Don't hold your breath.  There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input.  sort and uniq requires the input to be sorted
> by *all* the columns being distinct'ed, not just one, so again this
> index isn't useful.  You could get a plan using the index if you only
> wanted the <-> output column, eg
>
> contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
>                                      QUERY PLAN
> -------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.87 rows=10 width=12)
>    ->  Unique  (cost=0.00..86.75 rows=1000 width=12)
>          ->  Index Scan using ti on test_trgm  (cost=0.00..84.25 rows=1000 width=12)
>                Order By: (t <-> 'foo'::text)
> (4 rows)
>
> Perhaps it would be close enough to what you want to use DISTINCT ON:
>
> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit
10;
>                                      QUERY PLAN
> -------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.87 rows=10 width=12)
>    ->  Unique  (cost=0.00..86.75 rows=1000 width=12)
>          ->  Index Scan using ti on test_trgm  (cost=0.00..84.25 rows=1000 width=12)
>                Order By: (t <-> 'foo'::text)
> (4 rows)
>
>                         regards, tom lane

good tip - it's working

thank you

Regards

Pavel



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Robert Haas
Date:
On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Pavel Stehule <pavel.stehule@gmail.com> writes:
>> but using DISTINCT breaks KNN searching optimization
>
>> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
>> obce order by nazobce <-> 'Benešov' limit 10
>
> Don't hold your breath.  There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input.

Isn't that an implementation limitation though, rather than a
fundamental limitation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Don't hold your breath.  There are two ways the system could implement
>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>> hashaggregate will destroy any input ordering, so there's no value in
>> using the index as input.

> Isn't that an implementation limitation though, rather than a
> fundamental limitation?

Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.
        regards, tom lane



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Pavel Stehule
Date:
2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Don't hold your breath.  There are two ways the system could implement
>>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>>> hashaggregate will destroy any input ordering, so there's no value in
>>> using the index as input.
>
>> Isn't that an implementation limitation though, rather than a
>> fundamental limitation?
>
> Perhaps, but it's not a simple one to surmount, and I'm dubious about
> putting the amount of work that'd be required into such a corner case.

I don't think so this use case is too special - but workaround working well

Regards

Pavel
>
>                         regards, tom lane



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Greg Stark
Date:
On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Don't hold your breath.  There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input.  sort and uniq requires the input to be sorted
> by *all* the columns being distinct'ed, not just one, so again this
> index isn't useful.

We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).

Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.


-- 
greg



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Robert Haas
Date:
On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Don't hold your breath.  There are two ways the system could implement
>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>> hashaggregate will destroy any input ordering, so there's no value in
>> using the index as input.  sort and uniq requires the input to be sorted
>> by *all* the columns being distinct'ed, not just one, so again this
>> index isn't useful.
>
> We already have some bits that understand functional dependencies for
> distinct/group by don't we? Do we detect that col <-> 'foo' is
> functionally dependent on col? If so is it possible to construct a
> multicolumn index that can produce an ordering like [col <-> 'foo',
> col] which could be used to get distinct values of col in the knn
> order (since the first column is functionally dependent on the second
> it can be ignored for grouping purposes).
>
> Not that we can do this now but I wonder whether a lot of the pieces
> are already there and just need to be hooked up together.

I think the real issue is that a hash aggregate doesn't emit any rows
until the entire input is read, so it doesn't play nice with LIMIT.
In general there's no other possible strategy, because you could get
another row belonging to an existing group at any time right up
through the end of the input.  However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again.  I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Ants Aasma
Date:
On Tue, Oct 23, 2012 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> through the end of the input.  However, when the HashAgg node is only
> implementing DISTINCT (ON), you can emit each new row as soon as you
> see it, and just make the hash table entry to be certain you don't
> emit it again.  I think someone recently submitted a patch along these
> lines and we rejected it because the use case was too thin ... but
> this example is making me think that the use case might not be all
> that thin after all.

If anyone wants to play around, the initial patch is available here:


http://archives.postgresql.org/message-id/CA%2BCSw_uE-RCyQd_bXJNe%3DusrXkq%2BkeFrQrahkc%2B8ou%2BWs4Y%3DVw%40mail.gmail.com

It looks to me like it should work for the distinct on KNN search case
out of the box. However it needs some planner adjustment for more
complex plans and maybe some refactoring to lose duplication in the
executor to be worth considering committing.

Regards,
Ants Aasma



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
"Kevin Grittner"
Date:
Pavel Stehule wrote:
> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:

>> Perhaps it would be close enough to what you want to use DISTINCT ON:
>>
>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo'
limit10;
 

> good tip - it's working

If two or more values happen to be at exactly the same distance,
wouldn't you just get one of them?

-Kevin



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Tom Lane
Date:
"Kevin Grittner" <kgrittn@mail.com> writes:
> Pavel Stehule wrote:
>> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
>>> Perhaps it would be close enough to what you want to use DISTINCT ON:
>>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo'
limit10;
 

>> good tip - it's working

> If two or more values happen to be at exactly the same distance,
> wouldn't you just get one of them?

Yeah, that is a hazard.  I'm not sure whether <->'s results are
sufficiently quantized to make that a big problem in practice.
        regards, tom lane



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
"Kevin Grittner"
Date:
Tom Lane wrote:
> "Kevin Grittner" <kgrittn@mail.com> writes:
> > Pavel Stehule wrote:
> >> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
> >>> Perhaps it would be close enough to what you want to use DISTINCT ON:
> >>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo'
limit10;
 
> 
> >> good tip - it's working
> 
> > If two or more values happen to be at exactly the same distance,
> > wouldn't you just get one of them?
> 
> Yeah, that is a hazard. I'm not sure whether <->'s results are
> sufficiently quantized to make that a big problem in practice.

It doesn't seem too far-fetched for trigram queries:

test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);   nm    | ?column? 
----------+----------anderson |      0.4andersen |      0.4andersly |      0.4
(3 rows)

test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly'))
x(nm)order by nm <-> 'anders' limit 3;   nm    | ?column? 
 
----------+----------anderson |      0.4
(1 row)

-Kevin



Re: ToDo: KNN Search should to support DISTINCT clasuse?

From
Pavel Stehule
Date:
2012/10/25 Kevin Grittner <kgrittn@mail.com>:
> Tom Lane wrote:
>> "Kevin Grittner" <kgrittn@mail.com> writes:
>> > Pavel Stehule wrote:
>> >> 2012/10/22 Tom Lane <tgl@sss.pgh.pa.us>:
>> >>> Perhaps it would be close enough to what you want to use DISTINCT ON:
>> >>> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo'
limit10;
 
>>
>> >> good tip - it's working
>>
>> > If two or more values happen to be at exactly the same distance,
>> > wouldn't you just get one of them?
>>
>> Yeah, that is a hazard. I'm not sure whether <->'s results are
>> sufficiently quantized to make that a big problem in practice.
>
> It doesn't seem too far-fetched for trigram queries:
>
> test=# select nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly')) x(nm);
>     nm    | ?column?
> ----------+----------
>  anderson |      0.4
>  andersen |      0.4
>  andersly |      0.4
> (3 rows)
>
> test=# select distinct on (nm <-> 'anders') nm, nm <-> 'anders' from (values ('anderson'),('andersen'),('andersly'))
x(nm)order by nm <-> 'anders' limit 3;
 
>     nm    | ?column?
> ----------+----------
>  anderson |      0.4
> (1 row)

yes it is issue - but I am thinking about simple "fuzzy" searching, so
exact result is not strongly expected. On second hand if SELECT
DISTINCT * will be supported it should be nice.

Pavel

>
> -Kevin