Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT - Mailing list pgsql-hackers

From Jon Nelson
Subject Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Msg-id CAKuK5J07k3rEWq6QT0_i7pTT3OSBK9ReQwQfi5LXNp8dmeokEQ@mail.gmail.com
In response to Re: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Jan 22, 2014 at 3:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeremy Harris <jgh@wizmail.org> writes:
>> On 22/01/14 03:53, Tom Lane wrote:
>>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>>> - in createplan.c, eliding duplicate tuples is enabled if we are
>>>> creating a unique plan which involves sorting first
>
>>> [ raised eyebrow ... ]  And what happens if the planner drops the
>>> unique step and then the sort doesn't actually go to disk?
>
>> I don't think Jon was suggesting that the planner drop the unique step.
>
> Hm, OK, maybe I misread what he said there.  Still, if we've told
> tuplesort to remove duplicates, why shouldn't we expect it to have
> done the job?  Passing the data through a useless Unique step is
> not especially cheap.

That's correct - I do not propose to drop the unique step. Duplicates
are only dropped if it's convenient to do so. In one case, it's a
zero-cost drop (no extra comparison is made). In most other cases, an
extra comparison is made, typically right before writing a tuple to
tape. If it compares as identical to the previously-written tuple,
it's thrown out instead of being written.

The output of the modified code is still sorted, still *might* (and in
most cases, probably will) contain duplicates, but will (probably)
contain fewer duplicates.
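
To illustrate the behaviour described above, here is a minimal sketch in
Python. It is not the patch itself (the real code lives in PostgreSQL's C
tuplesort), and the names write_run, external_sort_with_elision and unique
are hypothetical, chosen only for this example. It assumes each run is
sorted in memory, that a tuple is compared against the previously written
one just before it goes to "tape", and that runs are then merged without
any further duplicate check.

```python
import heapq
from itertools import groupby


def write_run(sorted_tuples):
    """Write one sorted run to a 'tape' (a plain list here).

    A tuple that compares equal to the previously written tuple is
    thrown away instead of being written.
    """
    tape = []
    for tup in sorted_tuples:
        if tape and tape[-1] == tup:
            continue  # identical to the last tuple written: drop it
        tape.append(tup)
    return tape


def external_sort_with_elision(rows, run_size):
    """Sort rows in runs of run_size, eliding duplicates within each run.

    Duplicates that land in different runs survive the merge, so the
    result is sorted but not guaranteed to be duplicate-free.
    """
    runs = [
        write_run(sorted(rows[i:i + run_size]))
        for i in range(0, len(rows), run_size)
    ]
    return list(heapq.merge(*runs))


def unique(sorted_rows):
    """Stand-in for the Unique step: collapse adjacent equal rows."""
    return [key for key, _ in groupby(sorted_rows)]


rows = [(3,), (1,), (3,), (3,), (2,), (1,), (2,), (2,)]
merged = external_sort_with_elision(rows, run_size=4)
deduplicated = unique(merged)
```

With this input the merged output is shorter than the input and still
sorted, but (1,) appears twice because it occurred in both runs. That is
why the unique step has to stay in the plan.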

-- 
Jon


