Thread: Question about sorting internals

Question about sorting internals

From
hubert depesz lubaczewski
Date:
Hi,

before I'll go any further - this is only thought-experiment. I do not
plan to use such queries in real-life applications. I was just presented
with a question that I can't answer in any logical way.

There are two simple queries:

#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))   ,rok2004 (miesiac,wynik) as (VALUES
(1,3)     ,(3,3),(4,5)         ,(6,6)) 
SELECT
distinct on (miesiac) *
FROM (   SELECT miesiac, 2005 as rok, wynik FROM rok2005   union all   SELECT miesiac, 2004 as rok, wynik FROM rok2004
) as polaczone
ORDER BY miesiac, wynik desc;
#v-

#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))   ,rok2004 (miesiac,wynik) as (VALUES
(1,3)     ,(3,3),(4,5)         ,(6,6)) 
SELECT
distinct on (miesiac) *
FROM (   SELECT miesiac, 2004 as rok, wynik FROM rok2004   union all   SELECT miesiac, 2005 as rok, wynik FROM rok2005
) as polaczone
ORDER BY miesiac, wynik desc;
#v-

They differ only in order of queries in union all part.

The thing is that they return the same result. Why isn't one of them returning
"2005" for 6th "miesiac"?

I know I'm not sorting using "rok", which means I'm getting "undefined
functionality". Fine. But what exactly is happening that regardless of
order of rows in subquery, I get the same, always lower, rok in output?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                  http://depesz.com/ 

Re: Question about sorting internals

From
Ashutosh Bapat
Date:
Hi deepesz,
You might want to see their EXPLAIN VERBOSE outputs. Having one of them (2004 one) lesser number of rows, might be getting picked up as first relation being union and thus ends up having it's rows before the second one. Explain output would make it more clear. Also, try having same number of rows in both the relations.


On Wed, Dec 11, 2013 at 3:26 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
Hi,

before I'll go any further - this is only thought-experiment. I do not
plan to use such queries in real-life applications. I was just presented
with a question that I can't answer in any logical way.

There are two simple queries:

#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))
    ,rok2004 (miesiac,wynik) as (VALUES (1,3)      ,(3,3),(4,5)         ,(6,6))
SELECT
distinct on (miesiac) *
FROM (
    SELECT miesiac, 2005 as rok, wynik FROM rok2005
    union all
    SELECT miesiac, 2004 as rok, wynik FROM rok2004
) as polaczone
ORDER BY miesiac, wynik desc;
#v-

#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))
    ,rok2004 (miesiac,wynik) as (VALUES (1,3)      ,(3,3),(4,5)         ,(6,6))
SELECT
distinct on (miesiac) *
FROM (
    SELECT miesiac, 2004 as rok, wynik FROM rok2004
    union all
    SELECT miesiac, 2005 as rok, wynik FROM rok2005
) as polaczone
ORDER BY miesiac, wynik desc;
#v-

They differ only in order of queries in union all part.

The thing is that they return the same result. Why isn't one of them returning
"2005" for 6th "miesiac"?

I know I'm not sorting using "rok", which means I'm getting "undefined
functionality". Fine. But what exactly is happening that regardless of
order of rows in subquery, I get the same, always lower, rok in output?

Best regards,

depesz

--
The best thing about modern society is how easy it is to avoid contact with it.
                                                             http://depesz.com/



--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Question about sorting internals

From
hubert depesz lubaczewski
Date:
On Wed, Dec 11, 2013 at 03:34:38PM +0530, Ashutosh Bapat wrote:
> Hi deepesz,
> You might want to see their EXPLAIN VERBOSE outputs. Having one of them
> (2004 one) lesser number of rows, might be getting picked up as first
> relation being union and thus ends up having it's rows before the second
> one. Explain output would make it more clear. Also, try having same number
> of rows in both the relations.

Explains:
                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------Unique
(cost=0.44..0.48 rows=9 width=12) (actual time=0.030..0.035 rows=6 loops=1)  Output: rok2004.miesiac, (2004),
rok2004.wynik CTE rok2004    ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=8) (actual time=0.001..0.003
rows=5loops=1)          Output: "*VALUES*".column1, "*VALUES*".column2  CTE rok2005    ->  Values Scan on "*VALUES*_1"
(cost=0.00..0.05rows=4 width=8) (actual time=0.000..0.001 rows=4 loops=1)          Output: "*VALUES*_1".column1,
"*VALUES*_1".column2 ->  Sort  (cost=0.32..0.35 rows=9 width=12) (actual time=0.029..0.031 rows=9 loops=1)
Output:rok2004.miesiac, (2004), rok2004.wynik        Sort Key: rok2004.miesiac, rok2004.wynik        Sort Method:
quicksort Memory: 25kB        ->  Append  (cost=0.00..0.18 rows=9 width=12) (actual time=0.007..0.018 rows=9 loops=1)
          ->  CTE Scan on rok2004  (cost=0.00..0.10 rows=5 width=12) (actual time=0.006..0.011 rows=5 loops=1)
         Output: rok2004.miesiac, 2004, rok2004.wynik              ->  CTE Scan on rok2005  (cost=0.00..0.08 rows=4
width=12)(actual time=0.002..0.004 rows=4 loops=1)                    Output: rok2005.miesiac, 2005, rok2005.wynikTotal
runtime:0.077 ms
 
(18 rows)
                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------Unique
(cost=0.44..0.48 rows=9 width=12) (actual time=0.024..0.027 rows=6 loops=1)  Output: rok2005.miesiac, (2005),
rok2005.wynik CTE rok2004    ->  Values Scan on "*VALUES*"  (cost=0.00..0.06 rows=5 width=8) (actual time=0.001..0.003
rows=5loops=1)          Output: "*VALUES*".column1, "*VALUES*".column2  CTE rok2005    ->  Values Scan on "*VALUES*_1"
(cost=0.00..0.05rows=4 width=8) (actual time=0.001..0.003 rows=4 loops=1)          Output: "*VALUES*_1".column1,
"*VALUES*_1".column2 ->  Sort  (cost=0.32..0.35 rows=9 width=12) (actual time=0.023..0.024 rows=9 loops=1)
Output:rok2005.miesiac, (2005), rok2005.wynik        Sort Key: rok2005.miesiac, rok2005.wynik        Sort Method:
quicksort Memory: 25kB        ->  Append  (cost=0.00..0.18 rows=9 width=12) (actual time=0.004..0.015 rows=9 loops=1)
          ->  CTE Scan on rok2005  (cost=0.00..0.08 rows=4 width=12) (actual time=0.003..0.006 rows=4 loops=1)
         Output: rok2005.miesiac, 2005, rok2005.wynik              ->  CTE Scan on rok2004  (cost=0.00..0.10 rows=5
width=12)(actual time=0.001..0.006 rows=5 loops=1)                    Output: rok2004.miesiac, 2004, rok2004.wynikTotal
runtime:0.053 ms
 
(18 rows)

So, it looks like rowcount is the one thing that's different. Not
entirely sure how the logic would be to make rowcount differ.

After some more talk on #postgresql, it looks like I will have to spend
some time with debugger to see what's happening there.

Best regards,

depesz


Re: Question about sorting internals

From
Robert Haas
Date:
On Wed, Dec 11, 2013 at 4:56 AM, hubert depesz lubaczewski
<depesz@depesz.com> wrote:
> before I'll go any further - this is only thought-experiment. I do not
> plan to use such queries in real-life applications. I was just presented
> with a question that I can't answer in any logical way.
>
> There are two simple queries:
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))
>     ,rok2004 (miesiac,wynik) as (VALUES (1,3)      ,(3,3),(4,5)         ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
>     SELECT miesiac, 2005 as rok, wynik FROM rok2005
>     union all
>     SELECT miesiac, 2004 as rok, wynik FROM rok2004
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> #v+
> with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2)      ,(4,4),(5,NULL),(6,6))
>     ,rok2004 (miesiac,wynik) as (VALUES (1,3)      ,(3,3),(4,5)         ,(6,6))
> SELECT
> distinct on (miesiac) *
> FROM (
>     SELECT miesiac, 2004 as rok, wynik FROM rok2004
>     union all
>     SELECT miesiac, 2005 as rok, wynik FROM rok2005
> ) as polaczone
> ORDER BY miesiac, wynik desc;
> #v-
>
> They differ only in order of queries in union all part.
>
> The thing is that they return the same result. Why isn't one of them returning
> "2005" for 6th "miesiac"?

The query planner sees that in order for the output ordering to match
the ORDER BY clause, it's got to sort by miesiac, wynik desc.  The
DISTINCT ON clause can be implemented very cheaply after that - every
time the value of miesiac changes, it emits only the first of the rows
with the new value.  So it's a good plan.  However, because the sort
happens before the unique step, the results you get are dependent on
what order the sort happens to emit the rows.  Our sort algorithms are
not stable, so there's no particular guarantee about the order in
which rows will pop out, beyond the fact that they must obey the sort
key.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Question about sorting internals

From
Tom Lane
Date:
hubert depesz lubaczewski <depesz@depesz.com> writes:
> There are two simple queries: ...
> They differ only in order of queries in union all part.
> The thing is that they return the same result. Why isn't one of them returning
> "2005" for 6th "miesiac"?

With such a small amount of data, you're getting an in-memory quicksort,
and a well-known property of quicksort is that it isn't stable --- that
is, there are no guarantees about the order in which it will return items
that have equal keys.  In this case it's evidently making different
partitioning choices, as a consequence of the different arrival order of
the rows, that just by chance end up with the 6/2004/6 row being returned
before the 6/2005/6 row in both cases.  You could trace through the logic
and see exactly how that's happening, but I doubt it'd be a very edifying
exercise.

If you want to get well-defined results with DISTINCT ON, you should
make the ORDER BY sort by a candidate key.  Anything less opens you to
uncertainty about which rows the DISTINCT will select.
        regards, tom lane