Thread: Question about sorting internals
Hi, before I'll go any further - this is only thought-experiment. I do not plan to use such queries in real-life applications. I was just presented with a question that I can't answer in any logical way. There are two simple queries: #v+ with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6)) ,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6)) SELECT distinct on (miesiac) * FROM ( SELECT miesiac, 2005 as rok, wynik FROM rok2005 union all SELECT miesiac, 2004 as rok, wynik FROM rok2004 ) as polaczone ORDER BY miesiac, wynik desc; #v- #v+ with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6)) ,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6)) SELECT distinct on (miesiac) * FROM ( SELECT miesiac, 2004 as rok, wynik FROM rok2004 union all SELECT miesiac, 2005 as rok, wynik FROM rok2005 ) as polaczone ORDER BY miesiac, wynik desc; #v- They differ only in order of queries in union all part. The thing is that they return the same result. Why isn't one of them returning "2005" for 6th "miesiac"? I know I'm not sorting using "rok", which means I'm getting "undefined functionality". Fine. But what exactly is happening that regardless of order of rows in subquery, I get the same, always lower, rok in output? Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/
Hi deepesz,
You might want to see their EXPLAIN VERBOSE outputs. Having one of them (2004 one) lesser number of rows, might be getting picked up as first relation being union and thus ends up having it's rows before the second one. Explain output would make it more clear. Also, try having same number of rows in both the relations.On Wed, Dec 11, 2013 at 3:26 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote:
Hi,
before I'll go any further - this is only thought-experiment. I do not
plan to use such queries in real-life applications. I was just presented
with a question that I can't answer in any logical way.
There are two simple queries:
#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6))
,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6))
SELECT
distinct on (miesiac) *
FROM (
SELECT miesiac, 2005 as rok, wynik FROM rok2005
union all
SELECT miesiac, 2004 as rok, wynik FROM rok2004
) as polaczone
ORDER BY miesiac, wynik desc;
#v-
#v+
with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6))
,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6))
SELECT
distinct on (miesiac) *
FROM (
SELECT miesiac, 2004 as rok, wynik FROM rok2004
union all
SELECT miesiac, 2005 as rok, wynik FROM rok2005
) as polaczone
ORDER BY miesiac, wynik desc;
#v-
They differ only in order of queries in union all part.
The thing is that they return the same result. Why isn't one of them returning
"2005" for 6th "miesiac"?
I know I'm not sorting using "rok", which means I'm getting "undefined
functionality". Fine. But what exactly is happening that regardless of
order of rows in subquery, I get the same, always lower, rok in output?
Best regards,
depesz
--
The best thing about modern society is how easy it is to avoid contact with it.
http://depesz.com/
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Wed, Dec 11, 2013 at 03:34:38PM +0530, Ashutosh Bapat wrote: > Hi deepesz, > You might want to see their EXPLAIN VERBOSE outputs. Having one of them > (2004 one) lesser number of rows, might be getting picked up as first > relation being union and thus ends up having it's rows before the second > one. Explain output would make it more clear. Also, try having same number > of rows in both the relations. Explains: QUERY PLAN --------------------------------------------------------------------------------------------------------------------Unique (cost=0.44..0.48 rows=9 width=12) (actual time=0.030..0.035 rows=6 loops=1) Output: rok2004.miesiac, (2004), rok2004.wynik CTE rok2004 -> Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=8) (actual time=0.001..0.003 rows=5loops=1) Output: "*VALUES*".column1, "*VALUES*".column2 CTE rok2005 -> Values Scan on "*VALUES*_1" (cost=0.00..0.05rows=4 width=8) (actual time=0.000..0.001 rows=4 loops=1) Output: "*VALUES*_1".column1, "*VALUES*_1".column2 -> Sort (cost=0.32..0.35 rows=9 width=12) (actual time=0.029..0.031 rows=9 loops=1) Output:rok2004.miesiac, (2004), rok2004.wynik Sort Key: rok2004.miesiac, rok2004.wynik Sort Method: quicksort Memory: 25kB -> Append (cost=0.00..0.18 rows=9 width=12) (actual time=0.007..0.018 rows=9 loops=1) -> CTE Scan on rok2004 (cost=0.00..0.10 rows=5 width=12) (actual time=0.006..0.011 rows=5 loops=1) Output: rok2004.miesiac, 2004, rok2004.wynik -> CTE Scan on rok2005 (cost=0.00..0.08 rows=4 width=12)(actual time=0.002..0.004 rows=4 loops=1) Output: rok2005.miesiac, 2005, rok2005.wynikTotal runtime:0.077 ms (18 rows) QUERY PLAN --------------------------------------------------------------------------------------------------------------------Unique (cost=0.44..0.48 rows=9 width=12) (actual time=0.024..0.027 rows=6 loops=1) Output: rok2005.miesiac, (2005), rok2005.wynik CTE rok2004 -> Values Scan on "*VALUES*" (cost=0.00..0.06 rows=5 width=8) (actual time=0.001..0.003 rows=5loops=1) Output: "*VALUES*".column1, "*VALUES*".column2 CTE rok2005 -> Values Scan on "*VALUES*_1" (cost=0.00..0.05rows=4 width=8) (actual time=0.001..0.003 rows=4 loops=1) Output: "*VALUES*_1".column1, "*VALUES*_1".column2 -> Sort (cost=0.32..0.35 rows=9 width=12) (actual time=0.023..0.024 rows=9 loops=1) Output:rok2005.miesiac, (2005), rok2005.wynik Sort Key: rok2005.miesiac, rok2005.wynik Sort Method: quicksort Memory: 25kB -> Append (cost=0.00..0.18 rows=9 width=12) (actual time=0.004..0.015 rows=9 loops=1) -> CTE Scan on rok2005 (cost=0.00..0.08 rows=4 width=12) (actual time=0.003..0.006 rows=4 loops=1) Output: rok2005.miesiac, 2005, rok2005.wynik -> CTE Scan on rok2004 (cost=0.00..0.10 rows=5 width=12)(actual time=0.001..0.006 rows=5 loops=1) Output: rok2004.miesiac, 2004, rok2004.wynikTotal runtime:0.053 ms (18 rows) So, it looks like rowcount is the one thing that's different. Not entirely sure how the logic would be to make rowcount differ. After some more talk on #postgresql, it looks like I will have to spend some time with debugger to see what's happening there. Best regards, depesz
On Wed, Dec 11, 2013 at 4:56 AM, hubert depesz lubaczewski <depesz@depesz.com> wrote: > before I'll go any further - this is only thought-experiment. I do not > plan to use such queries in real-life applications. I was just presented > with a question that I can't answer in any logical way. > > There are two simple queries: > > #v+ > with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6)) > ,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6)) > SELECT > distinct on (miesiac) * > FROM ( > SELECT miesiac, 2005 as rok, wynik FROM rok2005 > union all > SELECT miesiac, 2004 as rok, wynik FROM rok2004 > ) as polaczone > ORDER BY miesiac, wynik desc; > #v- > > #v+ > with rok2005 (miesiac,wynik) as (VALUES (1,1),(2,2) ,(4,4),(5,NULL),(6,6)) > ,rok2004 (miesiac,wynik) as (VALUES (1,3) ,(3,3),(4,5) ,(6,6)) > SELECT > distinct on (miesiac) * > FROM ( > SELECT miesiac, 2004 as rok, wynik FROM rok2004 > union all > SELECT miesiac, 2005 as rok, wynik FROM rok2005 > ) as polaczone > ORDER BY miesiac, wynik desc; > #v- > > They differ only in order of queries in union all part. > > The thing is that they return the same result. Why isn't one of them returning > "2005" for 6th "miesiac"? The query planner sees that in order for the output ordering to match the ORDER BY clause, it's got to sort by miesiac, wynik desc. The DISTINCT ON clause can be implemented very cheaply after that - every time the value of miesiac changes, it emits only the first of the rows with the new value. So it's a good plan. However, because the sort happens before the unique step, the results you get are dependent on what order the sort happens to emit the rows. Our sort algorithms are not stable, so there's no particular guarantee about the order in which rows will pop out, beyond the fact that they must obey the sort key. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
hubert depesz lubaczewski <depesz@depesz.com> writes: > There are two simple queries: ... > They differ only in order of queries in union all part. > The thing is that they return the same result. Why isn't one of them returning > "2005" for 6th "miesiac"? With such a small amount of data, you're getting an in-memory quicksort, and a well-known property of quicksort is that it isn't stable --- that is, there are no guarantees about the order in which it will return items that have equal keys. In this case it's evidently making different partitioning choices, as a consequence of the different arrival order of the rows, that just by chance end up with the 6/2004/6 row being returned before the 6/2005/6 row in both cases. You could trace through the logic and see exactly how that's happening, but I doubt it'd be a very edifying exercise. If you want to get well-defined results with DISTINCT ON, you should make the ORDER BY sort by a candidate key. Anything less opens you to uncertainty about which rows the DISTINCT will select. regards, tom lane