Thread: a JOIN on same table, but 'slided over'

a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
Hi,

I understand, that this is 'general SQL' question rather then 'general
postgres'. But may be someone here could help me with it anyways.

I have a *single* table:

CREATE TABLE test (id int not null unique, thread int not null, info
text);

The ID, although unique, is not continues. A sample query:
----------------------------------------
SELECT * from test;
 id | thread | info
----+--------+------
  2 |    763 | A
  3 |    764 | B
  6 |      5 | C
  8 |  88946 | Cats
  9 |  69315 | Eifel
 10 |  96379 | G
 14 |  23927 | test 1
 16 |  16529 | test 2
 17 |    634 | test 3
 20 |  63930 | batman
(10 rows)
-----------------------------------------

Now, I'd like to make a JOIN-ed query of that table with itself, so that
I'd get rows paiwise: every row containing data from *two* rows of the
original TEST table so, that those data come from rows of consequtive
ID's - not neceserly (depending on the TEST table contents) continuesly
consequtive. Like:

SELECT * from view_of_test;
 id | id+X | thread | thread+X | info  | info+X
----+------+--------+----------+-------+---------
  2 |    3 |    763 |      764 | A     | B
  3 |    6 |    764 |        5 | B     | C
  6 |    8 |      5 |    88946 | C     | Cats
  8 |    9 |  88946 |    69315 | Cats  | Eifel
  9 |   10 |  69315 |    96379 | Eifel | G
-------------------------------------------------
Is there an SQL construct to get it?

I'd apreciate any hints or sugestions.

-R

Re: a JOIN on same table, but 'slided over'

From
"hubert depesz lubaczewski"
Date:
On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
Is there an SQL construct to get it?

select
    distinct on (t1.id)
    t1.*, t2.*
from
    test t1
    join test t2 on t2.id > t1.id
order by t1.id asc, t2.id asc

should do the trick.

depesz

--
http://www.depesz.com/ - nowy, lepszy depesz

Re: a JOIN on same table, but 'slided over'

From
PFC
Date:
> Now, I'd like to make a JOIN-ed query of that table with itself, so that
> I'd get rows paiwise: every row containing data from *two* rows of the
> original TEST table so, that those data come from rows of consequtive
> ID's - not neceserly (depending on the TEST table contents) continuesly
> consequtive. Like:
>
> SELECT * from view_of_test;
>  id | id+X | thread | thread+X | info  | info+X
> ----+------+--------+----------+-------+---------
>   2 |    3 |    763 |      764 | A     | B
>   3 |    6 |    764 |        5 | B     | C
>   6 |    8 |      5 |    88946 | C     | Cats
>   8 |    9 |  88946 |    69315 | Cats  | Eifel
>   9 |   10 |  69315 |    96379 | Eifel | G
> -------------------------------------------------
> Is there an SQL construct to get it?

    I would use a plpgsql procedure, select all the rows ORDER BY id, keep
the current and last row in a variable, and that's it.

Re: a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
Marvelous! Thenx!

-R

On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski wrote:
> On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         Is there an SQL construct to get it?
>
> select
>     distinct on (t1.id)
>     t1.*, t2.*
> from
>     test t1
>     join test t2 on t2.id > t1.id
> order by t1.id asc, t2.id asc
>
> should do the trick.
>
> depesz
>
> --
> http://www.depesz.com/ - nowy, lepszy depesz

Re: a JOIN on same table, but 'slided over'

From
"Gurjeet Singh"
Date:
Hi Rafal,

    Just a note that this is not standard SQL... 'distinct on' is an extension to SQL provided by postgres.

Following query utilizes the standard SQL to get the same results:

select    t1.id as id, t2.id as "id+1",
        t1.thread as thread, t2.thread as "thread+1",
        t1.info as info, t2.info as "info+1"
from test as t1, test as t2
where t2.id = ( select min(id) from test as t3 where t3.id > t1.id);

HTH
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76"E - Hyderabad *
18°32'57.25"N  73°56'25.42 "E - Pune

Sent from my BlackLaptop device

On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
Marvelous! Thenx!

-R

On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski wrote:
> On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         Is there an SQL construct to get it?
>
> select
>     distinct on (t1.id)
>     t1.*, t2.*
> from
>     test t1
>     join test t2 on t2.id > t1.id
> order by t1.id asc, t2.id asc
>
> should do the trick.
>
> depesz
>
> --
> http://www.depesz.com/ - nowy, lepszy depesz

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org/

Re: a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
OK. Have tried this one.... looks like close to 6 times slower then the
'non-standard' phrase with 'distinct on'.

On the small dataset that I've included in my original post (ten rows of
data within TEST), I've run both queries through EXPLAIN ANALYSE, with
the following result summary (for clearity, I've cut away the details
from EXPLAIN output):

-----------STANDARD
 Total runtime: 10.660 ms
-----------DISTINCT-ON
 Total runtime: 1.479 ms
-----------

Would there be ways to optimise the standard query to get the
performance closer to the none-standard one?


-R


On Tue, 2007-06-26 at 18:05 +0530, Gurjeet Singh wrote:
> Hi Rafal,
>
>     Just a note that this is not standard SQL... 'distinct on' is an
> extension to SQL provided by postgres.
>
> Following query utilizes the standard SQL to get the same results:
>
> select    t1.id as id, t2.id as "id+1",
>         t1.thread as thread, t2.thread as "thread+1",
>         t1.info as info, t2.info as "info+1"
> from test as t1, test as t2
> where t2.id = ( select min(id) from test as t3 where t3.id > t1.id);
>
> HTH
> --
> gurjeet[.singh]@EnterpriseDB.com
> singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
> 17°29'34.37"N  78°30'59.76"E - Hyderabad *
> 18°32'57.25"N  73°56'25.42 "E - Pune
>
> Sent from my BlackLaptop device
>
> On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         Marvelous! Thenx!
>
>         -R
>
>         On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski
>         wrote:
>         > On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         >         Is there an SQL construct to get it?
>         >
>         > select
>         >     distinct on (t1.id)
>         >     t1.*, t2.*
>         > from
>         >     test t1
>         >     join test t2 on t2.id > t1.id
>         > order by t1.id asc, t2.id asc
>         >
>         > should do the trick.
>         >
>         > depesz
>         >
>         > --
>         > http://www.depesz.com/ - nowy, lepszy depesz
>
>         ---------------------------(end of
>         broadcast)---------------------------
>         TIP 4: Have you searched our list archives?
>
>                        http://archives.postgresql.org/
>

Re: a JOIN on same table, but 'slided over'

From
"Gurjeet Singh"
Date:
I missed the ORDER BY clause... Here it goes:

select    t1.id as id, t2.id as "id+1",
        t1.thread as thread, t2.thread as "thread+1",
        t1.info as info, t2.info as "info+1"
from test as t1, test as t2
where t2.id = ( select min(id) from test as t3 where t3.id > t1.id )
order by t1.id asc;

Also note that this query is much cheaper that the 'distinct on' query by more than two orders on magnitude ( 217.86 vs. 98040.67):

postgres=# explain
postgres-# select
postgres-#     distinct on (t1.id)
postgres-#     t1.*, t2.*
postgres-# from
postgres-#     test t1
postgres-#     join test t2 on t2.id > t1.id
postgres-# order by t1.id asc, t2.id asc;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Unique  (cost=95798.00..98040.67 rows=1160 width=80)
   ->  Sort  (cost=95798.00..96919.33 rows=448533 width=80)
         Sort Key: t1.id, t2.id
         ->  Nested Loop  (cost=0.00..13827.29 rows=448533 width=80)
               ->  Seq Scan on test t1  (cost=0.00..21.60 rows=1160 width=40)
               ->  Index Scan using test_id_key on test t2  (cost=0.00..7.06 rows=387 width=40)
                     Index Cond: (t2.id > t1.id)
(7 rows)
Time: 5.003 ms
postgres=# explain
postgres-# select       t1.id as id, t2.id as "id+1",
postgres-#              t1.thread as thread, t2.thread as "thread+1",
postgres-#              t1.info as info, t2.info as "info+1"
postgres-# from test as t1, test as t2
postgres-# where t2.id = ( select min(id) from test as t3 where t3.id > t1.id )
postgres-# order by t1.id asc;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Sort  (cost=214.96..217.86 rows=1160 width=80)
   Sort Key: t1.id
   ->  Hash Join  (cost= 36.10..155.92 rows=1160 width=80)
         Hash Cond: ((subplan) = t2.id)
         ->  Seq Scan on test t1  (cost=0.00..21.60 rows=1160 width=40)
         ->  Hash  (cost=21.60..21.60 rows=1160 width=40)
               ->  Seq Scan on test t2  (cost=0.00..21.60 rows=1160 width=40)
         SubPlan
           ->  Result  (cost=0.13..0.14 rows=1 width=0)
                 InitPlan
                   ->  Limit  (cost= 0.00..0.13 rows=1 width=4)
                         ->  Index Scan using test_id_key on test t3  (cost=0.00..51.02 rows=387 width=4)
                               Index Cond: (id > $0)
                               Filter: (id IS NOT NULL)
(14 rows)
Time: 4.125 ms


Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76"E - Hyderabad *
18°32'57.25"N  73°56'25.42"E - Pune

Sent from my BlackLaptop device

On 6/26/07, Gurjeet Singh <singh.gurjeet@gmail.com > wrote:
Hi Rafal,

    Just a note that this is not standard SQL... 'distinct on' is an extension to SQL provided by postgres.

Following query utilizes the standard SQL to get the same results:

select    t1.id as id, t2.id as "id+1",
        t1.thread as thread, t2.thread as "thread+1",
        t1.info as info, t2.info as "info+1"
from test as t1, test as t2
where t2.id = ( select min(id) from test as t3 where t3.id > t1.id);

HTH
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76"E - Hyderabad *
18°32'57.25"N  73°56' 25.42 "E - Pune

Sent from my BlackLaptop device


On 6/26/07, Rafal Pietrak < rafal@zorro.isa-geek.com> wrote:
Marvelous! Thenx!

-R

On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski wrote:
> On 6/26/07, Rafal Pietrak < rafal@zorro.isa-geek.com> wrote:
>         Is there an SQL construct to get it?
>
> select
>     distinct on (t1.id)
>     t1.*, t2.*
> from
>     test t1
>     join test t2 on t2.id > t1.id
> order by t1.id asc, t2.id asc
>
> should do the trick.
>
> depesz
>
> --
> http://www.depesz.com/ - nowy, lepszy depesz

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org/

Re: a JOIN on same table, but 'slided over'

From
"news.gmane.org"
Date:
Gurjeet Singh skrev:
> I missed the ORDER BY clause... Here it goes:
>
> select    t1.id <http://t1.id> as id, t2.id <http://t2.id> as "id+1",
>         t1.thread as thread, t2.thread as "thread+1",
>         t1.info <http://t1.info> as info, t2.info <http://t2.info> as
> "info+1"
> from test as t1, test as t2
> where t2.id <http://t2.id> = ( select min(id) from test as t3 where
> t3.id <http://t3.id> > t1.id <http://t1.id> )
> order by t1.id <http://t1.id> asc;
>
> Also note that this query is much cheaper that the 'distinct on' query
> by more than two orders on magnitude ( 217.86 vs. 98040.67):

No it isn't. The estimate is much lower, but the actual times are very
close:

[explain of distinct on]

> Time: 5.003 ms

[explain of correlated subquery]

> Time: 4.125 ms

I tried on a larger table (16384 rows), and in this case the numbers are
strongly in favor of  the subquery. In fact, I am still waiting for the
"distinct on" version to return ...

/Nis

Re: a JOIN on same table, but 'slided over'

From
Tom Lane
Date:
"news.gmane.org" <nis@superlativ.dk> writes:
> Gurjeet Singh skrev:
>> Also note that this query is much cheaper that the 'distinct on' query
>> by more than two orders on magnitude ( 217.86 vs. 98040.67):

> No it isn't. The estimate is much lower, but the actual times are very
> close:

> [explain of distinct on]
>> Time: 5.003 ms

> [explain of correlated subquery]
>> Time: 4.125 ms

You're both confused: the planner estimate certainly should not be taken
as gospel, but the actual runtime of an EXPLAIN (not EXPLAIN ANALYZE)
only reflects planning effort.

EXPLAIN ANALYZE output would be a lot more suitable to settle the
question which one is faster.

            regards, tom lane

Re: a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
I see. (Have actually tried it on a larger dataset - to see it for
myself ... it is optimised :)

Thenx again!

-R


On Tue, 2007-06-26 at 19:56 +0530, Gurjeet Singh wrote:
>     It _is_ the optimised version.... as you can see from the explain
> plans posted in the other mail, the planner shows that the cost is
> drastically less than the 'distinct on' version.
>
>     For smaller data-sets 'distinct-on' version might seem faster, but
> for reasonably larger datasets, it's performance deteriorates
> exponentially... This is because of the Nested-loops involved in the
> plan...
>
>     I increased your data-set to 10240 rows by executing the following
> query 10 times:
>
> insert into test select id+(select max(id) from test), thread, info
> from test;
>
>     On such data-set (which is not very large by any means), the
> standard SQL version executes in almost a second, and on the other
> hand, I had to cancel the EXPLAIN ANALYZE of the 'distinct on' query
> after letting it run for over three minutes!!!
>
> postgres=# explain analyze
> postgres-# select       t1.id as id, t2.id as "id+1",
> postgres-#              t1.thread as thread, t2.thread as "thread+1",
> postgres-#              t1.info as info, t2.info as "info+1"
> postgres-# from test as t1, test as t2
> postgres-# where t2.id = ( select min(id) from test as t3 where t3.id
> > t1.id )
> postgres-# order by t1.id asc;
>
> QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

>  Sort  (cost=2971.36..2996.96 rows=10240 width=24) (actual
> time=1004.031..1030.116 rows=10239 loops=1)
>    Sort Key: t1.id
>    Sort Method:  external sort  Disk: 416kB
>    ->  Merge Join  (cost=840.48..2289.28 rows=10240 width=24) (actual
> time=834.218..956.595 rows=10239 loops=1)
>          Merge Cond: (t2.id = ((subplan)))
>          ->  Index Scan using test_id_key on test t2
> (cost=0.00..332.85 rows=10240 width=12) (actual time=0.060..24.503
> rows=10240 loops=1)
>          ->  Sort  (cost=840.48..866.08 rows=10240 width=12) (actual
> time=834.129..854.776 rows=10240 loops=1)
>                Sort Key: ((subplan))
>                Sort Method:  quicksort  Memory: 928kB
>                ->  Seq Scan on test t1  (cost=0.00..158.40 rows=10240
> width=12)(actual time=0.196..797.752 rows=10240 loops=1)
>                      SubPlan
>                        ->  Result  (cost= 0.04..0.05 rows=1 width=0)
> (actual time=0.062..0.064 rows=1 loops=10240)
>                              InitPlan
>                                ->  Limit  (cost=0.00..0.04 rows=1
> width=4) (actual time=0.047..0.050 rows=1 loops=10240)
>                                      ->  Index Scan using test_id_key
> on test t3  (cost=0.00..121.98 rows=3413 width=4) (actual time=
> 0.038..0.038 rows=1 loops=10240)
>                                            Index Cond: (id > $0)
>                                            Filter: (id IS NOT NULL)
>  Total runtime: 1052.802 ms
> (18 rows)
> Time: 1056.740 ms
>
> postgres=# explain analyze
> postgres-# select
> postgres-#     distinct on (t1.id)
> postgres-#     t1.*, t2.*
> postgres-# from
> postgres-#     test t1
> postgres-#     join test t2 on t2.id > t1.id
> postgres-# order by t1.id asc, t2.id asc;
> Cancel request sent
> ERROR:  canceling statement due to user request
> postgres=#
>
>
>
> On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         OK. Have tried this one.... looks like close to 6 times slower
>         then the
>         'non-standard' phrase with 'distinct on'.
>
>         On the small dataset that I've included in my original post
>         (ten rows of
>         data within TEST), I've run both queries through EXPLAIN
>         ANALYSE, with
>         the following result summary (for clearity, I've cut away the
>         details
>         from EXPLAIN output):
>
>         -----------STANDARD
>         Total runtime: 10.660 ms
>         -----------DISTINCT-ON
>         Total runtime: 1.479 ms
>         -----------
>
>         Would there be ways to optimise the standard query to get the
>         performance closer to the none-standard one?
>
>
>         -R
>
>
>         On Tue, 2007-06-26 at 18:05 +0530, Gurjeet Singh wrote:
>         > Hi Rafal,
>         >
>         >     Just a note that this is not standard SQL... 'distinct
>         on' is an
>         > extension to SQL provided by postgres.
>         >
>         > Following query utilizes the standard SQL to get the same
>         results:
>         >
>         > select    t1.id as id, t2.id as "id+1",
>         >         t1.thread as thread, t2.thread as "thread+1",
>         >         t1.info as info, t2.info as "info+1"
>         > from test as t1, test as t2
>         > where t2.id = ( select min(id) from test as t3 where t3.id >
>         t1.id);
>         >
>         > HTH
>         > --
>         > gurjeet[.singh]@EnterpriseDB.com
>         > singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>         >
>         > 17°29'34.37"N  78°30' 59.76"E - Hyderabad *
>         > 18°32'57.25"N  73°56'25.42 "E - Pune
>         >
>         > Sent from my BlackLaptop device
>         >
>         > On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>         >         Marvelous! Thenx!
>         >
>         >         -R
>         >
>         >         On Tue, 2007-06-26 at 10:06 +0200, hubert depesz
>         lubaczewski
>         >         wrote:
>         >         > On 6/26/07, Rafal Pietrak <
>         rafal@zorro.isa-geek.com> wrote:
>         >         >         Is there an SQL construct to get it?
>         >         >
>         >         > select
>         >         >     distinct on ( t1.id)
>         >         >     t1.*, t2.*
>         >         > from
>         >         >     test t1
>         >         >     join test t2 on t2.id > t1.id
>         >         > order by t1.id asc, t2.id asc
>         >         >
>         >         > should do the trick.
>         >         >
>         >         > depesz
>         >         >
>         >         > --
>         >         > http://www.depesz.com/ - nowy, lepszy depesz
>         >
>         >         ---------------------------(end of
>         >         broadcast)---------------------------
>         >         TIP 4: Have you searched our list archives?
>         >
>         >                        http://archives.postgresql.org/
>         >
>
>
>
> --
> gurjeet[.singh]@EnterpriseDB.com
> singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
> 17°29'34.37"N  78°30'59.76"E - Hyderabad *
> 18°32'57.25"N  73°56'25.42"E - Pune
>
> Sent from my BlackLaptop device

Re: a JOIN on same table, but 'slided over'

From
PFC
Date:
    OK, check...

test=> CREATE TABLE test (id INTEGER PRIMARY KEY);
test=> INSERT INTO test SELECT random()*5 + n*10 FROM
generate_series( 1,100000 ) AS n;
test=> SELECT * FROM test LIMIT 10;
   id
-----
   11
   23
   31
   41
   52
   63
   70
   85
   94
  103

test=> ANALYZE test;
ANALYZE

    -- Self Join 1

test=> EXPLAIN ANALYZE SELECT t1.id AS current_id, t2.id AS next_id
 FROM test t1, test t2
WHERE t2.id = ( SELECT min(id) FROM test AS t3 WHERE t3.id > t1.id )
ORDER BY t1.id;
                                                                         QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=26703.19..26953.19 rows=100000 width=8) (actual
time=5240.392..5271.529 rows=99999 loops=1)
    Sort Key: t1.id
    ->  Hash Join  (cost=2691.00..18398.37 rows=100000 width=8) (actual
time=106.588..5179.737 rows=99999 loops=1)
          Hash Cond: ((subplan) = t2.id)
          ->  Seq Scan on test t1  (cost=0.00..1441.00 rows=100000 width=4)
(actual time=0.013..34.782 rows=100000 loops=1)
          ->  Hash  (cost=1441.00..1441.00 rows=100000 width=4) (actual
time=106.420..106.420 rows=100000 loops=1)
                ->  Seq Scan on test t2  (cost=0.00..1441.00 rows=100000
width=4) (actual time=0.007..43.077 rows=100000 loops=1)
          SubPlan
            ->  Result  (cost=0.03..0.04 rows=1 width=0) (actual
time=0.023..0.023 rows=1 loops=199999)
                  InitPlan
                    ->  Limit  (cost=0.00..0.03 rows=1 width=4) (actual
time=0.021..0.022 rows=1 loops=199999)
                          ->  Index Scan using test_pkey on test t3
(cost=0.00..1029.59 rows=33333 width=4) (actual time=0.020..0.020 rows=1
loops=199999)
                                Index Cond: (id > $0)
                                Filter: (id IS NOT NULL)
  Total runtime: 5295.677 ms

    -- Self Join 2

test=> set enable_hashjoin TO 0;
test=> EXPLAIN ANALYZE SELECT t1.id AS current_id, t2.id AS next_id
 FROM test t1, test t2
WHERE t2.id = ( SELECT min(id) FROM test AS t3 WHERE t3.id > t1.id )
ORDER BY t1.id;
                                                                               QUERY
PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=30806.48..31056.48 rows=100000 width=8) (actual
time=2876.249..2903.011 rows=99999 loops=1)
    Sort Key: t1.id
    ->  Merge Join  (cost=9745.82..22501.66 rows=100000 width=8) (actual
time=2547.830..2820.347 rows=99999 loops=1)
          Merge Cond: (t2.id = "inner"."?column2?")
          ->  Index Scan using test_pkey on test t2  (cost=0.00..2828.26
rows=100000 width=4) (actual time=0.035..67.747 rows=100000 loops=1)
          ->  Sort  (cost=9745.82..9995.82 rows=100000 width=4) (actual
time=2547.779..2582.889 rows=100000 loops=1)
                Sort Key: (subplan)
                ->  Seq Scan on test t1  (cost=0.00..1441.00 rows=100000
width=4) (actual time=0.060..2487.728 rows=100000 loops=1)
                      SubPlan
                        ->  Result  (cost=0.03..0.04 rows=1 width=0)
(actual time=0.023..0.023 rows=1 loops=100000)
                              InitPlan
                                ->  Limit  (cost=0.00..0.03 rows=1 width=4)
(actual time=0.021..0.022 rows=1 loops=100000)
                                      ->  Index Scan using test_pkey on
test t3  (cost=0.00..1029.59 rows=33333 width=4) (actual time=0.020..0.020
rows=1 loops=100000)
                                            Index Cond: (id > $0)
                                            Filter: (id IS NOT NULL)
  Total runtime: 2923.804 ms

    -- DISTINCT ON

test=> EXPLAIN SELECT DISTINCT ON (t1.id) t1.id AS current_id, t2.id AS
next_id
 FROM test t1 JOIN test t2 ON t2.id > t1.id
ORDER BY t1.id, t2.id;
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
  Unique  (cost=729806679.75..746473346.41 rows=100000 width=8)
    ->  Sort  (cost=729806679.75..738140013.08 rows=3333333333 width=8)
          Sort Key: t1.id, t2.id
          ->  Nested Loop  (cost=0.00..100028973.00 rows=3333333333 width=8)
                ->  Seq Scan on test t1  (cost=0.00..1441.00 rows=100000
width=4)
                ->  Index Scan using test_pkey on test t2
(cost=0.00..583.61 rows=33333 width=4)
                      Index Cond: (t2.id > t1.id)
(7 lignes)

    This one takes much longer (I interrupted it).

    -- Using a function

CREATE TYPE test_type AS ( current_id INTEGER, next_id INTEGER );

CREATE OR REPLACE FUNCTION testfunc( )
     RETURNS SETOF test_type
     LANGUAGE plpgsql
     AS
$$
DECLARE
    _row        test_type;
BEGIN
    _row.current_id = NULL;

    FOR _row.next_id IN SELECT id FROM test ORDER BY id LOOP
        IF _row.current_id IS NOT NULL THEN
            RETURN NEXT _row;
        END IF;
        _row.current_id = _row.next_id;
    END LOOP;
END;
$$;

test=> EXPLAIN ANALYZE SELECT * FROM testfunc();
                                                      QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
  Function Scan on testfunc  (cost=0.00..12.50 rows=1000 width=8) (actual
time=211.702..238.322 rows=100000 loops=1)
  Total runtime: 262.369 ms

    Same results, at least 10x faster on large datasets...








Re: a JOIN on same table, but 'slided over'

From
"Gurjeet Singh"
Date:
On 6/26/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"news.gmane.org" <nis@superlativ.dk> writes:
> Gurjeet Singh skrev:
>> Also note that this query is much cheaper that the 'distinct on' query
>> by more than two orders on magnitude ( 217.86 vs. 98040.67):

> No it isn't. The estimate is much lower, but the actual times are very
> close:

> [explain of distinct on]
>> Time: 5.003 ms

> [explain of correlated subquery]
>> Time: 4.125 ms

You're both confused:

???

the planner estimate certainly should not be taken
as gospel,

true

but the actual runtime of an EXPLAIN (not EXPLAIN ANALYZE)
only reflects planning effort.

Agree completely

EXPLAIN ANALYZE output would be a lot more suitable to settle the
question which one is faster.

Agree again. I was using the EXPLAIN output just to make a point that optimizer thinks the query utilizing a subquery is much cheaper (and hence maybe faster) than the 'distinct on' query.

In a later mail I posted the analyze o/p too...

--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76 "E - Hyderabad *
18°32'57.25"N  73°56'25.42"E - Pune

Sent from my BlackLaptop device

Re: a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
Gurjeet,

Focusing on the standars solution, I did some 'exercises' - works fine,
just learning.

But the ambarasing thing is, that I looks like I really don't get it,
meaning - what exactly the internal query does. I've never ever seen or
used a subquery with data/params from 'upper level' query used within a
subquery - any time I've written a hierarchical query (e.g. with
subqueries), the relations were always hierarchical. In other words, I
was always able to run an internal subquery outside of the compound
query and get consistant results. With this one I cannot do that due to
the 'entanglement' of t3 and t1.

Postgress query plan from EXPLAIN doesn't help me here - probably I'm
unable to interpret it correctly without 'a paradigm mind shift'.

So, would you mind commenting a little on how exactly the t1.id
influences subquery (with t3), and the result influences back the
selection of t1 set?

Will greatly apreciate that.

-R

On Tue, 2007-06-26 at 19:14 +0530, Gurjeet Singh wrote:
> I missed the ORDER BY clause... Here it goes:
>
> select    t1.id as id, t2.id as "id+1",
>         t1.thread as thread, t2.thread as "thread+1",
>         t1.info as info, t2.info as "info+1"
> from test as t1, test as t2
> where t2.id = ( select min(id) from test as t3 where t3.id > t1.id )
> order by t1.id asc;
>
> Also note that this query is much cheaper that the 'distinct on' query
> by more than two orders on magnitude ( 217.86 vs. 98040.67):
>
> postgres=# explain
> postgres-# select
> postgres-#     distinct on (t1.id)
> postgres-#     t1.*, t2.*
> postgres-# from
> postgres-#     test t1
> postgres-#     join test t2 on t2.id > t1.id
> postgres-# order by t1.id asc, t2.id asc;
>                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------
>  Unique  (cost=95798.00..98040.67 rows=1160 width=80)
>    ->  Sort  (cost=95798.00..96919.33 rows=448533 width=80)
>          Sort Key: t1.id, t2.id
>          ->  Nested Loop  (cost=0.00..13827.29 rows=448533 width=80)
>                ->  Seq Scan on test t1  (cost=0.00..21.60 rows=1160
> width=40)
>                ->  Index Scan using test_id_key on test t2
> (cost=0.00..7.06 rows=387 width=40)
>                      Index Cond: (t2.id > t1.id)
> (7 rows)
> Time: 5.003 ms
> postgres=# explain
> postgres-# select       t1.id as id, t2.id as "id+1",
> postgres-#              t1.thread as thread, t2.thread as "thread+1",
> postgres-#              t1.info as info, t2.info as "info+1"
> postgres-# from test as t1, test as t2
> postgres-# where t2.id = ( select min(id) from test as t3 where t3.id
> > t1.id )
> postgres-# order by t1.id asc;
>                                                 QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Sort  (cost=214.96..217.86 rows=1160 width=80)
>    Sort Key: t1.id
>    ->  Hash Join  (cost= 36.10..155.92 rows=1160 width=80)
>          Hash Cond: ((subplan) = t2.id)
>          ->  Seq Scan on test t1  (cost=0.00..21.60 rows=1160
> width=40)
>          ->  Hash  (cost=21.60..21.60 rows=1160 width=40)
>                ->  Seq Scan on test t2  (cost=0.00..21.60 rows=1160
> width=40)
>          SubPlan
>            ->  Result  (cost=0.13..0.14 rows=1 width=0)
>                  InitPlan
>                    ->  Limit  (cost= 0.00..0.13 rows=1 width=4)
>                          ->  Index Scan using test_id_key on test t3
> (cost=0.00..51.02 rows=387 width=4)
>                                Index Cond: (id > $0)
>                                Filter: (id IS NOT NULL)
> (14 rows)
> Time: 4.125 ms
>
>
> Best regards,
> --
> gurjeet[.singh]@EnterpriseDB.com
> singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
> 17°29'34.37"N  78°30'59.76"E - Hyderabad *
> 18°32'57.25"N  73°56'25.42"E - Pune
>
> Sent from my BlackLaptop device
>
> On 6/26/07, Gurjeet Singh <singh.gurjeet@gmail.com > wrote:
>         Hi Rafal,
>
>             Just a note that this is not standard SQL... 'distinct on'
>         is an extension to SQL provided by postgres.
>
>         Following query utilizes the standard SQL to get the same
>         results:
>
>         select    t1.id as id, t2.id as "id+1",
>                 t1.thread as thread, t2.thread as "thread+1",
>                 t1.info as info, t2.info as "info+1"
>         from test as t1, test as t2
>         where t2.id = ( select min(id) from test as t3 where t3.id >
>         t1.id);
>
>         HTH
>         --
>         gurjeet[.singh]@EnterpriseDB.com
>         singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
>         17°29'34.37"N  78°30'59.76"E - Hyderabad *
>         18°32'57.25"N  73°56' 25.42 "E - Pune
>
>         Sent from my BlackLaptop device
>
>
>         On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
>                 Marvelous! Thenx!
>
>                 -R
>
>                 On Tue, 2007-06-26 at 10:06 +0200, hubert depesz
>                 lubaczewski wrote:
>                 > On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com>
>                 wrote:
>                 >         Is there an SQL construct to get it?
>                 >
>                 > select
>                 >     distinct on (t1.id)
>                 >     t1.*, t2.*
>                 > from
>                 >     test t1
>                 >     join test t2 on t2.id > t1.id
>                 > order by t1.id asc, t2.id asc
>                 >
>                 > should do the trick.
>                 >
>                 > depesz
>                 >
>                 > --
>                 > http://www.depesz.com/ - nowy, lepszy depesz
>
>                 ---------------------------(end of
>                 broadcast)---------------------------
>                 TIP 4: Have you searched our list archives?
>
>                                http://archives.postgresql.org/
>
>

Re: a JOIN on same table, but 'slided over'

From
Alban Hertroys
Date:
Rafal Pietrak wrote:
> Gurjeet,
>
> Focusing on the standars solution, I did some 'exercises' - works fine,
> just learning.
>
> But the ambarasing thing is, that I looks like I really don't get it,
> meaning - what exactly the internal query does. I've never ever seen or
> used a subquery with data/params from 'upper level' query used within a
> subquery - any time I've written a hierarchical query (e.g. with
> subqueries), the relations were always hierarchical. In other words, I
> was always able to run an internal subquery outside of the compound
> query and get consistant results. With this one I cannot do that due to
> the 'entanglement' of t3 and t1.

This is called a 'correlated subquery'. Basically the subquery is
performed for each record in the top query.

Google gave me this:
http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=/sqlp/rbafycorrs.htm

And there's probably more to find. Interestingly enough wikipedia
doesn't seem to have an article on the subject.

> Postgress query plan from EXPLAIN doesn't help me here - probably I'm
> unable to interpret it correctly without 'a paradigm mind shift'.
>
> So, would you mind commenting a little on how exactly the t1.id
> influences subquery (with t3), and the result influences back the
> selection of t1 set?
>
> Will greatly apreciate that.


--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //

Re: a JOIN on same table, but 'slided over'

From
"Gurjeet Singh"
Date:
On 6/28/07, Alban Hertroys <alban@magproductions.nl> wrote:

This is called a 'correlated subquery'. Basically the subquery is
performed for each record in the top query.

Google gave me this:
http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=/sqlp/rbafycorrs.htm

I think the sub-section titled "Example: Correlated subquery in a WHERE Clause" is appropriate to explain our query at hand.

Simply put, correlated queries are like nested FOR loops of any high level programming language.

1. FOR( record R in result of outer-query )
2.   execute inner query, using any R.colname1
3.   compare R.colname2 with the result of the correlated-subquery
4.   produce R in output, iff the above comparison succeeded

Line 2 can be treated as another FOR loop, where every record of inner-query is being processed, and comparing the local expressions with a column (or expression) that comes from outer query.

The comparison in step 3 can be against any expression, with columns or against a pure constant too!

For example, the following query produces the name of all the employees, who manage at least one other employee.

select empno, ename
from   emp e1
where  exists (select 1
               from   emp e2
               where e2.mgr = e1.empno);

The only thing I would add for our query is that, that the outer SELECT of our query produces a cartesian product (no join-condition between t1 and t2), but only one row from t2 qualifies for the join, since the WHERE condition is on a unique column, and the correlated subquery returns just the required value (lowest of the IDs that are greater than current t1.ID being processed).

I know the above one-line-paragraph may sound a bit cryptic for someone new to correlated subqueries, but if you understand the example in the link above, then this would start making some sense.

And there's probably more to find. Interestingly enough wikipedia
doesn't seem to have an article on the subject.
 



Regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76"E - Hyderabad *
18°32'57.25"N  73°56'25.42"E - Pune

Sent from my BlackLaptop device

Re: a JOIN on same table, but 'slided over'

From
Rafal Pietrak
Date:
Thank you All for this extensive help!

BTW: google helps, once you know that the construct is called
"correlated subquery" - there is no way to get an answer before one
knows the question :)

Thenx again!

-R

On Thu, 2007-06-28 at 23:23 +0530, Gurjeet Singh wrote:
> On 6/28/07, Alban Hertroys <alban@magproductions.nl> wrote:
>
>         This is called a 'correlated subquery'. Basically the subquery
>         is
>         performed for each record in the top query.
>
>         Google gave me this:
>         http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=/sqlp/rbafycorrs.htm
>
> I think the sub-section titled "Example: Correlated subquery in a
> WHERE Clause" is appropriate to explain our query at hand.
>
> Simply put, correlated queries are like nested FOR loops of any high
> level programming language.
>
> 1. FOR( record R in result of outer-query )
> 2.   execute inner query, using any R.colname1
> 3.   compare R.colname2 with the result of the correlated-subquery
> 4.   produce R in output, iff the above comparison succeeded
>
> Line 2 can be treated as another FOR loop, where every record of
> inner-query is being processed, and comparing the local expressions
> with a column (or expression) that comes from outer query.
>
> The comparison in step 3 can be against any expression, with columns
> or against a pure constant too!
>
> For example, the following query produces the name of all the
> employees, who manage at least one other employee.
>
> select empno, ename
> from   emp e1
> where  exists (select 1
>                from   emp e2
>                where e2.mgr = e1.empno);
>
> The only thing I would add for our query is that, that the outer
> SELECT of our query produces a cartesian product (no join-condition
> between t1 and t2), but only one row from t2 qualifies for the join,
> since the WHERE condition is on a unique column, and the correlated
> subquery returns just the required value (lowest of the IDs that are
> greater than current t1.ID being processed).
>
> I know the above one-line-paragraph may sound a bit cryptic for
> someone new to correlated subqueries, but if you understand the
> example in the link above, then this would start making some sense.
>
>
>         And there's probably more to find. Interestingly enough
>         wikipedia
>         doesn't seem to have an article on the subject.
>
>
>
>
>
> Regards,
> --
> gurjeet[.singh]@EnterpriseDB.com
> singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
> 17°29'34.37"N  78°30'59.76"E - Hyderabad *
> 18°32'57.25"N  73°56'25.42"E - Pune
>
> Sent from my BlackLaptop device

Re: a JOIN on same table, but 'slided over'

From
"Gurjeet Singh"
Date:
    It _is_ the optimised version.... as you can see from the explain plans posted in the other mail, the planner shows that the cost is drastically less than the 'distinct on' version.

    For smaller data-sets 'distinct-on' version might seem faster, but for reasonably larger datasets, it's performance deteriorates exponentially... This is because of the Nested-loops involved in the plan...

    I increased your data-set to 10240 rows by executing the following query 10 times:

insert into test select id+(select max(id) from test), thread, info from test;

    On such data-set (which is not very large by any means), the standard SQL version executes in almost a second, and on the other hand, I had to cancel the EXPLAIN ANALYZE of the 'distinct on' query after letting it run for over three minutes!!!

postgres=# explain analyze
postgres-# select       t1.id as id, t2.id as "id+1",
postgres-#              t1.thread as thread, t2.thread as "thread+1",
postgres-#              t1.info as info, t2.info as "info+1"
postgres-# from test as t1, test as t2
postgres-# where t2.id = ( select min(id) from test as t3 where t3.id > t1.id )
postgres-# order by t1.id asc;
                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2971.36..2996.96 rows=10240 width=24) (actual time=1004.031..1030.116 rows=10239 loops=1)
   Sort Key: t1.id
   Sort Method:  external sort  Disk: 416kB
   ->  Merge Join  (cost=840.48..2289.28 rows=10240 width=24) (actual time=834.218..956.595 rows=10239 loops=1)
         Merge Cond: (t2.id = ((subplan)))
         ->  Index Scan using test_id_key on test t2  (cost=0.00..332.85 rows=10240 width=12) (actual time=0.060..24.503 rows=10240 loops=1)
         ->  Sort  (cost=840.48..866.08 rows=10240 width=12) (actual time=834.129..854.776 rows=10240 loops=1)
               Sort Key: ((subplan))
               Sort Method:  quicksort  Memory: 928kB
               ->  Seq Scan on test t1  (cost=0.00..158.40 rows=10240 width=12)(actual time=0.196..797.752 rows=10240 loops=1)
                     SubPlan
                       ->  Result  (cost= 0.04..0.05 rows=1 width=0) (actual time=0.062..0.064 rows=1 loops=10240)
                             InitPlan
                               ->  Limit  (cost=0.00..0.04 rows=1 width=4) (actual time=0.047..0.050 rows=1 loops=10240)
                                     ->  Index Scan using test_id_key on test t3  (cost=0.00..121.98 rows=3413 width=4) (actual time= 0.038..0.038 rows=1 loops=10240)
                                           Index Cond: (id > $0)
                                           Filter: (id IS NOT NULL)
 Total runtime: 1052.802 ms
(18 rows)
Time: 1056.740 ms

postgres=# explain analyze
postgres-# select
postgres-#     distinct on (t1.id)
postgres-#     t1.*, t2.*
postgres-# from
postgres-#     test t1
postgres-#     join test t2 on t2.id > t1.id
postgres-# order by t1.id asc, t2.id asc;
Cancel request sent
ERROR:  canceling statement due to user request
postgres=#



On 6/26/07, Rafal Pietrak <rafal@zorro.isa-geek.com> wrote:
OK. Have tried this one.... looks like close to 6 times slower then the
'non-standard' phrase with 'distinct on'.

On the small dataset that I've included in my original post (ten rows of
data within TEST), I've run both queries through EXPLAIN ANALYSE, with
the following result summary (for clearity, I've cut away the details
from EXPLAIN output):

-----------STANDARD
Total runtime: 10.660 ms
-----------DISTINCT-ON
Total runtime: 1.479 ms
-----------

Would there be ways to optimise the standard query to get the
performance closer to the none-standard one?


-R


On Tue, 2007-06-26 at 18:05 +0530, Gurjeet Singh wrote:
> Hi Rafal,
>
>     Just a note that this is not standard SQL... 'distinct on' is an
> extension to SQL provided by postgres.
>
> Following query utilizes the standard SQL to get the same results:
>
> select    t1.id as id, t2.id as "id+1",
>         t1.thread as thread, t2.thread as "thread+1",
>         t1.info as info, t2.info as "info+1"
> from test as t1, test as t2
> where t2.id = ( select min(id) from test as t3 where t3.id > t1.id);
>
> HTH
> --
> gurjeet[.singh]@EnterpriseDB.com
> singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com
>
> 17°29'34.37"N  78°30' 59.76"E - Hyderabad *
> 18°32'57.25"N  73°56'25.42 "E - Pune
>
> Sent from my BlackLaptop device
>
> On 6/26/07, Rafal Pietrak < rafal@zorro.isa-geek.com> wrote:
>         Marvelous! Thenx!
>
>         -R
>
>         On Tue, 2007-06-26 at 10:06 +0200, hubert depesz lubaczewski
>         wrote:
>         > On 6/26/07, Rafal Pietrak < rafal@zorro.isa-geek.com> wrote:
>         >         Is there an SQL construct to get it?
>         >
>         > select
>         >     distinct on ( t1.id)
>         >     t1.*, t2.*
>         > from
>         >     test t1
>         >     join test t2 on t2.id > t1.id
>         > order by t1.id asc, t2.id asc
>         >
>         > should do the trick.
>         >
>         > depesz
>         >
>         > --
>         > http://www.depesz.com/ - nowy, lepszy depesz
>
>         ---------------------------(end of
>         broadcast)---------------------------
>         TIP 4: Have you searched our list archives?
>
>                        http://archives.postgresql.org/
>



--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

17°29'34.37"N  78°30'59.76"E - Hyderabad *
18°32'57.25"N  73°56'25.42"E - Pune

Sent from my BlackLaptop device