Thread: Research/Implementation of Nested Loop Join optimization

Research/Implementation of Nested Loop Join optimization

From

"Manoel Henrique"

Date:

23 July 2008, 17:17:07

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an Joins, and we`d like to implement an optimization on the Nested Loop Join, this optimization consists on while scanning the inner table, the iteration would go from up-down then backwards(down-up) to take advantage of the buffer pages in memory.

We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes and keys to make it this way). The research objective is to show some students how a DBMS works.

Does PostgreSQL already works this way?

Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,

Manoel Henrique Souza.

Re: Research/Implementation of Nested Loop Join optimization

From

"Dann Corbit"

Date:

23 July 2008, 17:23:38

When you install the source tree (e.g. in folder \postgresql-8.3.x) you will want to examine nodeMergejoin.c typically found in a path similar to this:

\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c

Here are the comments from the version on my machine:

* INTERFACE ROUTINES

* ExecMergeJoin mergejoin outer and inner relations.

* ExecInitMergeJoin creates and initializes run time states

* ExecEndMergeJoin cleans up the node.

* NOTES

* Merge-join is done by joining the inner and outer tuples satisfying

* join clauses of the form ((= outerKey innerKey) ...).

* The join clause list is provided by the query planner and may contain

* more than one (= outerKey innerKey) clause (for composite sort key).

* However, the query executor needs to know whether an outer

* tuple is "greater/smaller" than an inner tuple so that it can

* "synchronize" the two relations. For example, consider the following

* relations:

* outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1

* inner: (1 ^3 5 5 5 5 6) current tuple: 3

* To continue the merge-join, the executor needs to scan both inner

* and outer relations till the matching tuples 5. It needs to know

* that currently inner tuple 3 is "greater" than outer tuple 1 and

* therefore it should scan the outer relation first to find a

* matching tuple and so on.

* Therefore, rather than directly executing the merge join clauses,

* we evaluate the left and right key expressions separately and then

* compare the columns one at a time (see MJCompare). The planner

* passes us enough information about the sort ordering of the inputs

* to allow us to determine how to make the comparison. We may use the

* appropriate btree comparison function, since Postgres' only notion

* of ordering is specified by btree opfamilies.

* Consider the above relations and suppose that the executor has

* just joined the first outer "5" with the last inner "5". The

* next step is of course to join the second outer "5" with all

* the inner "5's". This requires repositioning the inner "cursor"

* to point at the first inner "5". This is done by "marking" the

* first inner 5 so we can restore the "cursor" to it before joining

* with the second outer 5. The access method interface provides

* routines to mark and restore to a tuple.

* Essential operation of the merge join algorithm is as follows:

* Join {

* get initial outer and inner tuples INITIALIZE

* do forever {

* while (outer != inner) { SKIP_TEST

* if (outer < inner)

* advance outer SKIPOUTER_ADVANCE

* else

* advance inner SKIPINNER_ADVANCE

* }

* mark inner position SKIP_TEST

* do forever {

* while (outer == inner) {

* join tuples JOINTUPLES

* advance inner position NEXTINNER

* }

* advance outer position NEXTOUTER

* if (outer == mark) TESTOUTER

* restore inner position to mark TESTOUTER

* else

* break // return to top of outer loop

* }

* The merge join operation is coded in the fashion

* of a state machine. At each state, we do something and then

* proceed to another state. This state is stored in the node's

* execution state information and is preserved across calls to

* ExecMergeJoin. -cim 10/31/89

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Research/Implementation of Nested Loop Join optimization

We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes and keys to make it this way). The research objective is to show some students how a DBMS works.

Does PostgreSQL already works this way?

Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,

Manoel Henrique Souza.

Re: Research/Implementation of Nested Loop Join optimization

From

"Manoel Henrique"

Date:

23 July 2008, 17:48:40

The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to find a way to change the Nested Loop Join, It would be more like on nodeNestloop.c when rescanning the inner plan, (second time scanning the inner plan and so on) he`d change the scan direction, If the scan direction was from first tuple to last tuple it would go backwards, if it was from last to first it would go forward... The code I`m looking atm is from 8.3.1 , seems to have some kind of direction manager but doesn`t seems to be in use.

--Manoel

On Wed, Jul 23, 2008 at 5:23 PM, Dann Corbit <DCorbit@connx.com> wrote:

When you install the source tree (e.g. in folder \postgresql-8.3.x) you will want to examine nodeMergejoin.c typically found in a path similar to this:
\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c

Here are the comments from the version on my machine:

/*
* INTERFACE ROUTINES
*                            ExecMergeJoin                                 mergejoin outer and inner relations.
*                            ExecInitMergeJoin                          creates and initializes run time states
*                            ExecEndMergeJoin                         cleans up the node.
*
* NOTES
*
*                            Merge-join is done by joining the inner and outer tuples satisfying
*                            join clauses of the form ((= outerKey innerKey) ...).
*                            The join clause list is provided by the query planner and may contain
*                            more than one (= outerKey innerKey) clause (for composite sort key).
*
*                            However, the query executor needs to know whether an outer
*                            tuple is "greater/smaller" than an inner tuple so that it can
*                            "synchronize" the two relations. For example, consider the following
*                            relations:
*
*                                                            outer: (0 ^1 1 2 5 5 5 6 6 7)            current tuple: 1
*                                                            inner: (1 ^3 5 5 5 5 6)                                       current tuple: 3
*
*                            To continue the merge-join, the executor needs to scan both inner
*                            and outer relations till the matching tuples 5. It needs to know
*                            that currently inner tuple 3 is "greater" than outer tuple 1 and
*                            therefore it should scan the outer relation first to find a
*                            matching tuple and so on.
*
*                            Therefore, rather than directly executing the merge join clauses,
*                            we evaluate the left and right key expressions separately and then
*                            compare the columns one at a time (see MJCompare). The planner
*                            passes us enough information about the sort ordering of the inputs
*                            to allow us to determine how to make the comparison. We may use the
*                            appropriate btree comparison function, since Postgres' only notion
*                            of ordering is specified by btree opfamilies.
*
*
*                            Consider the above relations and suppose that the executor has
*                            just joined the first outer "5" with the last inner "5". The
*                            next step is of course to join the second outer "5" with all
*                            the inner "5's". This requires repositioning the inner "cursor"
*                            to point at the first inner "5". This is done by "marking" the
*                            first inner 5 so we can restore the "cursor" to it before joining
*                            with the second outer 5. The access method interface provides
*                            routines to mark and restore to a tuple.
*
*
*                            Essential operation of the merge join algorithm is as follows:
*
*                            Join {
*                                            get initial outer and inner tuples                                                               INITIALIZE
*                                            do forever {
*                                                            while (outer != inner) {                                                                  SKIP_TEST
*                                                                            if (outer < inner)
*                                                                                            advance outer                                                                                   SKIPOUTER_ADVANCE
*                                                                            else
*                                                                                            advance inner                                                                                   SKIPINNER_ADVANCE
*                                                            }
*                                                            mark inner position                                                                                                         SKIP_TEST
*                                                            do forever {
*                                                                            while (outer == inner) {
*                                                                                            join tuples                                                                                                           JOINTUPLES
*                                                                                            advance inner position                                                  NEXTINNER
*                                                                            }
*                                                                            advance outer position                                                                 NEXTOUTER
*                                                                            if (outer == mark)                                                                                            TESTOUTER
*                                                                                            restore inner position to mark                   TESTOUTER
*                                                                            else
*                                                                                            break    // return to top of outer loop
*                                                            }
*                                            }
*                            }
*
*                            The merge join operation is coded in the fashion
*                            of a state machine. At each state, we do something and then
*                            proceed to another state. This state is stored in the node's
*                            execution state information and is preserved across calls to
*                            ExecMergeJoin. -cim 10/31/89
*/

From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Research/Implementation of Nested Loop Join optimization

Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an Joins, and we`d like to implement an optimization on the Nested Loop Join, this optimization consists on while scanning the inner table, the iteration would go from up-down then backwards(down-up) to take advantage of the buffer pages in memory.
We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes and keys to make it this way). The research objective is to show some students how a DBMS works.

Does PostgreSQL already works this way?
Is it possible to implement such thing? Is it easy? how hard?

Thank you in advance,
Manoel Henrique Souza.

Re: Research/Implementation of Nested Loop Join optimization

From

"Dann Corbit"

Date:

23 July 2008, 19:50:46

From: Manoel Henrique [mailto:mhenriquesgbd@gmail.com]
Sent: Wednesday, July 23, 2008 1:47 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Research/Implementation of Nested Loop Join optimization

You are right. You want file: nodeNestloop.c

Re: Research/Implementation of Nested Loop Join optimization

From

Tom Lane

Date:

24 July 2008, 02:49:24

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
> The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
> find a way to change the Nested Loop Join, It would be more like on
> nodeNestloop.c when rescanning the inner plan, (second time scanning the
> inner plan and so on) he`d change the scan direction, If the scan direction
> was from first tuple to last tuple it would go backwards, if it was from
> last to first it would go forward... The code I`m looking atm is from 8.3.1
> , seems to have some kind of direction manager but doesn`t seems to be in
> use.

I find this a bit dubious.  If the inner rel is small enough to fit in
memory then it buys nothing.  If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory.  In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.

Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan).  You'd need to check
what the inner input node was before trying this.
        regards, tom lane

Re: Research/Implementation of Nested Loop Join optimization

From

"Manoel Henrique"

Date:

25 July 2008, 16:06:15

Yes, I'm relying on the assumption that backwards scan has the same cost as forward scan, why shouldn't it?

Yet, all plan node types we are testing works with backwards scan (looking on ExecSupportsBackwardScan). But, is there a easy way to make a query execute only in backwards scan? How we can do that? Our first objective is to make a backwards scan only and then test a forward-and-backward scan.

-- Manoel

On Thu, Jul 24, 2008 at 2:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
> The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
> find a way to change the Nested Loop Join, It would be more like on
> nodeNestloop.c when rescanning the inner plan, (second time scanning the
> inner plan and so on) he`d change the scan direction, If the scan direction
> was from first tuple to last tuple it would go backwards, if it was from
> last to first it would go forward... The code I`m looking atm is from 8.3.1
> , seems to have some kind of direction manager but doesn`t seems to be in
> use.

I find this a bit dubious. If the inner rel is small enough to fit in
memory then it buys nothing. If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory. In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.

Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan). You'd need to check
what the inner input node was before trying this.

regards, tom lane

Re: Research/Implementation of Nested Loop Join optimization

From

Gregory Stark

Date:

25 July 2008, 19:11:07

"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:

> Yes, I'm relying on the assumption that backwards scan has the same cost as
> forward scan, why shouldn't it?

Because hard drives only spin one direction

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!

Re: Research/Implementation of Nested Loop Join optimization

From

"Jonah H. Harris"

Date:

25 July 2008, 20:31:15

On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
>
>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>> forward scan, why shouldn't it?
>
> Because hard drives only spin one direction

:)

-- 
Jonah H. Harris, Senior DBA
myYearbook.com

Re: Research/Implementation of Nested Loop Join optimization

From

"Joshua D. Drake"

Date:

25 July 2008, 20:35:20

On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:
> On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
> >
> >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> >> forward scan, why shouldn't it?
> >
> > Because hard drives only spin one direction
> 
> :)

What if you are below the equator?

> -- 
> Jonah H. Harris, Senior DBA
> myYearbook.com
> 
-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

Re: Research/Implementation of Nested Loop Join optimization

From

Alvaro Herrera

Date:

25 July 2008, 21:17:23

Joshua D. Drake escribió:
> On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote:
> > On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote:
> > > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
> > >
> > >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> > >> forward scan, why shouldn't it?
> > >
> > > Because hard drives only spin one direction
> > 
> > :)
> 
> What if you are below the equator?

They spin the same direction here too, thanks :-)  (Coriolis does not
affect much in this case)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Research/Implementation of Nested Loop Join optimization

From

Tom Lane

Date:

25 July 2008, 21:28:55

Gregory Stark <stark@enterprisedb.com> writes:
> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>> forward scan, why shouldn't it?

> Because hard drives only spin one direction

Good joke, but to be serious: we expect that forward scans will result
in the kernel doing read-ahead, which will allow overlapping of
CPU work to process one page with the I/O to bring in the next page.
A backwards scan will get no such overlapping and thus be up to 2X
slower, unless the kernel is smart enough to do read-ahead for
descending-order read requests.  Which seems not too probable.  A fairly
typical kernel behavior is that read-ahead is triggered by successive
read() requests without any intervening seek(), and this is impossible
for a backward scan.

(Yes, we do optimize out the seek calls in a forward scan.  IIRC it's
done in fd.c.)
        regards, tom lane

Re: Research/Implementation of Nested Loop Join optimization

From

Alvaro Herrera

Date:

25 July 2008, 21:39:59

Tom Lane escribió:
> Gregory Stark <stark@enterprisedb.com> writes:
> > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
> >> Yes, I'm relying on the assumption that backwards scan has the same cost as
> >> forward scan, why shouldn't it?
> 
> > Because hard drives only spin one direction
> 
> Good joke, but to be serious: we expect that forward scans will result
> in the kernel doing read-ahead, which will allow overlapping of
> CPU work to process one page with the I/O to bring in the next page.

I wonder if this is spoiled (or rather, the backwards case fixed) by the
attempts to call posix_fadvise() on certain types of scan.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Research/Implementation of Nested Loop Join optimization

From

Tom Lane

Date:

25 July 2008, 21:45:01

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane escribi�:
>> Good joke, but to be serious: we expect that forward scans will result
>> in the kernel doing read-ahead, which will allow overlapping of
>> CPU work to process one page with the I/O to bring in the next page.

> I wonder if this is spoiled (or rather, the backwards case fixed) by the
> attempts to call posix_fadvise() on certain types of scan.

Yeah, I started wondering about that too after sending off the above.
The fadvise patch might eliminate the distinction ... on platforms where
fadvise exists and actually works well.
        regards, tom lane

Re: Research/Implementation of Nested Loop Join optimization

From

Ron Mayer

Date:

26 July 2008, 02:28:11

Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
>>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>>> forward scan, why shouldn't it?
> 
> G...we expect that forward scans will result
> in the kernel doing read-ahead, ...
> A backwards scan will get no such overlapping and thus be up to 2X
> slower, unless the kernel is smart enough to do read-ahead for
> descending-order read requests.  Which seems not too probable. 

Linux's old adaptive readahead patches claimed to[1]:  It also have methods to detect some less common cases:  -
readingbackward"
 
Interestingly the author of that patch used postgres as the example
application that benefits from the patch (30%).

I'm not sure if the backward reading feature got kept
in the simplified on-demand readahead that seems to have
superseded the adaptive readahead stuff in 2.6.23[2].

[1] http://lwn.net/Articles/185469/
[2] http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

Re: Research/Implementation of Nested Loop Join optimization

From

Tom Lane

Date:

26 July 2008, 02:38:36

Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Tom Lane wrote:
>> A backwards scan will get no such overlapping and thus be up to 2X
>> slower, unless the kernel is smart enough to do read-ahead for
>> descending-order read requests.  Which seems not too probable. 

> Linux's old adaptive readahead patches claimed to[1]:

I didn't say that there were *no* platforms that could do it.
        regards, tom lane

Re: Research/Implementation of Nested Loop Join optimization

From

Gregory Stark

Date:

26 July 2008, 07:29:26

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes:
>>> Yes, I'm relying on the assumption that backwards scan has the same cost as
>>> forward scan, why shouldn't it?
>
>> Because hard drives only spin one direction
>
> Good joke, but to be serious: we expect that forward scans will result
> in the kernel doing read-ahead, which will allow overlapping of
> CPU work to process one page with the I/O to bring in the next page.

Well it wasn't a joke but you're right that it's not the whole picture. But
then neither is considering interleaving of I/O with CPU work.

Hard drives spin in a particular direction which means if you stream I/O
requests to them in that direction you can stream data off the hard drive as
fast as it passes under the read head. That's going to be 50-60MB/s for a
single modern 7200rpm drive.

On the other hand if you send an I/O request for the previous block then you
have to wait a whole rotation before it passes under the head. On a 7200rpm
drive that's over 8ms which is a *lot* of CPU work to interleave. The most
bandwidth you'll be able to get is under 1MB/s.

However there's another reason fadvise helps -- the kernel or the drive gets a
chance to reorder the I/O. If we read-ahead a whole track's worth of I/O
backwards then during the first 8ms latency the kernel has a chance to notice
that it should reorder the queued up requests and do them the right way
around.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!